KinamSalad closed this issue 4 years ago
"kr-wordrank" uses HITS, a graph ranking algorithm similar to PageRank. The variable "r" is the rank of the word.
The sum of the ranks of all nodes in the graph is always fixed at 100, but the sum of the ranks of only the top-ranked words can change.
The scale of the rank also depends on the number of nodes: the more nodes share the fixed total, the smaller each individual rank becomes. Therefore, to compare ranks across graphs, you need to calibrate the scales by multiplying each rank value by the number of nodes.
For example,
# Scale each domain's ranks by that domain's own keyword count.
domain1_keywords, _, _ = wordrank_extractor.extract(domain1_texts)
n_keywords1 = len(domain1_keywords)
domain1_keywords = {k: r * n_keywords1 for k, r in domain1_keywords.items()}
domain2_keywords, _, _ = wordrank_extractor.extract(domain2_texts)
n_keywords2 = len(domain2_keywords)
domain2_keywords = {k: r * n_keywords2 for k, r in domain2_keywords.items()}
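To see why the multiplication helps, here is a minimal sketch using made-up rank values rather than real kr-wordrank output. It assumes, as stated above, that the ranks in each graph sum to 100. The helper name rescale_ranks and both toy dictionaries are hypothetical and only illustrate the arithmetic.

```python
def rescale_ranks(ranks):
    """Multiply every rank by the number of nodes in its graph.

    Hypothetical helper: after scaling, the mean rank equals the fixed
    total (100 here) regardless of how many nodes the graph has.
    """
    n_nodes = len(ranks)
    return {word: r * n_nodes for word, r in ranks.items()}


# Made-up ranks; each dictionary sums to 100 but has a different node count.
small_graph = {"a": 40.0, "b": 30.0, "c": 20.0, "d": 10.0}
large_graph = {"a": 25.0, "b": 20.0, "c": 15.0, "d": 10.0,
               "e": 10.0, "f": 10.0, "g": 5.0, "h": 5.0}

small_scaled = rescale_ranks(small_graph)
large_scaled = rescale_ranks(large_graph)

# Raw ranks suggest "a" matters more in the small graph (40 vs 25), but
# relative to the average node it is stronger in the large one.
print(small_scaled["a"], large_scaled["a"])
```

In this toy data the raw rank of "a" is higher in the smaller graph only because fewer nodes share the same total; after scaling, each value is expressed relative to the average node of its own graph.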
I hope this answer is helpful to you.
I think this issue is no longer active, so I am closing it.
for word, r in sorted(keywords.items(), key=lambda x:x[1], reverse=True)[:30]:
In this line, I can see the 'r' that comes from keywords.items(). What does r mean? The sum of the r values is not constant, and it does not seem to match the vocabulary size.
I want to use r as an indicator to compare the results from two different domains.
Thank you