bunny-therapist opened 3 weeks ago
@xamgore
I have not been able to fix this in my branch because I don't understand the Occurrence data structure (I don't know what the fields mean), and I am not good enough at Rust. I am trying, but it is hard.
With the fixes I have so far in my fork, most results are identical to LIAAD/yake, but the sample text used for testing (about Google and Kaggle) only matches LIAAD/yake if I comment out the latter half of the text. With the full text, the results differ, and the difference comes from TF avg and TF std, which impact frequency and relatedness. I compared the valid TFs used to calculate TF avg and TF std between LIAAD/yake and yake-rust, and found this discrepancy.
let occurrence = Occurrence {
// the word itself
    word,
    // sentence index (0, 1, 2, ...) where the occurrence is
idx,
// the total number of words in all previous sentences
shift,
// ordinal number of the word in the source text after splitting into sentences
shift_offset: shift + w_idx,
};
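To make the field meanings concrete, here is a minimal sketch (not the actual yake-rust code) of how `Occurrence` values would be populated while walking a text split into sentences; the iteration logic and the `occurrences` helper are assumptions for illustration:

```rust
// Hypothetical sketch: how the fields of `Occurrence` relate to a text
// that has been split into sentences of words.
struct Occurrence<'a> {
    word: &'a str,       // the word itself
    idx: usize,          // sentence index (0, 1, 2, ...)
    shift: usize,        // total number of words in all previous sentences
    shift_offset: usize, // shift + position of the word within its sentence
}

fn occurrences<'a>(sentences: &[Vec<&'a str>]) -> Vec<Occurrence<'a>> {
    let mut out = Vec::new();
    let mut shift = 0; // running count of words in all previous sentences
    for (idx, sentence) in sentences.iter().enumerate() {
        for (w_idx, word) in sentence.iter().copied().enumerate() {
            out.push(Occurrence { word, idx, shift, shift_offset: shift + w_idx });
        }
        shift += sentence.len();
    }
    out
}

fn main() {
    let sentences = vec![vec!["good", "morning"], vec!["hello", "world"]];
    let occs = occurrences(&sentences);
    // "world" is in sentence 1; two words precede its sentence, so
    // shift = 2 and shift_offset = 2 + 1 = 3.
    let last = occs.last().unwrap();
    println!("{} idx={} shift={} shift_offset={}", last.word, last.idx, last.shift, last.shift_offset);
}
```

So `shift_offset` is effectively the word's ordinal position in the whole text, while `shift` marks where its sentence starts.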
assertion `left == right` failed
left: [ResultItem { raw: "this", keyword: "thi", score: 0.1583 }]
right: [ResultItem { raw: "keyword", keyword: "keyword", score: 0.1583 }]
🫠 normalization process isn't perfect, but ok.
Isn't "this" a stopword?
Ah, you must be checking if it is a stopword after removing the "s".
Just to check, I applied to_single to all the stopwords as well (probably not a good idea as an actual solution), and that fixed that problem: the "thi"/"this" issue goes away and we are back to 3 passing tests.
Sadly, the google test still fails. I can troubleshoot that more deeply tomorrow.
@bunny-therapist The reason why this project is a disaster is precisely that Rust is hard, so don't feel bad. I used this project as a means to learn it.
Thanks. Yeah, Rust is fair - but hard.
I've found out that the stopwords file for English differs from the one in the LIAAD repo.
I'll switch to theirs for now; after stabilization we can bring ours back.
That's probable. I do remember using what I thought was a more complete list.
When I test for match with LIAAD, I always use the LIAAD stopwords, so I don't think we need to worry about them at the moment.
I have been testing the plural-normalization code, and there is definitely a bug with stopwords. You get "this" as a keyword even though it is a stopword, because you are comparing against "thi". I applied to_single to all of the stopwords, and then I got a problem with the word "give": "gives" is a stopword, to_single turns it into "give", and thus "give" gets removed from the final keywords. When I manually accounted for this by not changing the stopword "gives" specifically, the plural normalization appeared to improve the results. However, this is of course not a valid solution.
So the problem is that stopwords need to be checked against the actual word, but when we calculate TFs our words need to be plural-normalized. Maybe we should just do the normalization in certain places, like when building and fetching contexts? I am not sure, but this is a bug.
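The two failure modes above can be reproduced in a few lines. This is a hypothetical sketch, not the yake-rust code: `to_single` here is a stand-in "strip trailing s" normalizer, assumed to behave like the real one for these inputs:

```rust
use std::collections::HashSet;

// Stand-in for yake-rust's to_single: strip a trailing "s" from words
// longer than 3 characters (an assumption for this sketch).
fn to_single(word: &str) -> String {
    if word.len() > 3 && word.ends_with('s') {
        word[..word.len() - 1].to_string()
    } else {
        word.to_string()
    }
}

fn main() {
    let stopwords: HashSet<&str> = ["this", "gives"].into_iter().collect();

    // Bug 1: comparing a normalized candidate against raw stopwords lets
    // "this" slip through, because "thi" is not in the stopword set.
    assert!(!stopwords.contains(to_single("this").as_str()));

    // Bug 2: normalizing the stopwords instead turns "gives" into "give",
    // which then wrongly filters the legitimate keyword "give".
    let normalized: HashSet<String> = stopwords.iter().map(|&w| to_single(w)).collect();
    assert!(normalized.contains("give"));
}
```

Whichever side is normalized, one of the two words is handled wrong, which is why normalizing only in specific places (TF/context computation) rather than globally looks like the way out.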
It appears that for the google text, we are returning "competitions" as ResultItem.raw, but LIAAD/yake is returning "competition".
Comparing to ResultItem.keyword is even worse, because those are lowercase, and LIAAD/yake returns e.g. "Google".
Plural-normalizing our output does not work either, because then we output "Venture", but LIAAD/yake outputs "Ventures" (possibly because "Venture" never appears in the text?).
I am not entirely sure how LIAAD/yake decided what to output.
Currently, if I
then I at least get very close to LIAAD/yake. Except for the cases I have filed bug reports for here, I get identical scores etc.
LIAAD/yake has special handling of plural here: https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L148
yake-rust lacks this logic. Compare a few of the valid terms and their TFs (from the part of the sample text about Kaggle and Google) between LIAAD/yake and yake-rust:
LIAAD/yake has these extra: {(1.0, 'communitie'), (1.0, '100,000'), (1.0, 'competitor'), (1.0, 'detail'), (1.0, 'integration'), (1.0, 'source'), (4.0, 'competition'), (2.0, 'scientist'), (2.0, 'host'), (1.0, 'co-founder'), (1.0, 'rumor'), (1.0, 'video'), (1.0, 'project')}
yake-rust has these extra: {(1.0, 'details'), (2.0, 'scientists'), (1.0, 'de'), (1.0, 'rumors'), (2.0, 'competitions'), (1.0, 'sources'), (1.0, 'videos'), (1.0, 'founder'), (1.0, 'competitors'), (1.0, 'projects'), (2.0, 'competition'), (1.0, 'communities'), (1.0, 'hosts'), (1.0, '100000'), (1.0, 'integrations'), (1.0, 'host')}
As can be seen, LIAAD/yake trims the trailing "s" from a word if the word has length > 3 and ends with an "s". Therefore it has terms like "integration" instead of "integrations", and it presumably also counts "integration" and "integrations" as the same term. This affects the result.
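As a sketch of what porting that logic might look like, here is the rule inferred from the term lists above (trim a trailing "s" when length > 3, and let singular and plural collapse into one TF bucket); the function names are hypothetical, and this is only my reading of LIAAD/yake's datarepresentation.py, not its actual code:

```rust
use std::collections::HashMap;

// Inferred plural rule: drop a trailing "s" when the word is longer than
// 3 characters, so "integrations" -> "integration" but "his" stays "his".
fn plural_normalize(word: &str) -> &str {
    if word.len() > 3 && word.ends_with('s') {
        &word[..word.len() - 1]
    } else {
        word
    }
}

// Count term frequencies over normalized words, so singular and plural
// forms accumulate into the same entry.
fn term_frequencies<'a>(words: &[&'a str]) -> HashMap<&'a str, f64> {
    let mut tf = HashMap::new();
    for &w in words {
        *tf.entry(plural_normalize(w)).or_insert(0.0) += 1.0;
    }
    tf
}

fn main() {
    let tf = term_frequencies(&["integration", "integrations", "hosts", "host"]);
    // "integration"/"integrations" and "host"/"hosts" each collapse
    // into a single term with TF 2.0, matching the LIAAD/yake lists above.
    assert_eq!(tf["integration"], 2.0);
    assert_eq!(tf["host"], 2.0);
}
```

Note that this only explains the plural entries in the diff; the "de"/"100,000"/"co-founder" differences look like separate tokenization issues.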