bunny-therapist opened 3 weeks ago
@xamgore
I have not been able to fix this in my branch because I don't understand the Occurrence data structure (I don't know what the fields mean), and I am not good enough at Rust. I am trying, but it is hard.
With the fixes I have so far in my fork, most results are identical to LIAAD/yake, but the sample text used for testing (about Google and Kaggle) only matches LIAAD/yake if I comment out the latter half of the text. With the full text, the results differ, and the difference comes from TF avg and TF std, which impact frequency and relatedness. I compared the valid TFs used to calculate TF avg and TF std between LIAAD/yake and yake-rust, and found this discrepancy.
let occurrence = Occurrence {
// the word itself
    word,
    // sentence index (0, 1, 2, ...) where the occurrence is
idx,
// the total number of words in all previous sentences
shift,
// ordinal number of the word in the source text after splitting into sentences
shift_offset: shift + w_idx,
};
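To make the field meanings concrete, here is a minimal sketch (not the actual yake-rust code) of how `Occurrence` values would be populated while walking a text split into sentences; the iteration logic and the `occurrences` helper are assumptions for illustration:

```rust
// Hypothetical sketch: how the fields of `Occurrence` relate to a text
// that has been split into sentences of words.
struct Occurrence<'a> {
    word: &'a str,       // the word itself
    idx: usize,          // sentence index (0, 1, 2, ...)
    shift: usize,        // total number of words in all previous sentences
    shift_offset: usize, // shift + position of the word within its sentence
}

fn occurrences<'a>(sentences: &[Vec<&'a str>]) -> Vec<Occurrence<'a>> {
    let mut out = Vec::new();
    let mut shift = 0; // running count of words in all previous sentences
    for (idx, sentence) in sentences.iter().enumerate() {
        for (w_idx, word) in sentence.iter().copied().enumerate() {
            out.push(Occurrence { word, idx, shift, shift_offset: shift + w_idx });
        }
        shift += sentence.len();
    }
    out
}

fn main() {
    let sentences = vec![vec!["good", "morning"], vec!["hello", "world"]];
    let occs = occurrences(&sentences);
    // "world" is in sentence 1; two words precede its sentence, so
    // shift = 2 and shift_offset = 2 + 1 = 3.
    let last = occs.last().unwrap();
    println!("{} idx={} shift={} shift_offset={}", last.word, last.idx, last.shift, last.shift_offset);
}
```

So `shift_offset` is effectively the word's ordinal position in the whole text, while `shift` marks where its sentence starts.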
assertion `left == right` failed
left: [ResultItem { raw: "this", keyword: "thi", score: 0.1583 }]
right: [ResultItem { raw: "keyword", keyword: "keyword", score: 0.1583 }]
🫠 normalization process isn't perfect, but ok.
Isn't "this" a stopword?
Ah, you must be checking if it is a stopword after removing the "s".
Just to check, I applied to_single to all the stopwords as well (probably not a good idea as an actual solution), and that fixed that problem: the "thi"/"this" issue goes away and we are back to 3 passing tests.
Sadly, the google test still fails. I can troubleshoot that more deeply tomorrow.
@bunny-therapist The reason why this project is a disaster is precisely that Rust is hard, so don't feel bad. I used this project as a means to learn it.
Thanks. Yeah, Rust is fair - but hard.
I've found out that the stopwords file for English differs from the one in the LIAAD repo.
I'll switch to theirs for now; after stabilization we can bring ours back.
That's probable. I do remember using what I thought was a more complete list.
When I test for match with LIAAD, I always use the LIAAD stopwords, so I don't think we need to worry about them at the moment.
I have been testing the plural-normalization code, and there is definitely a bug with stopwords. You get "this" as a keyword even though it is a stopword, because you are comparing against "thi". I applied to_single to all of the stopwords, and then I got a problem with the word "give": "gives" is a stopword, to_single turns it into "give", and thus "give" gets removed from the final keywords. When I manually accounted for this by not changing the stopword "gives" specifically, the plural normalization appeared to improve the results. However, this is of course not a valid solution.
So the problem is that stopwords need to be checked against the actual word, but when we calculate TFs our words need to be plural-normalized. Maybe we should just do the normalization in certain places, like when building and fetching contexts? I am not sure, but this is a bug.
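The two failure modes above can be reproduced in a few lines. This is a hypothetical sketch, not the yake-rust code: `to_single` here is a stand-in "strip trailing s" normalizer, assumed to behave like the real one for these inputs:

```rust
use std::collections::HashSet;

// Stand-in for yake-rust's to_single: strip a trailing "s" from words
// longer than 3 characters (an assumption for this sketch).
fn to_single(word: &str) -> String {
    if word.len() > 3 && word.ends_with('s') {
        word[..word.len() - 1].to_string()
    } else {
        word.to_string()
    }
}

fn main() {
    let stopwords: HashSet<&str> = ["this", "gives"].into_iter().collect();

    // Bug 1: comparing a normalized candidate against raw stopwords lets
    // "this" slip through, because "thi" is not in the stopword set.
    assert!(!stopwords.contains(to_single("this").as_str()));

    // Bug 2: normalizing the stopwords instead turns "gives" into "give",
    // which then wrongly filters the legitimate keyword "give".
    let normalized: HashSet<String> = stopwords.iter().map(|&w| to_single(w)).collect();
    assert!(normalized.contains("give"));
}
```

Whichever side is normalized, one of the two words is handled wrong, which is why normalizing only in specific places (TF/context computation) rather than globally looks like the way out.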
It appears that for the google text, we are returning "competitions" as ResultItem.raw, but LIAAD/yake is returning "competition".
Comparing to ResultItem.keyword is even worse, because those are lowercase, and LIAAD/yake returns e.g. "Google".
Plural-normalizing our output does not work either, because then we output "Venture", but LIAAD/yake outputs "Ventures" (possibly because "Venture" never appears in the text?).
I am not entirely sure how LIAAD/yake decided what to output.
Currently, if I
then I at least get very close to LIAAD/yake. Except for the cases I have filed bug reports for here, I get identical scores etc.
LIAAD/yake has special handling of plural here: https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L148
yake-rust lacks this logic. Compare a few of the valid terms and their TFs (from the part of the sample text about Kaggle and Google) between LIAAD/yake and yake-rust:
LIAAD/yake has these extra: {(1.0, 'communitie'), (1.0, '100,000'), (1.0, 'competitor'), (1.0, 'detail'), (1.0, 'integration'), (1.0, 'source'), (4.0, 'competition'), (2.0, 'scientist'), (2.0, 'host'), (1.0, 'co-founder'), (1.0, 'rumor'), (1.0, 'video'), (1.0, 'project')}
yake-rust has these extra: {(1.0, 'details'), (2.0, 'scientists'), (1.0, 'de'), (1.0, 'rumors'), (2.0, 'competitions'), (1.0, 'sources'), (1.0, 'videos'), (1.0, 'founder'), (1.0, 'competitors'), (1.0, 'projects'), (2.0, 'competition'), (1.0, 'communities'), (1.0, 'hosts'), (1.0, '100000'), (1.0, 'integrations'), (1.0, 'host')}
As can be seen, LIAAD/yake trims the trailing "s" from a word if the word has length > 3 and ends with an "s". Therefore it has terms like "integration" instead of "integrations", and it presumably also counts "integration" and "integrations" as the same term. This affects the result.
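As a sketch of what porting that logic might look like, here is the rule inferred from the term lists above (trim a trailing "s" when length > 3, and let singular and plural collapse into one TF bucket); the function names are hypothetical, and this is only my reading of LIAAD/yake's datarepresentation.py, not its actual code:

```rust
use std::collections::HashMap;

// Inferred plural rule: drop a trailing "s" when the word is longer than
// 3 characters, so "integrations" -> "integration" but "his" stays "his".
fn plural_normalize(word: &str) -> &str {
    if word.len() > 3 && word.ends_with('s') {
        &word[..word.len() - 1]
    } else {
        word
    }
}

// Count term frequencies over normalized words, so singular and plural
// forms accumulate into the same entry.
fn term_frequencies<'a>(words: &[&'a str]) -> HashMap<&'a str, f64> {
    let mut tf = HashMap::new();
    for &w in words {
        *tf.entry(plural_normalize(w)).or_insert(0.0) += 1.0;
    }
    tf
}

fn main() {
    let tf = term_frequencies(&["integration", "integrations", "hosts", "host"]);
    // "integration"/"integrations" and "host"/"hosts" each collapse
    // into a single term with TF 2.0, matching the LIAAD/yake lists above.
    assert_eq!(tf["integration"], 2.0);
    assert_eq!(tf["host"], 2.0);
}
```

Note that this only explains the plural entries in the diff; the "de"/"100,000"/"co-founder" differences look like separate tokenization issues.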