quesurifn / yake-rust

MIT License

Order the shorter seq as the first argument #2

Closed (BVegNow closed 1 year ago)

BVegNow commented 1 year ago

Following the Python example, the shorter string is passed as seq1.

BVegNow commented 1 year ago

levenshtein_distance patch

I was actually trying to fix the call to levenshtein_distance. The real fix is in the branch linked above, which is waiting for a PR to be accepted in the rs-natural crate. Without grapheme support, the function call gives unexpected results in yake.
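To illustrate why the unit of comparison matters, here is a minimal sketch using only the standard library. It is not the rs-natural implementation or the patch in this PR; the function names levenshtein, byte_distance and char_distance are hypothetical. The same algorithm gives different answers depending on whether a string is split into bytes or chars. Full grapheme-cluster support would need an extra crate such as unicode-segmentation, which is assumed here and not shown.

```rust
/// Levenshtein distance over slices of any comparable unit
/// (bytes, chars, or grapheme clusters). Illustrative only.
fn levenshtein<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    // Put the shorter sequence first so the working row stays small.
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };

    // prev[j] holds the distance between the already processed prefix
    // of `long` and the first j units of `short`.
    let mut prev: Vec<usize> = (0..=short.len()).collect();

    for (i, x) in long.iter().enumerate() {
        let mut curr = Vec::with_capacity(short.len() + 1);
        curr.push(i + 1);
        for (j, y) in short.iter().enumerate() {
            let cost = if x == y { 0 } else { 1 };
            let value = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1)    // deletion
                .min(curr[j] + 1);       // insertion
            curr.push(value);
        }
        prev = curr;
    }
    prev[short.len()]
}

/// Distance counted in UTF-8 bytes.
fn byte_distance(a: &str, b: &str) -> usize {
    levenshtein(a.as_bytes(), b.as_bytes())
}

/// Distance counted in Unicode scalar values (Rust `char`s).
fn char_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    levenshtein(&a, &b)
}
```

With these helpers, byte_distance("\u{e9}", "e") is 2 because the accented letter occupies two UTF-8 bytes, while char_distance("\u{e9}", "e") is 1. Any score derived from the distance and the string length shifts accordingly.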

quesurifn commented 1 year ago

I think I fixed it in the most recent version of my crate. I went down this path and found a workaround that brings the results very close to the original YAKE algorithm.

Let me know.

BVegNow commented 1 year ago

If I use the scores to rank the same keyword across searches that point to different source texts, should I use the raw score output, or should I normalize the scores from each text to the range 0 to 1? I ask because my scores tend to be quite high. For example, using the GitHub Python usage example with an n-gram size of 3 and 20 keywords, my higher keyword scores get close to 1, while the Python scores remain quite low. With other texts in Rust I even get scores well over 1, so I am wondering whether it is better to normalize when comparing scores across source texts, or whether the absolute score is fine. Thanks
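For reference, the per-text normalization described in the question could look like the sketch below. It is a plain min-max rescale, not part of yake-rust or a recommendation from the crate author; the function name min_max_normalize and the choice to map a constant list to all zeros are assumptions made for illustration. The input is assumed to be the raw scores for one source text.

```rust
/// Rescale the keyword scores of a single text to the range [0, 1].
/// The relative order of the scores is preserved.
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    if scores.is_empty() {
        return Vec::new();
    }

    let min = scores.iter().copied().fold(f64::INFINITY, f64::min);
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;

    // Assumption: when every score is identical there is nothing to
    // rescale, so map all of them to 0.0 instead of dividing by zero.
    if range == 0.0 {
        return vec![0.0; scores.len()];
    }

    scores.iter().map(|s| (s - min) / range).collect()
}
```

For example, min_max_normalize(&[0.5, 1.5, 2.5]) gives [0.0, 0.5, 1.0]. This rescale keeps the ranking within one text but discards the absolute level of the scores, so a text whose keywords are all weak still ends up spanning the full range from 0 to 1.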