Closed xamgore closed 4 weeks ago
Hey, sorry for getting back to you almost a year later. I'll try to make this a priority in the coming days. There's no excuse for an out-of-bounds panic. I haven't dealt with non-English characters, but I wonder whether they are stored in a different encoding and that is what causes the out-of-bounds access. I'll take a look.
I encountered this error with German text as well.
@bunny-therapist Do you mind sending me some example text to reproduce?
Ok, I will try tomorrow.
@quesurifn Here is a demo showing that you can't index a vector of chars by its string's length.
str::len(): Returns the length of self. This length is in bytes, not chars or graphemes.
char: char is always four bytes in size.
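To make the mismatch concrete, here is a minimal sketch of the problem. The helper name `byte_and_char_len` and the sample word are assumptions chosen for illustration, not code from the crate. It compares the byte length reported by `str::len()` with the number of `char`s, and shows that a `Vec<char>` built from the string has no element at the position a byte-based bound would reach.

```rust
/// Returns (byte length, char count) for a string.
/// Hypothetical helper, used only for this illustration.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "ü" takes two bytes in UTF-8, so the two lengths differ.
    let word = "frühling";
    let chars: Vec<char> = word.chars().collect();
    let (bytes, count) = byte_and_char_len(word);
    println!("bytes = {}, chars = {}", bytes, count); // prints "bytes = 9, chars = 8"

    // A loop bounded by the byte length would eventually ask for chars[8].
    // `get` returns None there, where direct indexing would panic.
    assert!(chars.get(count).is_none());
}
```

Any loop that iterates up to `word.len()` while indexing into `chars` will therefore run past the end as soon as the input contains a multi-byte character.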
I call get_n_best with n=10 on this text:
Dein wöchentlicher amazon Newsletter Entdecke jetzt die neuen Angebote um deinen Garten bereit für den Frühling zu machen. Auch die neusten Trends für Damen und Herren Mode sind wieder dabei.
Custom stopwords attached: stopwords.txt
(I copied your code and modified Yake::new with a new Option<HashSet<String>> argument to override the default stopwords.)
Settings: ngram=1, remove_duplicates=True
This crashes in the Levenshtein distance function with the panic "index out of bounds: the len is 8 but the index is 8".
If I replace the special characters like "ä"->"ae" (in both text and stopwords), it does not crash. However, it does score keywords differently than another yake implementation where I can keep the special characters (I guess it makes sense that the results come out slightly different when you change the input).
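For reference, the workaround described above could look roughly like the sketch below. The function name `transliterate_german` is hypothetical, and only the "ä"->"ae" mapping comes from this comment; the other mappings are assumed by analogy. Since it changes the input text, it also changes the scores, so it is a stopgap and not a fix.

```rust
/// Replaces German umlauts and ß with ASCII digraphs.
/// Hypothetical helper; only 'ä' -> "ae" is taken from the thread,
/// the remaining mappings are assumptions.
fn transliterate_german(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            'ä' => out.push_str("ae"),
            'ö' => out.push_str("oe"),
            'ü' => out.push_str("ue"),
            'Ä' => out.push_str("Ae"),
            'Ö' => out.push_str("Oe"),
            'Ü' => out.push_str("Ue"),
            'ß' => out.push_str("ss"),
            other => out.push(other),
        }
    }
    out
}

fn main() {
    println!("{}", transliterate_german("Frühling"));
}
```

The same function would have to be applied to the stopword list so that text and stopwords stay comparable.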
Hm. I see the Levenshtein distance function is actually in the "natural" crate, and there is a TODO about graphemes there. https://github.com/lexi-sh/rs-natural/blob/master/src/distance.rs#L79 Should this maybe be an issue on that project instead?
Alternatively, re-implement levenshtein distance in this project - unicode-segmentation is already a dependency, so it should (?) be easy to get the graphemes (it seems unicode-segmentation does that).
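A minimal sketch of such a re-implementation is shown below. It is not the code from the "natural" crate or from any fork; the name `levenshtein_chars` is hypothetical. To stay dependency-free it works on `char`s (Unicode scalar values) instead of grapheme clusters. Assuming unicode-segmentation is available, collecting `UnicodeSegmentation::graphemes(s, true)` into a `Vec<&str>` in place of the `chars()` calls should give the grapheme-level variant.

```rust
/// Levenshtein distance computed over `char`s, so byte lengths
/// never leak into the indexing. Hypothetical sketch.
fn levenshtein_chars(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the first i-1 chars of `a`
    // and the first j chars of `b`; curr is the row being filled.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn main() {
    // One substitution (ü -> u) plus one insertion (e).
    println!("{}", levenshtein_chars("frühling", "fruehling")); // prints 2
}
```

All indices here are derived from the lengths of the `Vec<char>` values themselves, so a multi-byte character cannot push an index past the end.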
@bunny-therapist could you please test with my fork? Cargo.toml:
[patch.crates-io]
yake-rust = { git = "https://github.com/xamgore/yake-rust.git" }
Can I use it with my custom stopwords? I looked at it but it did not seem like it?
Edit: I will try to copy the changes to my local version.
That did not crash and seemingly worked, though I get a different result than with https://github.com/LIAAD/yake
For the text I pasted above, I now get (in the format lowercase keyword: score):
"entdecke": 0.13736775609881255,
"angebote": 0.13736775609881255,
"garten": 0.13736775609881255,
"newsletter": 0.13736775609881255,
"frühling": 0.13736775609881255,
"wöchentlicher": 0.22888381177383058,
"amazon": 0.2551467973374846,
"bereit": 0.2551467973374846,
"herren": 0.6057436594524441,
"mode": 0.6057436594524441
but with the other yake implementation I get (in the format lowercase keyword: score):
"newsletter": "0.09705179139403544",
"angebote": "0.09705179139403544",
"garten": "0.09705179139403544",
"frühling": "0.09705179139403544",
"amazon": "0.2005079697193566",
"trends": "0.2718250226855089",
"damen": "0.2718250226855089",
Ok, nice, then everything is working fine.
Can I use it with my custom stopwords? I looked at it but it did not seem like it?
You have to fork it and propose a PR to https://github.com/quesurifn/yake-rust/
Ok, it looks like the previous comment contained a bit of extra logic. Here is LIAAD/yake compared to your fork:
Your fork: [('Entdecke', 0.12388833579335316), ('Garten', 0.13736775609881255), ('Angebote', 0.13736775609881255), ('Frühling', 0.13736775609881255), ('bereit', 0.2551467973374846), ('Trends', 0.6057436594524441), ('Herren', 0.6057436594524441), ('Damen', 0.6057436594524441), ('Mode', 0.6057436594524441), ('neusten', 1.2675842466345355)]
LIAAD/yake: [('Angebote', 0.09705179139403544), ('Garten', 0.09705179139403544), ('Frühling', 0.09705179139403544), ('Entdecke', 0.12363091320521931), ('bereit', 0.2005079697193566), ('Trends', 0.2718250226855089), ('Damen', 0.2718250226855089), ('Herren', 0.2718250226855089), ('Mode', 0.2718250226855089), ('neusten', 0.46553351027698087)]
With the same text and settings. I know some have the same score, and those with the same score I suppose could be in any order here, but still, these results are different.
Why would they be different? Is that not a problem? Or do we think LIAAD/yake has the problem?
(I suppose the different scores here have nothing to do with your changes to levenshtein though, since that is just the deduplication.)
Ok, then I believe pull request #9 solves the current issue. @quesurifn it was battle-tested, could you merge it please?
Hi! Thanks for such a great crate. There is an issue, though. If we try a text in a non-English language, like the following one:
We'll quickly get an error:
If we look at the code:
I'm not sure whether the algorithm is correct, but here is another implementation.