quesurifn / yake-rust

MIT License
6 stars 5 forks source link

Fails on non-English texts #3

Closed xamgore closed 4 weeks ago

xamgore commented 1 year ago

Hi! Thanks for such a great crate. There is an issue, though. If we try a text in a non-English language, like the following one:

Здравствуйте уважаемые подписчики и зрители канала гулагунец вам я Владимир осечкин Сегодня у нас 2 мая 2023 года сегодня будет очень важный знаковый эфир будем раскрывать очень важную информацию во многом это впервые будет озвучено на нашем канале Да и в целом в паблике пожалуйста напишите как звук Как слышно Как видно и тогда Через минуту начнем всех еще раз приветствую Спасибо и вижу Владислав модератор чата подключился тоже к просмотру Спасибо ему за помощь Спасибо сандрия и всей команде поддержки Кто донатит кто является спонсором нашего канала потому что именно благодаря вашей поддержке мы сохраняем нашу Независимость и что не менее важно эффективность Так отлично хорошо вижу Спасибо большое дорогие друзья если видно слышно хорошо мы долго думали по поводу того пускать ли в прямой эфир тех непосредственно людей которые сейчас дают очень важные свидетельские показания кто-то из них уже покинул территорию России кто-то находится в России и дает показания в том числе следователям

We'll quickly get an error:

index out of bounds: the len is 12 but the index is 12
  thread 'tests::it_works' panicked at 'index out of bounds: the len is 12 but the index is 12',
  /Users/user/.cargo/registry/src/github.com-1ecc6299db9ec823/natural-0.3.0/src/distance.rs:90:16

If we look at the code:

// ported from http://hetland.org/coding/python/levenshtein.py
pub fn levenshtein_distance(str1: &str, str2: &str) -> usize {
    let n = str1.len();
    let m = str2.len();

    let mut column: Vec<usize> = (0..n + 1).collect();
    // TODO this probaly needs to use graphemes
    let a_vec: Vec<char> = str1.chars().collect();
    let b_vec: Vec<char> = str2.chars().collect();
    for i in 1..m + 1 {
        let previous = column;
        column = vec![0; n + 1];
        column[0] = i;
        for j in 1..n + 1 {
            let add = previous[j] + 1;
            let delete = column[j - 1] + 1;
            let mut change = previous[j - 1];
            if a_vec[j - 1] != b_vec[i - 1] {  // <---------- here
                change = change + 1
            }
            column[j] = min3(add, delete, change);
        }
    }
    column[n]
}

I'm not sure, whether the algorithm is correct, but here is another implementation.

quesurifn commented 9 months ago

Hey, sorry for getting back to you almost a year later. I'll try to make this a priority in the coming days. There's no excuse for out of bounds. I haven't dealt with non-english chars, but I'm wondering if they exist in a different encoding and hence out of bounds are taking place. I'll take a look.

bunny-therapist commented 4 weeks ago

I encountered this error with German text as well.

quesurifn commented 4 weeks ago

@bunny-therapist Do you mind sending me some example text to reproduce?

bunny-therapist commented 4 weeks ago

Ok, I will try tomorrow.

xamgore commented 4 weeks ago

@quesurifn demo showing you can't index vector by its string's length.

str::len() Returns the length of self. This length is in bytes, not chars or graphemes.

char char is always four bytes in size.

bunny-therapist commented 4 weeks ago

I call get_n_best with n=10

text:

Dein wöchentlicher amazon Newsletter Entdecke jetzt die neuen Angebote um deinen Garten bereit für den Frühling zu machen. Auch die neusten Trends für Damen und Herren Mode sind wieder dabei.

custom stopwords attached stopwords.txt (I copied your code and modified Yake::new with a new Option<HashSet<String>> argument for this, to override the default stopwords)

ngram=1, remove_duplicates=True

This crashes in the levenshtein distance function with index out of bounds: the len is 8 but the index is 8

bunny-therapist commented 4 weeks ago

If I replace the special characters like "ä"->"ae" (in both text and stopwords), it does not crash. However, it does score keywords differently than using another yake implementation where I can keep the special characters (I guess it makes sense the results comes out slightly different when you change the input).

bunny-therapist commented 4 weeks ago

Hm. I see levenshtein distance function is actually in "natural", and there is a TODO about graphemes there. https://github.com/lexi-sh/rs-natural/blob/master/src/distance.rs#L79 Should this be an issue on that project instead maybe?

Alternatively, re-implement levenshtein distance in this project - unicode-segmentation is already a dependency, so it should (?) be easy to get the graphemes (it seems unicode-segmentation does that).

xamgore commented 4 weeks ago

@bunny-therapist could you please test with my fork? Cargo.toml:

[patch.crates-io]
yake-rust = { git = "https://github.com/xamgore/yake-rust.git" }
bunny-therapist commented 4 weeks ago

@bunny-therapist could you please test with my fork? Cargo.toml:

[patch.crates-io]
yake-rust = { git = "https://github.com/xamgore/yake-rust.git" }

Can I use it with my custom stopwords? I looked at it but it did not seem like it?

Edit: I will try to copy the changes to my local version.

bunny-therapist commented 4 weeks ago

That did not crash and seemingly worked, though I get a different result than with https://github.com/LIAAD/yake

For the text I pasted above, I now get (in the format lowercase keyword: score):

"entdecke": 0.13736775609881255,
"angebote":  0.13736775609881255,
"garten": 0.13736775609881255,
"newsletter": 0.13736775609881255,
"frühling": 0.13736775609881255,
"wöchentlicher": 0.22888381177383058,
"amazon": 0.2551467973374846,
"bereit": 0.2551467973374846,
"herren": 0.6057436594524441,
"mode": 0.6057436594524441)

but with the other yake implementation I get (in the format lowercase keyword: score):

"newsletter": "0.09705179139403544",
"angebote": "0.09705179139403544",
"garten": "0.09705179139403544",
"frühling": "0.09705179139403544",
"amazon": "0.2005079697193566",
"trends": "0.2718250226855089",
"damen": "0.2718250226855089",
xamgore commented 4 weeks ago

Ok, nice, than everything is working fine.

Can I use it with my custom stopwords? I looked at it but it did not seem like it?

You have'to fork it and propose a PR to https://github.com/quesurifn/yake-rust/

bunny-therapist commented 4 weeks ago

Ok the previous comment contained a bit of extra logic it looks like. Here is LIAAD/yake compared to your fork:

Your fork: [('Entdecke', 0.12388833579335316), ('Garten', 0.13736775609881255), ('Angebote', 0.13736775609881255), ('Frühling', 0.13736775609881255), ('bereit', 0.2551467973374846), ('Trends', 0.6057436594524441), ('Herren', 0.6057436594524441), ('Damen', 0.6057436594524441), ('Mode', 0.6057436594524441), ('neusten', 1.2675842466345355)]

LIAAD/yake: [('Angebote', 0.09705179139403544), ('Garten', 0.09705179139403544), ('Frühling', 0.09705179139403544), ('Entdecke', 0.12363091320521931), ('bereit', 0.2005079697193566), ('Trends', 0.2718250226855089), ('Damen', 0.2718250226855089), ('Herren', 0.2718250226855089), ('Mode', 0.2718250226855089), ('neusten', 0.46553351027698087)]

With the same text and settings. I know some have the same score, and those with the same score I suppose could be in any order here, but still, these results are different.

Why would they be different? Is that not a problem? Or do we think LIAAD/yake has the problem?

bunny-therapist commented 4 weeks ago

(I suppose the different scores here have nothing to do with your changes to levenshtein though, since that is just the deduplication.)

xamgore commented 4 weeks ago

Ok, then I believe #9 pull request solves the current issue. @quesurifn it was battle-tested, could you merge please?