quesurifn / yake-rust

Align handling of hyphenated words with LIAAD/yake #20

Open bunny-therapist opened 2 days ago

bunny-therapist commented 2 days ago

This comes up in the google-text test, since it contains the word "competition-centric", but this is most easily demonstrated using a short example text:

"I am a competition-centric person! I really like competitions. Every competition is a hoot!"

Extracting 10 keywords with yake-rust (my branch, but this problem is not handled in any branch) returns (this is a set, not ordered - does not matter for this example):

{'person', 'competition', 'hoot', 'centric'}

whereas LIAAD/yake returns

{'person', 'competition', 'hoot', 'competition-centric'}

In other words, yake-rust treats "competition" and "centric" as two words, whereas LIAAD/yake keeps "competition-centric" as one word.

I tried removing - from the punctuation symbols, but that does not fix it. I tried to figure out if it had anything to do with the unicode segmentation, but that investigation is still ongoing.

bunny-therapist commented 2 days ago

@xamgore

bunny-therapist commented 2 days ago

It seems like unicode-segmentation might be doing this.


```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let word: &str = "competition-centric";
    // Prints Some("centric"): the word is split at the hyphen.
    println!("{:?}", word.split_word_bounds().last());
    // Prints 3: the segments are "competition", "-", "centric".
    println!("{}", word.split_word_bounds().count());
}
```

The above prints Some("centric") and 3, so it seems the word got split into "competition", "-", "centric". The "-" consists of punctuation symbols, so we drop it, but we then count "competition" and "centric" as separate words.
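
One possible workaround, just as a sketch (not implemented in any branch; merge_hyphenated is a made-up helper): glue word-hyphen-word runs back together right after segmentation, before punctuation segments are dropped.

```rust
use unicode_segmentation::UnicodeSegmentation;

// Hypothetical post-processing step (not in any branch): re-join
// word-hyphen-word runs produced by split_word_bounds so that
// "competition-centric" stays one token, as in LIAAD/yake.
fn merge_hyphenated(text: &str) -> Vec<String> {
    let parts: Vec<&str> = text.split_word_bounds().collect();
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while i < parts.len() {
        let part = parts[i];
        let prev_is_word = tokens
            .last()
            .map_or(false, |t| t.chars().all(|c| c.is_alphanumeric() || c == '-'));
        let next_is_word =
            i + 1 < parts.len() && parts[i + 1].chars().all(char::is_alphanumeric);
        if part == "-" && prev_is_word && next_is_word {
            // Glue the previous token, the hyphen, and the next segment together.
            let merged = format!("{}-{}", tokens.pop().unwrap(), parts[i + 1]);
            tokens.push(merged);
            i += 2;
        } else {
            if !part.trim().is_empty() {
                tokens.push(part.to_string());
            }
            i += 1;
        }
    }
    tokens
}

fn main() {
    // Prints ["I", "am", "a", "competition-centric", "person", "!"]
    println!("{:?}", merge_hyphenated("I am a competition-centric person!"));
}
```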

xamgore commented 2 days ago

Tokenization approaches differ, yeah. I bet segtok handles English texts better than unicode-segmentation. Not much of a choice, though. https://github.com/huggingface/tokenizers is a relatively novel crate, haven't tried it yet.

As for Yake, it should accept the tokenized input as an argument.
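
A rough sketch of what that could look like; every name below (Yake, Sentence, extract_keywords_from_tokens) is hypothetical, not the current yake-rust API:

```rust
use unicode_segmentation::UnicodeSegmentation;

// Hypothetical API sketch: the caller may pass pre-tokenized sentences,
// or fall back to a built-in unicode-segmentation tokenizer.
pub struct Sentence {
    pub words: Vec<String>,
}

pub struct Yake;

impl Yake {
    /// Caller supplies already-tokenized sentences (e.g. from segtok,
    /// Razdel, or huggingface/tokenizers bindings).
    pub fn extract_keywords_from_tokens(&self, _sentences: &[Sentence]) -> Vec<String> {
        unimplemented!("scoring stays the same; only tokenization moves out")
    }

    /// Convenience fallback: tokenize with unicode-segmentation, then
    /// delegate to the method above, so the crate stays a drop-in tool.
    pub fn extract_keywords(&self, text: &str) -> Vec<String> {
        let sentences: Vec<Sentence> = text
            .unicode_sentences()
            .map(|s| Sentence {
                words: s
                    .split_word_bounds()
                    .filter(|w| !w.trim().is_empty())
                    .map(str::to_string)
                    .collect(),
            })
            .collect();
        self.extract_keywords_from_tokens(&sentences)
    }
}
```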

bunny-therapist commented 2 days ago

So you would have to tokenize everything yourself? Wouldn't that make the crate a lot less useful? Or are you suggesting that as an option?

xamgore commented 2 days ago

Tokenize yourself or use the predefined fallback from yake, yeah.

It makes sense, as language-specific tokenizers show better performance. Check out Razdel for ru or segtok for en, de.

bunny-therapist commented 2 days ago

Ok, as long as there is still a default fallback, I agree.

I am all for configurability; I just want to make sure it remains an easy drop-in replacement for LIAAD/yake in most cases.

xamgore commented 2 days ago

Could you please share the list of tokens which the Python impl produces for the google text? [[word, ..], ..]

bunny-therapist commented 1 day ago

Assuming that the list of tokens requested is the thing generated in this line:

```python
self.sentences_str = [
    [
        w
        for w in split_contractions(web_tokenizer(s))
        if not (w.startswith("'") and len(w) > 1) and len(w) > 0
    ]
    for s in list(split_multi(text))
    if len(s.strip()) > 0
]
```

Then you can find them in this attached file: sentences_str.json
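
For comparison, here is a rough unicode-segmentation approximation of that line. It will not match segtok's split_multi/web_tokenizer exactly (that mismatch is what this issue is about), and sentences_str here is just an illustrative name:

```rust
use unicode_segmentation::UnicodeSegmentation;

// Loose approximation of the Python list comprehension above, for
// comparing token output only; unicode-segmentation's sentence and word
// boundaries are not the same as segtok's.
fn sentences_str(text: &str) -> Vec<Vec<String>> {
    text.unicode_sentences()
        .filter(|s| !s.trim().is_empty())
        .map(|s| {
            s.split_word_bounds()
                .filter(|w| !w.trim().is_empty())
                // Mirror the Python filter that drops contraction
                // leftovers like "'s" while keeping a lone apostrophe.
                .filter(|w| !(w.starts_with('\'') && w.len() > 1))
                .map(str::to_string)
                .collect()
        })
        .collect()
}

fn main() {
    for sentence in sentences_str("I am a competition-centric person! I really like competitions.") {
        println!("{:?}", sentence);
    }
}
```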

bunny-therapist commented 1 day ago

From there it creates a list of "unique terms". This involves, e.g., the plural normalization. You can find those here: terms.json