bunny-therapist opened 3 weeks ago
@xamgore
It seems like unicode-segmentation might be doing this.
```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let word: &str = "competition-centric";
    println!("{:?}", word.split_word_bounds().last());
}
```
The above prints `Some("centric")`, and `.count()` on the same iterator gives 3. So the string gets split into "competition", "-", and "centric". The "-" token consists entirely of punctuation symbols, so we drop it, but we then count "competition" and "centric" as two separate words.
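A minimal demonstration of the split plus a punctuation filter; the filter below is illustrative, not yake-rust's actual one:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let tokens: Vec<&str> = "competition-centric".split_word_bounds().collect();
    assert_eq!(tokens, ["competition", "-", "centric"]);

    // Dropping pure-punctuation segments (illustrative filter) leaves two
    // separate "words" where LIAAD/yake keeps one hyphenated word.
    let words: Vec<&str> = tokens
        .into_iter()
        .filter(|t| t.chars().any(char::is_alphanumeric))
        .collect();
    assert_eq!(words, ["competition", "centric"]);
}
```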
Tokenization approaches differ, yeah. I bet segtok handles English texts better than unicode-segmentation. Not much of a choice, though. https://github.com/huggingface/tokenizers is a relatively new crate; I haven't tried it yet.

As for Yake, it should accept the tokenized input as an argument.
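Something like this rough sketch, with a built-in default tokenizer; all names here are hypothetical, not the actual yake-rust API:

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Hypothetical trait: callers can plug in their own tokenizer.
pub trait Tokenize {
    /// Split `text` into sentences, each a list of word tokens.
    fn tokenize(&self, text: &str) -> Vec<Vec<String>>;
}

/// Hypothetical default fallback built on unicode-segmentation.
pub struct DefaultTokenizer;

impl Tokenize for DefaultTokenizer {
    fn tokenize(&self, text: &str) -> Vec<Vec<String>> {
        // Crude sentence split, then word segmentation per sentence,
        // dropping pure-punctuation segments.
        text.split_terminator(|c| c == '.' || c == '!' || c == '?')
            .map(|sentence| {
                sentence
                    .split_word_bounds()
                    .filter(|w| w.chars().any(char::is_alphanumeric))
                    .map(|w| w.to_string())
                    .collect::<Vec<String>>()
            })
            .filter(|words| !words.is_empty())
            .collect()
    }
}
```

The extractor could then take any `Tokenize` implementation and fall back to `DefaultTokenizer` when none is given, so the common case stays a drop-in replacement.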
So you would have to tokenize everything yourself? Wouldn't that make the crate a lot less useful? Or are you suggesting that as an option?
Ok, as long as there is still a default fallback, I agree.
I am all for configurability; I just want to make sure it is an easy drop-in replacement for LIAAD/yake in most cases.
Could you please share the list of tokens which the Python impl produces for the google text? `[[word, ..], ..]`
Assuming that the list of tokens requested is the thing generated in this line:

```python
self.sentences_str = [
    [
        w
        for w in split_contractions(web_tokenizer(s))
        if not (w.startswith("'") and len(w) > 1) and len(w) > 0
    ]
    for s in list(split_multi(text))
    if len(s.strip()) > 0
]
```
Then you can find them in this attached file: sentences_str.json
From there it creates a list of "unique terms"; this involves, e.g., plural normalization. You can find those here: terms.json
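For illustration, a minimal sketch of what such a normalization step could look like, assuming a naive strip-trailing-"s" heuristic (the actual LIAAD/yake logic may differ):

```rust
use std::collections::HashMap;

/// Naive singularization for illustration: "competitions" -> "competition".
fn normalize(term: &str) -> String {
    let lower = term.to_lowercase();
    match lower.strip_suffix('s') {
        Some(stem) => stem.to_string(),
        None => lower,
    }
}

/// Collapse surface forms onto unique terms, counting occurrences.
fn unique_terms<'a>(tokens: impl IntoIterator<Item = &'a str>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for t in tokens {
        *counts.entry(normalize(t)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let tokens = ["competition", "competitions", "Competition"];
    // All three surface forms collapse onto the same unique term.
    println!("{:?}", unique_terms(tokens)); // {"competition": 3}
}
```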
This comes up in the google-text test, since it contains the word "competition-centric", but this is most easily demonstrated using a short example text:
"I am a competition-centric person! I really like competitions. Every competition is a hoot!"
Extracting 10 keywords with yake-rust (my branch, but this problem is not handled in any branch) gives a different result than LIAAD/yake (the output is a set, not ordered, which does not matter for this example): yake-rust treats "competition" and "centric" as two separate words, whereas LIAAD/yake keeps "competition-centric" as one word.
I tried removing `-` from the punctuation symbols, but that does not fix it. I tried to figure out if it had anything to do with the unicode segmentation, but that investigation is still ongoing.
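If the segmentation finding above holds, that result makes sense: the split happens inside `split_word_bounds` before any punctuation filtering, so the punctuation set only decides whether "-" survives as its own token. A minimal sketch (the punctuation set below is illustrative):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let tokens: Vec<&str> = "competition-centric".split_word_bounds().collect();

    // Hypothetical punctuation set with '-' removed (illustrative only).
    let punctuation = ['!', '.', ','];
    let kept: Vec<&str> = tokens
        .iter()
        .copied()
        .filter(|t| !t.chars().all(|c| punctuation.contains(&c)))
        .collect();

    // "-" now survives as its own token, but "competition" and "centric"
    // are still never rejoined into "competition-centric".
    println!("{:?}", kept); // ["competition", "-", "centric"]
}
```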