Closed xamgore closed 3 weeks ago
Ok, let's keep the discussion here then.
I should maybe clarify that the big "google" test I have in my branch does NOT pass. The other tests pass. I have just aligned the expected test result with the official "Yake!" webpage (using the LIAAD/yake implementation) results, because that is the goal.
I currently believe that if we handle the plural issue, then it is possible that we will match LIAAD/yake.
I believe this because clearly the basic stuff is there, given that the other tests are also based on the webpage/LIAAD, and if I only use half the google test, I get a match with webpage/LIAAD. The plural issue would be sufficient to explain the problem, given that it affects the few things that still differ.
Closes #10 Closes #8
To me, it seems like what LIAAD/yake does here is to when building the vocabulary check if the word has length > 3 and ends with an "s", and if so, the "s" is dropped (last character of word is not used). I tried to add something like that to vocabulary_building
but I don't know how to do that without the rust compiler getting mad at me.
I suppose something like this? (I got it working in Rust Playground at least.)
fn main() {
let word = "competitors";
let mut actual_word: &str = word;
if word.len() > 3 && word.ends_with("s") {
actual_word = &word[0..word.len() - 1];
}
println!("{}", actual_word);
}
outputs "competitor".
Can we just use something like this in the vocabulary_building
loop?
let mut word: String = ...;
if word.ends_with('_') {
word.pop();
}
If you wanna, let's do pair programming online? https://meet.google.com/nwe-bgsj-mpf
After dinner maybe :)
I am not entirely sure about "vocabulary" vs sentences vs candidates etc. I assume that what we want is to treat "competitors" as "competitor" etc, at least to a point.
Ah, yes, the getTerm
method in LIAAD/yake is used everywhere. So this same logic is used when building sentences, building candidates, vocab etc. Whenever getting a "term", it is used. So the code treats "competitors" and "competitor" as the same word.
Ok, let vocabulary be a map from s-less words into occurrences.
for (w_idx, word) in sentence.words.iter().enumerate() {
if !word.is_empty()
&& word.chars().all(char::is_alphanumeric)
&& HashSet::from_iter(word.chars()).intersection(&self.config.punctuation).next().is_none()
{
let index = word.to_lowercase().to_single(); // <----
let occurrence = Occurrence { shift_offset: shift + w_idx, idx, word, shift };
words.entry(index).or_default().push(occurrence)
}
}
Probably stripping the 's' should happen at all places where .to_lowercase()
is applied.
DedupSubgram
keyWords
keyContext
key,Features
keyPointing this to 1.0.0-alpha branch to get all of these changes production ready
Well, thanks, but it still WIP 😄
Shit.
Ok, my head is spinning, I suggest we all have the single place to discuss the following updates.
16 issue is the last one