v1.0 release thread - Githubissues

xamgore commented 3 weeks ago

Ok, my head is spinning, I suggest we all have the single place to discuss the following updates.

@bunny-therapist have aligned python & rust implementations, forked the repo and fixed a few things
@xamgore merged all those fixes into his fork, but have got different scores for 'google' text
16 issue is the last one

bunny-therapist commented 3 weeks ago

Ok, let's keep the discussion here then.

I should maybe clarify that the big "google" test I have in my branch does NOT pass. The other tests pass. I have just aligned the expected test result with the official "Yake!" webpage (using the LIAAD/yake implementation) results, because that is the goal.

I currently believe that if we handle the plural issue, then it is possible that we will match LIAAD/yake.

I believe this because clearly the basic stuff is there, given that the other tests are also based on the webpage/LIAAD, and if I only use half the google test, I get a match with webpage/LIAAD. The plural issue would be sufficient to explain the problem, given that it affects the few things that still differ.

bunny-therapist commented 3 weeks ago

Closes #10 Closes #8

bunny-therapist commented 3 weeks ago

To me, it seems like what LIAAD/yake does here is to when building the vocabulary check if the word has length > 3 and ends with an "s", and if so, the "s" is dropped (last character of word is not used). I tried to add something like that to vocabulary_building but I don't know how to do that without the rust compiler getting mad at me.

bunny-therapist commented 3 weeks ago

I suppose something like this? (I got it working in Rust Playground at least.)

fn main() {
    let word = "competitors";

    let mut actual_word: &str = word;
    if word.len() > 3 && word.ends_with("s") {
        actual_word = &word[0..word.len() - 1];
    }
    println!("{}", actual_word);
}

outputs "competitor".

Can we just use something like this in the vocabulary_building loop?

xamgore commented 3 weeks ago

let mut word: String = ...;

if word.ends_with('_') {
    word.pop();
}

If you wanna, let's do pair programming online? https://meet.google.com/nwe-bgsj-mpf

bunny-therapist commented 3 weeks ago

After dinner maybe :)

I am not entirely sure about "vocabulary" vs sentences vs candidates etc. I assume that what we want is to treat "competitors" as "competitor" etc, at least to a point.

bunny-therapist commented 3 weeks ago

Ah, yes, the getTerm method in LIAAD/yake is used everywhere. So this same logic is used when building sentences, building candidates, vocab etc. Whenever getting a "term", it is used. So the code treats "competitors" and "competitor" as the same word.

xamgore commented 3 weeks ago

Ok, let vocabulary be a map from s-less words into occurrences.

for (w_idx, word) in sentence.words.iter().enumerate() {
    if !word.is_empty()
        && word.chars().all(char::is_alphanumeric)
        && HashSet::from_iter(word.chars()).intersection(&self.config.punctuation).next().is_none()
    {
        let index = word.to_lowercase().to_single();  // <----
        let occurrence = Occurrence { shift_offset: shift + w_idx, idx, word, shift };
        words.entry(index).or_default().push(occurrence)
    }
}

Probably stripping the 's' should happen at all places where .to_lowercase() is applied.

DedupSubgram key
Words key
Context key,
Features key

quesurifn commented 3 weeks ago

Pointing this to 1.0.0-alpha branch to get all of these changes production ready

xamgore commented 3 weeks ago

Well, thanks, but it still WIP 😄

quesurifn commented 3 weeks ago

Shit.

quesurifn / yake-rust

v1.0 release thread #17

16 issue is the last one