quesurifn / yake-rust

MIT License
6 stars 5 forks source link

Possible issues with context_building #12

Open bunny-therapist opened 4 weeks ago

bunny-therapist commented 4 weeks ago

The function context_building

The piece of code that adds entries to the contexts appears to add each thing twice. If there is no entry for "word", it creates an entry with ([w], []), then on the next line it adds w so we get it twice: ([w, w], []). Same with the entry for "w", it creates ([], [word]), then adds word so we get it twice, ([], [word, word]).

Correct me if I am wrong ,but I suspect that there should just be one entry (the context 0/1 elements appear to be incoming/outgoing edges for the word node) and that this is a mistake?

Should the initialization via or_insert not just create two empty vectors, then let the next line add it?

As a second point, I don't understand the leading if statement inside the words.iter() loop. It iterates word over the elements of words, so surely words.contains(word) is always true?

This is what the function would look like after fixing the entries and removing the conditional:

    fn context_building(&mut self, words: Words, sentences: Sentences) -> (Contexts, Words, Sentences) {
        let cloned_sentences = sentences.clone();
        let mut contexts = Contexts::new();
        for sentence in cloned_sentences {
            let words = sentence.words.iter().map(|w| w.to_lowercase()).collect::<Vec<String>>();
            let mut buffer = Vec::<String>::new();
            for word in words.iter() {
                let min_range = max(0, buffer.len() as i32 - self.config.window_size as i32) as usize;
                let max_range = buffer.len();
                let buffered_words = &buffer[min_range..max_range];
                for w in buffered_words {
                    let entry_1 = contexts.entry(word.to_string()).or_insert((Vec::<String>::new(), Vec::<String>::new()));
                    entry_1.0.push(w.to_string());
                    let entry_2 = contexts.entry(w.to_string()).or_insert((Vec::<String>::new(), Vec::<String>::new()));
                    entry_2.1.push(word.to_string());
                }
                buffer.push(word.to_string());
            }
        }
bunny-therapist commented 4 weeks ago

Note that I may very well be wrong here. I just can't make sense of this duplication.

bunny-therapist commented 4 weeks ago

@xamgore I see you noticed the same thing: https://github.com/xamgore/yake-rust/blob/master/yake_rust/src/lib.rs#L243

xamgore commented 4 weeks ago

Would you like to proceed on this together? Assigned you a contributor anyway.

bunny-therapist commented 4 weeks ago

I am not sure how to organize this. I have been tracking bugs all day and I think I found and fixed a few. Right now I am working on another...

xamgore commented 4 weeks ago

Ah, I see. I'll adapt your changes at my branch.

bunny-therapist commented 4 weeks ago

This is definitely a bug. The contexts should be initialized with empty vectors so we do not have duplicates. I fixed it in my fork: https://github.com/bunny-therapist/yake-rust

I am getting much closer to the LIAAD/yake results, but it is hard for me because I am at this point working with multiple versions of the code and it is very hard for me to track everything.

I will create issues for the other bugs I have found as well I think.

xamgore commented 4 weeks ago

Stick to your repo then, I'll just cherry-pick the commits. I'm more concerned of the design & maintainability.