Open bunny-therapist opened 2 weeks ago
@xamgore
@xamgore - just making sure you saw this. Maybe you found a simpler way to solve this?
Python code is a mess, really no way for me to comprehend it right now 😄 maybe on the weekend
I've seen that candidate_filtering also throws punctuation words out.
Yeah, but the vocabulary is apparently different from the candidates. The "buffer/buffer_words" vec ends up with elements like "!" and "?", which then affect ctx.0/1, and thus wr and wl, and thus frequency and relatedness.
Hm, right. Can the candidate words have punctuation inside, like abc!?def? I've never dealt with Unicode segmentation.
Based on the python code, it skips words that are composed entirely of punctuation. So "abc!?def" would be ok, but "!?" would not.
The created contexts contain punctuation symbols. If a word is just composed of punctuation symbols, it should be skipped and the buffer emptied.
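To make the rule concrete, here is a rough sketch of what I mean. It is not the actual yake-rust code: `is_punctuation_only`, `cooccurrence_pairs`, the window handling, and the punctuation set are all made up for illustration. The only thing taken from the discussion is the rule itself: a word made up entirely of punctuation is skipped and the buffer is emptied, while a word like abc!?def is kept.

```rust
use std::collections::HashSet;

/// Hypothetical helper: true when every character of `word` is in `punctuation`.
/// Note that an empty string counts as punctuation-only (vacuously true).
fn is_punctuation_only(word: &str, punctuation: &HashSet<char>) -> bool {
    word.chars().all(|c| punctuation.contains(&c))
}

/// Hypothetical sketch of context building: returns (left_neighbor, word)
/// pairs for words that are at most `window` positions apart. A
/// punctuation-only word is skipped and empties the buffer, so no pair
/// ever spans it.
fn cooccurrence_pairs<'a>(
    words: &[&'a str],
    window: usize,
    punctuation: &HashSet<char>,
) -> Vec<(&'a str, &'a str)> {
    let mut buffer: Vec<&'a str> = Vec::new();
    let mut pairs = Vec::new();

    for &word in words {
        if is_punctuation_only(word, punctuation) {
            // Skip the word and forget everything seen before it.
            buffer.clear();
            continue;
        }

        // Every word still in the buffer is a left neighbor of `word`.
        for &left in &buffer {
            pairs.push((left, word));
        }

        // Keep only the last `window` words as future left neighbors.
        buffer.push(word);
        if buffer.len() > window {
            buffer.remove(0);
        }
    }

    pairs
}
```

With a window of 2 and the words "good", "morning", "!", "abc!?def", "works", this produces only the pairs (good, morning) and (abc!?def, works). The "!" never lands in the buffer, and nothing before it is paired with anything after it.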
I fixed this in my branch here: https://github.com/bunny-therapist/yake-rust/commit/a9c3a9917a49830ed5133b1ef6b1bcbe57b671e5
The relevant part in LIAAD/yake is here: https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L59 The "exclude" chars in LIAAD/yake are what is called "punctuation" in yake-rust.