quesurifn / yake-rust

MIT License

Issue with punctuation and context building #14

Open bunny-therapist opened 2 weeks ago

bunny-therapist commented 2 weeks ago

The created contexts contain punctuation symbols. If a word is just composed of punctuation symbols, it should be skipped and the buffer emptied.

I fixed this in my branch here: https://github.com/bunny-therapist/yake-rust/commit/a9c3a9917a49830ed5133b1ef6b1bcbe57b671e5

The relevant part in LIAAD/yake is here: https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L59 The "exclude" chars in LIAAD/yake are what is called "punctuation" in yake-rust.
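To make the intended behaviour concrete, here is a minimal sketch, not the code from the commit above. The function name `collect_pairs`, the `window` parameter, and the `is_punct_only` predicate are hypothetical names chosen for illustration and are not taken from yake-rust or LIAAD/yake. It only shows the idea of skipping a punctuation-only word and emptying the buffer, so that words on either side of it never end up in each other's context.

```rust
/// Collects (left, right) co-occurrence pairs within `window` preceding words.
/// Hypothetical helper: a punctuation-only token is skipped and resets the buffer.
fn collect_pairs<'a>(
    words: &[&'a str],
    window: usize,
    is_punct_only: impl Fn(&str) -> bool,
) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    let mut buffer: Vec<&'a str> = Vec::new();

    for &word in words {
        if is_punct_only(word) {
            // Skip the token and drop everything seen so far,
            // so context never spans across punctuation.
            buffer.clear();
            continue;
        }
        // Pair the current word with at most `window` preceding words.
        let start = buffer.len().saturating_sub(window);
        for &left in &buffer[start..] {
            pairs.push((left, word));
        }
        buffer.push(word);
    }
    pairs
}
```

With the input `["good", "news", "!", "bad", "weather"]` and a window of 2, "bad" gets no left context because the "!" emptied the buffer, and "!" itself never appears in any pair.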

bunny-therapist commented 2 weeks ago

@xamgore

bunny-therapist commented 2 weeks ago

@xamgore - just making sure you saw this. Maybe you found a simpler way to solve this?

xamgore commented 2 weeks ago

The Python code is a mess, really no way for me to comprehend it right now 😄 Maybe on the weekend.

I've seen that candidate_filtering also throws punctuation words out.

bunny-therapist commented 2 weeks ago

Yeah, but the vocabulary is apparently different from the candidates. The "buffer/buffer_words" vec we have ends up with elements like "!" and "?", which then affect ctx.0/ctx.1 and thus wl and wr, and in turn frequency and relatedness.

xamgore commented 2 weeks ago

Hm, right. Can the candidate words have punctuation inside like abc!?def? I've never dealt with unicode segmentation.

bunny-therapist commented 2 weeks ago

Based on the Python code, it skips words that are composed entirely of punctuation. So "abc!?def" would be ok, but "!?" would not.
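As a rough sketch of that check, assuming the punctuation set is available as a slice of chars (the name `is_punctuation_only` and the slice parameter are made up for this example, not the actual yake-rust API):

```rust
/// Returns true only when every char of `word` is in `punctuation`.
/// Hypothetical helper; an empty word is not treated as punctuation here.
fn is_punctuation_only(word: &str, punctuation: &[char]) -> bool {
    !word.is_empty() && word.chars().all(|c| punctuation.contains(&c))
}
```

Called with `&['!', '?']` as the punctuation set, this returns false for "abc!?def" (it contains non-punctuation chars) and true for "!?". It iterates over `char`s rather than grapheme clusters, which should be enough as long as the punctuation set itself is made of single chars.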