quesurifn / yake-rust

MIT License
6 stars 5 forks source link

Bug in casing #19

Open bunny-therapist opened 3 weeks ago

bunny-therapist commented 3 weeks ago

The yake-rust crate counts tf_a and tf_u here:

            for occurrence in word {
                if occurrence.word.chars().all(|c| c.is_uppercase()) && occurrence.word.len() > 1 {
                    cand.tf_a += 1.0;
                }
                if occurrence.word.chars().nth(0).unwrap_or(' ').is_uppercase() && occurrence.shift != occurrence.shift_offset {
                    cand.tf_u += 1.0;
                }
            }

LIAAD yake has the same here:

            if len(word) == len([c for c in word if c.isupper()]):
                return "a"  # This corresponds to incrementing tf_a
            if len([c for c in word if c.isupper()]) == 1 and len(word) > 1 and word[0].isupper() and i > 0:
                return "n"  # This corresponds to incrementing tf_u

Just by looking at the case restrictions, it seems like yake-rust increments tf_uif the first character is uppercase, whereas LIAAD/yake increments if only the first character is uppercase (with an additional restriction i>0 which means "not the first word in a sentence" - the i is the index in the sentence - maybe the shift offset check is equivalent to this, I do not know).

This means that yake-rust increments for e.g. "TopCoder" whereas LIAAD/yake does not. Because LIAAD/yake sees that there are multiple uppercase letters ("T" and "C"), it skips it; yake-rust just cares that the "T" is uppercase.

This debug print shows that tf_u (called tf_n in LIAAD/yake, but aligned here) differs, thus causing casing to also differ:

// LIAAD/yake:
TopCoder: tf: 1.0, tf_a: 0.0, tf_u: 0.0, casing: 0.0, position: 0.0940478276166991, frequency: 1.0, different: 1.0, relatedness: 1.5
// yake-rust:
topcoder: tf: 1, tf_a: 0, tf_u: 1, casing: 1, position: 0.0940478276166991, frequency: 1, different: 1, relatedness: 1.5
bunny-therapist commented 3 weeks ago

@xamgore

bunny-therapist commented 3 weeks ago

Ok, I added one line to the tf_u condition (this can probably be written in a nicer way?):

                if occurrence.word.chars().nth(0).unwrap_or(' ').is_uppercase()

                    // New:
                    && occurrence.word.chars().filter_map(
                        |c| if c.is_uppercase() { Some(c) } else { None }
                        ).collect::<Vec<char>>().len() == 1

                    && occurrence.shift != occurrence.shift_offset
                {
                    cand.tf_u += 1.0;
                }

and that aligned the results for "TopCoder":

// LIAAD/yake:
TopCoder: tf: 1.0, tf_a: 0.0, tf_u: 0.0, casing: 0.0, position: 0.0940478276166991, frequency: 1.0, different: 1.0, relatedness: 1.5
// yake-rust:
topcoder: tf: 1, tf_a: 0, tf_u: 0, casing: 0, position: 0.0940478276166991, frequency: 1, different: 1, relatedness: 1.5
xamgore commented 3 weeks ago

I completely agree with the reasoning. Curious, why LIAAD have decided not to mark such tokens as uppercase? The paper makes no accent on this, just “begins with an uppercase, excluding the beginning of sentences”.

Kyle's approach is still sound, though. PayPal, MasterCard are as important as Paypal or Mastercard.

bunny-therapist commented 3 weeks ago

I don't know exactly why. The LIAAD/yake repo does not seem "blameable" - most of the code was developed offline, then uploaded as is and not touched.

It can't be because of words like "I" because they also check if length > 1 - and that handles that.

Could it have something to do with abbreviations?

I suppose it is up to Kyle whether or not we should do our own thing here or try to align everything with LIAAD/yake as much as possible.

Personally, I would vote to align it. I am trying to replace LIAAD/yake in our projects with this, and I bet there are others who would love a drop-in replacement in rust (example). I will make python bindings so it can be used as a drop-in replacement that is much faster and concurrent than LIAAD/yake. (That is why I got into this, I was making python bindings to replace LIAAD, then I noticed I got different results.)

If there are differences, one is stuck arguing that our results make more sense. (Though I do agree with you about the logic, albeit I have no clue why LIAAD did what they did.)

bunny-therapist commented 3 weeks ago

If we do it like LIAAD/yake, we should at least write a comment about it.

xamgore commented 3 weeks ago

Ok, let it be just another field at Config 🤷🏻‍♂️.

Btw, probably Kyle could make us maintainers, so we could put python bindings right in the repo. Wasm too.

bunny-therapist commented 3 weeks ago

Is there any benefit to having everything in the same repo though? I suppose automatic releases of everything together, but I worry that the project structure will get a bit complicated. Especially if one wants to add more bindings.

bunny-therapist commented 3 weeks ago

Being able to release a new crate and a new python package together would be nice though. But I suppose it would still have to be a bit staggered, since one wants to publish to pypi after publishing the crate, since the python package would depend on the crate.

xamgore commented 3 weeks ago

Can't say for sure, probably there is even a Github action exactly for such a pipeline. Do pyo3 bindings impose publishing a separate yake-rust-to-python crate?

bunny-therapist commented 3 weeks ago

No. You publish a python package that has the rust crate as a dependency. The compiled binary gets built into any wheels built, but if there is no wheel matching the user's system, installing the python package will download the crate and compile it.

The python package contains python code, as well as some additional rust code that calls the crate, just with wrapping functions that can be decorated with macros to generate bindings. (I do not recommend adding these binding macros to the original rust code - you will likely end up with issues related to python garbage collection, threading, memory management etc. I intend to wrap them with additional rust code.)

bunny-therapist commented 3 weeks ago

Or rather, I have already done that, provisionally. In my local project I have tests which run both yake-rust thru the bindings and LIAAD/yake, with same config, and then compares the result.

That is how I am debugging this stuff, mostly.