Open bunny-therapist opened 3 weeks ago
@xamgore
Ok, I added one line to the tf_u
condition (this can probably be written in a nicer way?):
if occurrence.word.chars().nth(0).unwrap_or(' ').is_uppercase()
// New:
&& occurrence.word.chars().filter_map(
|c| if c.is_uppercase() { Some(c) } else { None }
).collect::<Vec<char>>().len() == 1
&& occurrence.shift != occurrence.shift_offset
{
cand.tf_u += 1.0;
}
and that aligned the results for "TopCoder":
// LIAAD/yake:
TopCoder: tf: 1.0, tf_a: 0.0, tf_u: 0.0, casing: 0.0, position: 0.0940478276166991, frequency: 1.0, different: 1.0, relatedness: 1.5
// yake-rust:
topcoder: tf: 1, tf_a: 0, tf_u: 0, casing: 0, position: 0.0940478276166991, frequency: 1, different: 1, relatedness: 1.5
I completely agree with the reasoning. Curious, why LIAAD have decided not to mark such tokens as uppercase? The paper makes no accent on this, just “begins with an uppercase, excluding the beginning of sentences”.
Kyle's approach is still sound, though. PayPal
, MasterCard
are as important as Paypal
or Mastercard
.
I don't know exactly why. The LIAAD/yake repo does not seem "blameable" - most of the code was developed offline, then uploaded as is and not touched.
It can't be because of words like "I" because they also check if length > 1 - and that handles that.
Could it have something to do with abbreviations?
I suppose it is up to Kyle whether or not we should do our own thing here or try to align everything with LIAAD/yake as much as possible.
Personally, I would vote to align it. I am trying to replace LIAAD/yake in our projects with this, and I bet there are others who would love a drop-in replacement in rust (example). I will make python bindings so it can be used as a drop-in replacement that is much faster and concurrent than LIAAD/yake. (That is why I got into this, I was making python bindings to replace LIAAD, then I noticed I got different results.)
If there are differences, one is stuck arguing that our results make more sense. (Though I do agree with you about the logic, albeit I have no clue why LIAAD did what they did.)
If we do it like LIAAD/yake, we should at least write a comment about it.
Ok, let it be just another field at Config
🤷🏻♂️.
Btw, probably Kyle could make us maintainers, so we could put python bindings right in the repo. Wasm too.
Is there any benefit to having everything in the same repo though? I suppose automatic releases of everything together, but I worry that the project structure will get a bit complicated. Especially if one wants to add more bindings.
Being able to release a new crate and a new python package together would be nice though. But I suppose it would still have to be a bit staggered, since one wants to publish to pypi after publishing the crate, since the python package would depend on the crate.
Can't say for sure, probably there is even a Github action exactly for such a pipeline. Do pyo3 bindings impose publishing a separate yake-rust-to-python
crate?
No. You publish a python package that has the rust crate as a dependency. The compiled binary gets built into any wheels built, but if there is no wheel matching the user's system, installing the python package will download the crate and compile it.
The python package contains python code, as well as some additional rust code that calls the crate, just with wrapping functions that can be decorated with macros to generate bindings. (I do not recommend adding these binding macros to the original rust code - you will likely end up with issues related to python garbage collection, threading, memory management etc. I intend to wrap them with additional rust code.)
Or rather, I have already done that, provisionally. In my local project I have tests which run both yake-rust thru the bindings and LIAAD/yake, with same config, and then compares the result.
That is how I am debugging this stuff, mostly.
The yake-rust crate counts
tf_a
andtf_u
here:LIAAD yake has the same here:
Just by looking at the case restrictions, it seems like yake-rust increments
tf_u
if the first character is uppercase, whereas LIAAD/yake increments if only the first character is uppercase (with an additional restrictioni>0
which means "not the first word in a sentence" - thei
is the index in the sentence - maybe the shift offset check is equivalent to this, I do not know).This means that yake-rust increments for e.g. "TopCoder" whereas LIAAD/yake does not. Because LIAAD/yake sees that there are multiple uppercase letters ("T" and "C"), it skips it; yake-rust just cares that the "T" is uppercase.
This debug print shows that
tf_u
(calledtf_n
in LIAAD/yake, but aligned here) differs, thus causing casing to also differ: