finnhacks42 opened 2 years ago
This issue also leads to inconsistent results in trigram models saved & reloaded from disk.
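For reference, a minimal sketch of that round trip, assuming gensim 4.x (the corpus, thresholds, and file name are illustrative, not from the original report):

```python
from gensim.models.phrases import Phrases

sentences = [["the", "chief", "executive", "officer", "resigned"]] * 20

bigram = Phrases(sentences, min_count=1, threshold=0.1)
trigram = Phrases(bigram[sentences], min_count=1, threshold=0.1)

trigram.save("trigram.pkl")
reloaded = Phrases.load("trigram.pkl")

# Per this report, the two transforms can disagree for trigram models:
print(trigram[bigram[sentences[0]]])
print(reloaded[bigram[sentences[0]]])
```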
Thanks for reporting. Are you interested in figuring out the cause?
All code lives in the phrases module, and is fairly straightforward.
Yes, the issue is that phrases are stored in source_vocab as delimiter-joined strings, e.g. 'chief_executive'. When you fit higher-order models, these composite words get joined to other words, e.g. 'chief_executive_officer'. However, export_phrases just splits on '_' and assumes that the first token is worda and the last is wordb, then computes score(worda, wordb). Taking the example of 'chief_executive_officer', it would compute score('chief', 'officer') when it should be computing score('chief_executive', 'officer').
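To make the failure concrete, here is a sketch of that split logic (illustrative only, not the actual gensim source):

```python
phrase = "chief_executive_officer"    # key stored as one delimiter-joined string
tokens = phrase.split("_")            # ['chief', 'executive', 'officer']
worda, wordb = tokens[0], tokens[-1]  # 'chief', 'officer'
# score(worda, wordb) becomes score('chief', 'officer'), but the pair that
# was actually joined at this stage is ('chief_executive', 'officer').
```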
A hacky workaround is to use a different delimiter for each stage, but then you end up with phrasegram keys like 'chief-executive_officer'.
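A sketch of that workaround, assuming gensim 4.x where delimiter is a plain string parameter (corpus and thresholds are illustrative):

```python
from gensim.models.phrases import Phrases

sentences = [["chief", "executive", "officer"]] * 10
bigram = Phrases(sentences, min_count=1, threshold=0.1, delimiter="-")
trigram = Phrases(bigram[sentences], min_count=1, threshold=0.1, delimiter="_")
# Phrasegram keys now mix delimiters, e.g. 'chief-executive_officer'.
```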
I think the easiest fix is probably to switch to making the phrase keys tuples. I should have time to work on it and put in a PR next week.
Looking a bit further into the code, changing the phrase keys to tuples would not be completely trivial, as this is a component of the model that is serialized on save, so some more backwards-compatibility code would be needed. I also noted this comment: # 3.8 => 4.0: phrasegram keys are strings, not tuples with bytestrings, which suggests the code has been refactored from tuples to strings in the past.
Do you have any thoughts on whether I should refactor the vocab keys back to being tuples? An alternative would be to update the documentation to make it clear that higher-order models must use different delimiters, and to write a new class to simplify the construction & use of higher-order models.
Thanks for looking into this. IIRC we went for strings to save on RAM; tuples introduce a lot of memory overhead. These "phrases" models are memory-hungry by the nature of what they do (but see also #1654).
But if freeze() is broken then that's not acceptable.
Taking the example of 'chief_executive_officer' it would compute score('chief','officer') when it should be computing score('chief_executive','officer').
Any idea why? I don't remember how all this works any more :( Can't we simply split on all '_'? Or calculate the score from full subcomponents? Or was there some algorithmic problem with that? I imagine there's a reason why it works as it does.
I guess the fundamental problem is that if you have 'chief_executive_officer', you don't know whether the underlying tokens are 'chief_executive' and 'officer' or 'chief' and 'executive_officer'. You could score on all sub-components (after stripping out connector words), but that would mean new scoring functions that work flexibly on 2 or more words. For example, training a Phrases model over existing bigrams can yield valid bigrams (words newly paired because the previous bigramming has changed their individual frequencies), trigrams (a bigram paired with a word) & 4-grams (two bigrams paired).
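The ambiguity is easy to see by enumerating the possible join points (candidate_splits is a hypothetical helper, not part of gensim):

```python
def candidate_splits(phrase, delimiter="_"):
    """Yield every possible (worda, wordb) reading of a joined key."""
    parts = phrase.split(delimiter)
    for i in range(1, len(parts)):
        yield delimiter.join(parts[:i]), delimiter.join(parts[i:])

print(list(candidate_splits("chief_executive_officer")))
# [('chief', 'executive_officer'), ('chief_executive', 'officer')]
```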
The issue with freeze() will occur whenever the tokens you are trying to build bigram models over contain the bigram delimiter. Creating trigrams by repeated application of Phrases with the same delimiter can be seen as a special case of this. The issue could also occur if you trained a plain bigram model over tokens that contained the delimiter (for example, if you used '-' as the delimiter and your tokens contained hyphenated words). This is probably OK, though, since many models assume you have removed any special characters first (although it should probably be documented, and possibly throw a warning/error). If we did that, then training a higher-order model with the same delimiter as the first would give that warning/error.
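A sketch of that plain-bigram collision, assuming gensim 4.x (the corpus is illustrative):

```python
from gensim.models.phrases import Phrases

# The tokens already contain the chosen delimiter '-', so the original
# hyphens and the joins the model adds become indistinguishable.
sentences = [["state-of-the-art", "speech", "recognition"]] * 10
model = Phrases(sentences, min_count=1, threshold=0.1, delimiter="-")
# A learned key such as 'state-of-the-art-speech' can no longer be split
# back into its two source tokens unambiguously.
```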
Problem description
Applying a trigram phrase model yields different results after freeze().
Code to reproduce
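A minimal sketch of the setup described above, assuming gensim 4.x (corpus and thresholds are illustrative):

```python
from gensim.models.phrases import Phrases

sentences = [["the", "chief", "executive", "officer", "resigned"]] * 20

# Stage 1 learns 'chief_executive'; stage 2, trained on the bigrammed
# corpus, learns 'chief_executive_officer'.
bigram = Phrases(sentences, min_count=1, threshold=0.1)
trigram = Phrases(bigram[sentences], min_count=1, threshold=0.1)

frozen = trigram.freeze()
sent = bigram[sentences[0]]
print(trigram[sent])  # live model
print(frozen[sent])   # frozen model; differs, per this report
```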
Output
Further Info
I think the issue lies in export_phrases. When split is called, it cannot distinguish between the '_'s added by the bigram model and those added by the trigram model.
Versions