piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim

Freezing Trigram Phrase models yields inconsistent results #3326

Open finnhacks42 opened 2 years ago

finnhacks42 commented 2 years ago

Problem description

Applying a trigram Phrases model yields different results after freeze().

Code to reproduce

import gensim
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

def pretty_print_articles(articles, name):
    print(name)
    print("-------------")
    for a in articles:
        print(" ".join(a))
    print("-------------")

# show version information
print("Gensim version:",gensim.__version__)

# Make demo data
articles = [
    'high risk of default raises concern at the central bank of brazil',
    'high school students demand raises in pay outside the central bank of brazil',
    'concern about rising prices reducing investment',
    'the school is at high risk',
    'simple noconnector trigram',
    'simple noconnector trigram'
]
articles = [a.split(" ") for a in articles]

# build bigram model
bigram = Phrases(articles, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# apply bigram model
bigram_articles = [bigram[a] for a in articles]

# build trigram model
trigram = Phrases(bigram_articles, min_count=1, threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS)

# apply trigram model
trigram_articles = [trigram[a] for a in bigram_articles]

# verify that bigrams and trigrams working as expected
pretty_print_articles(bigram_articles, 'Bigrammed')
pretty_print_articles(trigram_articles, 'Trigrammed')

# freeze models
bigram_f = bigram.freeze()
trigram_f = trigram.freeze()

# check that frozen models give the same results
bigram_articles_f = [bigram_f[a] for a in articles]
print('bigram models consistent?:', bigram_articles == bigram_articles_f)

# Problem here!
trigram_articles_f = [trigram_f[a] for a in bigram_articles]
print('trigram articles consistent?:', trigram_articles == trigram_articles_f)

print("Lifecycle events")
print("----------------")
print("bigram",bigram.lifecycle_events)
print("trigram:",trigram.lifecycle_events)
print("bigram_frozen",bigram_f.lifecycle_events)
print("trigram_frozen:",trigram_f.lifecycle_events)

Output
Gensim version: 4.1.2
Bigrammed
-------------
high_risk of default raises concern at the central_bank of brazil
high school students demand raises in pay outside the central_bank of brazil
concern about rising prices reducing investment
the school is at high_risk
simple_noconnector trigram
simple_noconnector trigram
-------------
Trigrammed
-------------
high_risk of default raises concern at the central_bank_of_brazil
high school students demand raises in pay outside the central_bank_of_brazil
concern about rising prices reducing investment
the school is at high_risk
simple_noconnector_trigram
simple_noconnector_trigram
-------------
bigram models consistent?: True
trigram articles consistent?: False
Lifecycle events
----------------
bigram [{'msg': 'built Phrases<45 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.736409', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
trigram: [{'msg': 'built Phrases<40 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749474', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
bigram_frozen [{'msg': 'exported FrozenPhrases<5 phrases, min_count=1, threshold=1> from Phrases<45 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749598', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]
trigram_frozen: [{'msg': 'exported FrozenPhrases<0 phrases, min_count=1, threshold=1> from Phrases<40 vocab, min_count=1, threshold=1, max_vocab_size=40000000> in 0.00s', 'datetime': '2022-04-15T06:55:09.749631', 'gensim': '4.1.2', 'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]', 'platform': 'Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17', 'event': 'created'}]

Further Info

I think the issue lies in export_phrases. When split is called, it cannot distinguish between underscores added by the bigram model and underscores added by the trigram model.

print("Bigram phrases:",bigram.export_phrases())
print("Trigram phrases",trigram.export_phrases())

Gives:

Bigram phrases: {'high_risk': 7.5, 'central_bank': 11.25, 'bank_of_brazil': 11.25, 'simple_noconnector': 11.25, 'noconnector_trigram': 11.25}
Trigram phrases {}
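
For reference, here is a simplified sketch of the failing path (illustrative only, with made-up counts; the real logic lives in export_phrases in gensim/models/phrases.py). Every vocab key containing the delimiter is split on all underscores, and only the first and last pieces are looked up:

delimiter = '_'
# A trigram model's vocab holds already-bigrammed tokens (hypothetical counts):
vocab = {'central_bank': 2, 'of': 4, 'brazil': 2, 'central_bank_of_brazil': 2}

exported = {}
for key in vocab:
    unigrams = key.split(delimiter)
    if len(unigrams) < 2:
        continue  # a plain word, not a phrase candidate
    worda, wordb = unigrams[0], unigrams[-1]  # 'central' and 'brazil' for both compound keys
    if vocab.get(worda, 0) == 0 or vocab.get(wordb, 0) == 0:
        continue  # 'central' never occurs alone in the trigram vocab, so every key is skipped
    # ...score(worda, wordb) would be computed and thresholded here...
print(exported)  # {} -- matching the empty 'Trigram phrases' export above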

Versions

Linux-5.15.5-76051505-generic-x86_64-with-glibc2.17
Python 3.8.13 (default, Mar 28 2022, 11:38:47) 
[GCC 7.5.0]
Bits 64
NumPy 1.21.5
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1
finnhacks42 commented 2 years ago

This issue also leads to inconsistent results in trigram models saved & reloaded from disk.

piskvorky commented 2 years ago

Thanks for reporting. Are you interested in figuring out the cause?

All the code lives in the phrases module and is fairly straightforward.

finnhacks42 commented 2 years ago

Yes, the issue is that phrases are stored in source_vocab as delimiter-joined strings, i.e. 'chief_executive'. When you fit higher-order models, these composite words get joined to other words, i.e. 'chief_executive_officer'. However, export_phrases just splits on '_', assumes the first token is worda and the last is wordb, and computes score(worda, wordb). Taking the example of 'chief_executive_officer', it would compute score('chief', 'officer') when it should be computing score('chief_executive', 'officer').

A hacky workaround is to use a different delimiter for each stage, but then you end up with phrasegram keys like 'chief-executive_officer' (sketched below).
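
A minimal sketch of that workaround, reusing the articles from the reproduction above (the delimiter choices are illustrative):

bigram = Phrases(articles, min_count=1, threshold=1, delimiter='-',
                 connector_words=ENGLISH_CONNECTOR_WORDS)
bigram_articles = [bigram[a] for a in articles]
# The trigram stage joins with '_', so its splits no longer cut through the bigram joins.
trigram = Phrases(bigram_articles, min_count=1, threshold=1, delimiter='_',
                  connector_words=ENGLISH_CONNECTOR_WORDS)
print(trigram.export_phrases())  # non-empty now, but with mixed keys like 'central-bank_of_brazil'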

I think the easiest fix is probably to switch to making the phrase keys tuples. I should have time to work on it and put in a PR next week.
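
For concreteness, a rough sketch of what tuple keys could look like (hypothetical, not the current gensim API):

# Component boundaries become explicit, so no string splitting is needed:
vocab = {
    ('central', 'bank'): 2,               # first-pass bigram candidate
    ('central_bank', 'of', 'brazil'): 2,  # second pass: a phrase component plus unigrams
}
# export_phrases could then score key[0] against key[-1]:
# score('central_bank', 'brazil') rather than the incorrect score('central', 'brazil')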

finnhacks42 commented 2 years ago

Looking a bit further into the code, changing the phrase keys to tuples would not be completely trivial, as this is a component of the model that is serialized on save, so some backwards-compatibility code would be needed. I also noted the comment # 3.8 => 4.0: phrasegram keys are strings, not tuples with bytestrings, which suggests the code has been refactored from tuples to strings in the past.

Do you have any thoughts on whether I should refactor the vocab keys back to being tuples? An alternative would be to update the documentation to make it clear that higher-order models must use different delimiters, and to write a new class that simplifies the construction & use of higher-order models.

piskvorky commented 2 years ago

Thanks for looking into this. IIRC we went for strings to save on RAM; tuples introduce a lot of memory overhead. These "phrases" models are memory-hungry by the nature of what they do (but see also #1654).

But if freeze() is broken then that's not acceptable.

> Taking the example of 'chief_executive_officer' it would compute score('chief','officer') when it should be computing score('chief_executive','officer').

Any idea why? I don't remember how all this works any more :( Can't we simply split on all '_'? Or calculate the score from the full subcomponents? Or was there some algorithmic problem with that? I imagine there's a reason why it works as it does.


finnhacks42 commented 2 years ago

I guess the fundamental problem is that if you have 'chief_executive_officer', you don't know whether the underlying tokens are 'chief_executive' and 'officer' or 'chief' and 'executive_officer'. You could score on all sub-components (after stripping out connector words), but that would require new scoring functions that work flexibly on 2 or more words. For example, training a Phrases model over existing bigrams can yield valid bigrams (words newly paired because the previous bigramming has changed their individual frequencies), trigrams (a bigram paired with a word) & 4-grams (two bigrams paired together).
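
In code form, the ambiguity is simply that both segmentations serialize to the same key:

# Both candidate segmentations collapse to one identical string:
'_'.join(['chief_executive', 'officer']) == '_'.join(['chief', 'executive_officer'])  # True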

The issue with freeze() will occur whenever the tokens you are building a bigram model over already contain the bigram delimiter. Creating trigrams by repeated application of Phrases with the same delimiter is just a special case of this. The issue would also occur if you trained a plain bigram model over tokens that contained the delimiter, for example if you used '-' as the delimiter and your tokens contained hyphenated words (see the sketch below). This is probably OK, though, since many models assume you have removed any special characters first (although it should probably be documented, and possibly throw a warning/error). If we did that, then training a higher-order model with the same delimiter as the first would trigger that warning/error.
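
Here is a minimal sketch of that plain-bigram failure mode (hypothetical data; assumes gensim 4.x with the default scorer):

from gensim.models.phrases import Phrases

# Tokens that already contain the chosen delimiter ('-') trip export_phrases()
# in the same way the repeated-trigram case does.
docs = [['state-of-the-art', 'model']] * 2
p = Phrases(docs, min_count=1, threshold=0.1, delimiter='-')

print(p[docs[0]])           # ['state-of-the-art-model']: the live model joins the pair
print(p.export_phrases())   # {}: the split looks up 'state' and 'model', and 'state' is not in the vocab
print(p.freeze()[docs[0]])  # ['state-of-the-art', 'model']: the frozen model has lost the phrase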