Open chaturv3di opened 2 years ago
Note that the preferred name of the `Phraser` functionality has been changed to `FrozenPhrases` for added clarity. (I'll use `FrozenPhrases` in discussion, though `Phraser` will still work in code.)
I believe the `FrozenPhrases` models should work exactly the same as the `Phrases` models (with the same threshold and other parameters that were in effect when the 'frozen' version was created), so this looks like a bug.
Perhaps this is a regression associated with more recently added modes, like the special handling of `connector_words` or the `npmi` scoring option?
It might be interesting to see the output of the stacked `FrozenPhrases` on your full toy corpus (similar to how you show that the stacked `Phrases` works as expected), rather than on just one probe sentence. Also, assuming the output of the `biphraser` and `bigrams` models is identical, does a `trigrams` (`Phrases`) around a `biphraser` (`FrozenPhrases`) work?
Do you need to use the `FrozenPhrases` in your setup, or would using the full `Phrases` model(s) be an acceptable workaround? (The `FrozenPhrases` is just an optimization, never required for any functionality.)
Problem description
I am trying to create a trigram model using the general approach:
bigrams = Phrases(text_corpus)
trigrams = Phrases(text_corpus_with_bigrams)
However, even though my `bigrams` model works well, I observe unexpected results from the `trigrams` model. In particular, `trigrams.export_phrases()` is empty, but I can still stack `trigrams[bigrams[text_corpus]]` to obtain trigrams. This is unexpected.
With `Phraser` objects, i.e. `biphraser = Phraser(bigrams)` and `triphraser = Phraser(trigrams)`, the corresponding stack `triphraser[biphraser[text_corpus]]` produces only bigrams and no trigrams.
I am unable to figure out what I'm doing wrong.
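To make the two-pass idea concrete without depending on any library, here is a simplified stand-in. It is not gensim's implementation: the function names, the toy corpus, and the threshold are hypothetical, and the score is only modelled on the Mikolov-style formula `(pair_count - min_count) * vocab_size / (count_a * count_b)`, with `vocab_size` assumed to count both single tokens and pairs.

```python
from collections import Counter


def learn_scores(sentences, min_count=1):
    """Count tokens and adjacent pairs, then score every observed pair."""
    unigrams, pairs = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        pairs.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams) + len(pairs)  # assumption: tokens plus pairs
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in pairs.items()
    }


def merge(sentence, scores, threshold, delimiter="_"):
    """Greedily join adjacent tokens whose pair score exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            out.append(delimiter.join(pair))
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


# Hypothetical toy corpus and threshold, chosen only for illustration.
corpus = [
    ["williams", "v", "raducanu", "was", "a", "close", "match"],
    ["fans", "watched", "williams", "v", "raducanu", "on", "tv"],
    ["williams", "v", "raducanu", "starts", "at", "noon"],
    ["the", "match", "was", "on", "tv", "at", "noon"],
]
threshold = 1.0

# Pass 1 learns bigrams; pass 2 is trained on the bigram-merged corpus.
bigram_scores = learn_scores(corpus)
corpus_with_bigrams = [merge(s, bigram_scores, threshold) for s in corpus]
trigram_scores = learn_scores(corpus_with_bigrams)
corpus_with_trigrams = [merge(s, trigram_scores, threshold) for s in corpus_with_bigrams]

# Phrases the second pass would report, keyed by the tokens it actually saw.
learned = {"_".join(p): s for p, s in trigram_scores.items() if s > threshold}
print(learned)
print(corpus_with_trigrams[0])
```

In this sketch the second pass keys its scores by the tokens it actually saw, such as `("williams_v", "raducanu")`, so it can both apply and report the trigram. Whether gensim's `export_phrases()` treats tokens that already contain the delimiter the same way is worth checking, though that is a guess rather than a confirmed diagnosis.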
Steps/code/corpus to reproduce
I expect `trigrams` to contain `williams_v_raducanu` in the phrases it learns. However, we have the following outputs:
However, for some reason, identification of trigrams still works (somewhat) if I stack the Phrases models.
Of course, I get different outputs when I train the model with different thresholds, but it is hard to decipher the behaviour/quality of any given `trigrams` model in the absence of phrase scores.
So I have a couple of questions.
`Phrases` model, i.e. `trigrams[bigrams[sentence]]`, for my work?
Versions