Stacking Phrases objects helps detect trigrams but stacking Phraser objects doesn't

Problem description

I am trying to create a trigram model using the general approach:

Train a bigram model: bigrams = Phrases(text_corpus)
Then train a trigram model: trigrams = Phrases(text_corpus_with_bigrams)

However, even though my bigrams model works well, I observe unexpected results from the trigrams model. In particular,

The result of trigrams.export_phrases() is empty, but I can still stack trigrams[bigrams[text_corpus]] to obtain trigrams. This is unexpected.
Upon instantiating Phraser objects, i.e. biphraser = Phraser(bigrams) and triphraser = Phraser(trigrams), the corresponding stack triphraser[biphraser[text_corpus]] produces only bigrams and no trigrams.

I am unable to figure out what I'm doing wrong.

Steps/code/corpus to reproduce

from gensim.models.phrases import Phrases, Phraser, ENGLISH_CONNECTOR_WORDS

sentences = [
    "everyone was excited about williams v raducanu".split(),
    "they were excited to see who wins williams v raducanu".split(),
    "did you get a chance to watch williams v raducanu".split(),
    "i got an apple watch yesterday so that i don't miss williams v raducanu".split(),
    "where did you catch williams v raducanu".split(),
    "we thought williams v raducanu was brilliant".split()
]

bigrams = Phrases(sentences, min_count=1, threshold=0.5, scoring='npmi', connector_words=ENGLISH_CONNECTOR_WORDS)

biphraser = Phraser(bigrams)
trigrams = Phrases(biphraser[sentences], min_count=1, threshold=0.25, scoring='npmi', connector_words=ENGLISH_CONNECTOR_WORDS)

I expect trigrams to contain williams_v_raducanu in the phrases it learns. However, we have the following outputs:

# No phrases to export
>>> trigrams.export_phrases()
{}

# Trying the same thing, but differently
>>> triphraser = Phraser(trigrams)
>>> triphraser.phrasegrams
{}

# So understandably, stacking Phraser objects doesn't work (replays biphraser output)
>>> triphraser[biphraser["everyone was excited about williams v raducanu".split()]]
['everyone_was', 'excited_about', 'williams_v', 'raducanu']

However, for some reason, identification of trigrams still works (somewhat) if I stack the Phrases models.

# Stacking Phrases objects miraculously works...
>>> for x in trigrams[bigrams[sentences]]:
...    print(x)
['everyone_was_excited_about', 'williams_v_raducanu']
['they_were_excited_to_see', 'who_wins_williams_v', 'raducanu']
['did_you_get_a_chance', 'to', 'watch_williams_v', 'raducanu']
['i_got_an_apple_watch', 'yesterday_so_that_i', "don't_miss_williams_v", 'raducanu']
['where_did_you_catch', 'williams_v_raducanu']
['we_thought_williams_v', 'raducanu_was_brilliant']

Of course, I get different outputs when I train the model with different thresholds, but it is hard to decipher the behaviour/quality of any given trigrams model in the absence of phrase scores.
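As a rough aid for reasoning about thresholds, the snippet below hand-computes a textbook NPMI score, ln(p(a,b) / (p(a) p(b))) / -ln(p(a,b)), for a candidate pair in the toy corpus. It is only a sketch: the helper name npmi is hypothetical, and it assumes plain unigram and adjacent-pair counts over the raw sentences, which may not match how gensim counts internally (for example around connector words or min_count).

```python
from collections import Counter
from math import log

sentences = [
    "everyone was excited about williams v raducanu".split(),
    "they were excited to see who wins williams v raducanu".split(),
    "did you get a chance to watch williams v raducanu".split(),
    "i got an apple watch yesterday so that i don't miss williams v raducanu".split(),
    "where did you catch williams v raducanu".split(),
    "we thought williams v raducanu was brilliant".split(),
]


def npmi(word_a, word_b, corpus):
    """Textbook NPMI for the adjacent pair (word_a, word_b).

    Hypothetical helper for illustration; not gensim's own scorer.
    Returns -inf when the pair never occurs.
    """
    unigram_counts = Counter(w for sent in corpus for w in sent)
    pair_counts = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    total = sum(unigram_counts.values())

    pair_count = pair_counts[(word_a, word_b)]
    if pair_count == 0:
        return float("-inf")

    p_a = unigram_counts[word_a] / total
    p_b = unigram_counts[word_b] / total
    p_ab = pair_count / total
    return log(p_ab / (p_a * p_b)) / -log(p_ab)


# "williams" and "v" always occur together, so the score is close to 1.0.
score_williams_v = npmi("williams", "v", sentences)
# "excited" also appears before "to", so this pair scores lower.
score_excited_about = npmi("excited", "about", sentences)
# A pair that never occurs in this order gets -inf.
score_unseen = npmi("raducanu", "williams", sentences)

print(score_williams_v, score_excited_about, score_unseen)
```

Comparing such hand-computed values against the threshold passed to Phrases gives a sanity check on which pairs should be merged at the bigram stage.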

So I have a couple of questions.

Is this the expected behaviour, and is it just my understanding/expectations that are off?
If this is not the expected behaviour, is there any advice you could offer, at least for the short term, on how I can reliably use the stacked Phrases models, i.e. trigrams[bigrams[sentence]], for my work?

Versions

Linux-4.19.0-21-cloud-amd64-x86_64-with-debian-10.12
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]
Bits 64
NumPy 1.19.5
SciPy 1.7.3
gensim 4.2.0
FAST_VERSION 0

Note that the preferred name of the Phraser functionality has been changed to FrozenPhrases for added clarity. (I'll use FrozenPhrases in discussion, though Phraser will still work in code.)

I believe the FrozenPhrases models should work exactly the same as the Phrases models they were created from (using the threshold and other parameters in effect when the 'frozen' version was created), so this looks like a bug.

Perhaps this is a regression associated with more recently added modes, like the special handling of connector_words or the npmi scoring option?

It might be interesting to see the output of stacked-FrozenPhrases on your full toy corpus (similar to how you show the stacked Phrases works-as-expected), rather than on just one probe sentence. Also, assuming the output of the biphraser and bigrams models is identical, does a trigrams (Phrases) around a biphraser (FrozenPhrases) work?

Do you need to use the FrozenPhrases in your setup, or would using the full Phrases model(s) be an acceptable workaround? (The FrozenPhrases is just an optimization, never required for any functionality.)
