piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Model.Phrases - Specify what is considered a MWE component/word #3334

Open ambs opened 2 years ago

ambs commented 2 years ago

Problem description

When using the Phrases model, words and punctuation are treated alike. The corpus can be cleaned beforehand, but that destroys corpus structure that is useful for other tasks. Just as it is possible to specify a list of connector words (ENGLISH_CONNECTOR_WORDS), it would be nice to be able to exclude other tokens from being part of an MWE.
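For reference, here is roughly how the existing connector-words hook is used today (a minimal illustration with made-up tokenized sentences; punctuation is kept as separate tokens, which is exactly the situation this issue is about):

    from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

    # Tokenized sentences; punctuation kept to preserve the corpus structure.
    sentences = [
        ["He", "was", "present", "at", "the", "European", "Commission", "."],
        ["There", "was", "a", "lot", "of", "people", "."],
    ]

    # connector_words lets frequent function words appear inside a phrase
    # without affecting its scoring; there is no analogous hook for excluding
    # tokens such as "." from phrases altogether.
    phrases = Phrases(sentences, min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS)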

Possible solutions

If you feel this is something valuable to gensim, I am happy to provide a PR, just need to know what solution you prefer:

  1. Allow the user to specify, a priori, the complete vocabulary. I really do not like this idea, but it is a possibility.
  2. Allow the user to specify a function that, given a token, returns a bool saying whether that token can be part of an MWE (see the sketch after this list).
  3. Add extra parameters to the scoring functions, so that they can score 0 if any of the words should not be taken into account (it works, but I do not like it either).
  4. Add a regexp that decides whether a token is a word (I would use something like: if the token matches [!?.:;,#|0-9/\\\]\[{}()], it is discarded).
  5. Any other option you think best.
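For concreteness, option 2 might look roughly like the sketch below. This is purely hypothetical: the token_filter parameter does not exist in Phrases today, corpus stands for any iterable of token lists, and the snippet only illustrates the shape of the proposed API.

    from gensim.models.phrases import Phrases

    def is_word(token):
        """Return True if this token may take part in an MWE."""
        return token.isalpha()

    # HYPOTHETICAL: `token_filter` is the proposed parameter, not an existing one.
    phrases = Phrases(corpus, token_filter=is_word)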

Once we have a roadmap, let me know, and I will prepare a PR, which we can then polish.

Cheers

gojomo commented 2 years ago

Phrases takes a sequence of lists-of-tokens.

It's completely up to the user what's in those lists-of-tokens, and most projects will do some project-specific preprocessing to ensure the units (possibly including punctuation) most useful to their purposes are retained.

If extra filters are desired, my sense is that it's better to apply them outside of Phrases, in a generic manner that allows the same filters to be reused elsewhere if desired. As far as I can tell, all the proposed functionality can be done in a few lines as a wrapper around any raw corpus. For the Phrases class, which only needs one pass over the corpus, this can just be a generator. (For other models that need multiple iterations, like Word2Vec, the wrapper would have to be a little more complicated; see the sketch after the examples below.)

For example:

  1. if corpus is the original corpus, & restricted_vocab the set of acceptable tokens:

    filtered_corpus = ([token for token in item if token in restricted_vocab] for item in corpus)
  2. if filter_func is the desired token-level test:

    filtered_corpus = ([token for token in item if filter_func(token)] for item in corpus)
  3. to apply the proposed regular-expression for rejection:

    import re
    pattern = re.compile(r"[!?.:;,#|0-9/\\\]\[{}()]")
    filtered_corpus = ([token for token in item if not pattern.match(token)] for item in corpus)
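And for the multi-pass case mentioned above (e.g. Word2Vec), the wrapper only needs to be re-iterable, something along these lines (a sketch; corpus is a placeholder for any re-iterable of lists-of-tokens, and filter_func is any token -> bool predicate):

    class FilteredCorpus:
        """Re-iterable wrapper that applies filter_func on every pass over corpus."""

        def __init__(self, corpus, filter_func):
            self.corpus = corpus
            self.filter_func = filter_func

        def __iter__(self):
            for sentence in self.corpus:
                yield [token for token in sentence if self.filter_func(token)]

    # e.g. keep only purely alphabetic tokens:
    filtered_corpus = FilteredCorpus(corpus, str.isalpha)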

Unless there's some performance/discoverability/comprehension benefit to rolling this into Phrases, doing it outside seems the cleaner & better approach.

If considering this, though, I would caution that:

ambs commented 2 years ago

Dear @gojomo, thank you for taking the time to answer me.

I may be wrong about how Phrases works. Suppose I have the original sentences:

He was present at the European Commission . There was a lot of people .

If we remove punctuation, Phrases will get the sequence of tokens:

He was present at the European Commission There was a lot of people

In this way, Phrases will treat European Commission the same way it treats Commission There. Of course I expect that, probabilistically speaking, the first would occur far more often. But suppose it doesn't: the model will suggest Commission There as a multiword expression, and probably not suggest European Commission as it should.

With the original punctuation, it might happen that the suggestion is Commission . and not European Commission (ok, my example is not the best... and that will probably not happen for such a clear MWE).

On the other hand, if I had a way to tell Phrases that, whenever it asks for the PMI of (Commission, .) or (., There), the result should be 0, then in the end those pairs would not be considered MWEs.

If you are worried about performance, adding the two words to the call to the scorer function will keep the same performance for the current behavior; it only degrades if the user overrides the scorer.

Am I misunderstanding any step of this process?

piskvorky commented 2 years ago

In this way, Phrases will treat European Commission the same way it treats Commission There.

No – you pass in sentences (lists of tokens) to Phrases, not strings of space-separated tokens.
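For example (a toy illustration), the input is an iterable of already-tokenized sentences, and bigram candidates never cross the boundary between two inner lists:

    from gensim.models.phrases import Phrases

    sentences = [
        ["He", "was", "present", "at", "the", "European", "Commission", "."],
        ["There", "was", "a", "lot", "of", "people", "."],
    ]
    # ("European", "Commission") and ("Commission", ".") are counted as candidate
    # bigrams; ("Commission", "There") is not, because those two tokens never
    # occur inside the same inner list.
    phrases = Phrases(sentences, min_count=1)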

ambs commented 2 years ago

Right. So you are suggesting I break my sentences at every non-word token. That might be a solution I had not thought of. I will get back to you, probably tomorrow, as my main job is unfortunately not in NLP.

piskvorky commented 2 years ago

I don't know about all non-word tokens. But definitely split on full stops, to avoid the Commission There cross-sentence overlap from that example.
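For instance, splitting on full stops could be a tiny pre-processing step like this (a sketch; it drops the "." tokens and starts a new sub-sentence after each one):

    def split_on_full_stops(sentences):
        """Yield sub-sentences, splitting each list of tokens at '.' tokens."""
        for sentence in sentences:
            current = []
            for token in sentence:
                if token == ".":
                    if current:
                        yield current
                    current = []
                else:
                    current.append(token)
            if current:
                yield current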

ambs commented 2 years ago

Yes, it depends on what the goals are.


gojomo commented 2 years ago

If the statistics suggest ('Commission', '.') is a 'good' bigram – occurring significantly more often than the individual word frequencies would suggest – then it might benefit downstream info-retrieval, classification, or clustering steps to create a Commission_. pseudoword token. That is, it's not self-evident to me, based on aesthetics alone, that such extra constraints would offer a benefit in any real situation. Have you encountered such a situation?

If providing a differently-filtered set of texts to Phrases somehow manages to make ('Commission', 'There') look, statistically, like a good bigram (when it wouldn't have looked that way before those post-filtering artificial pairings were created), that seems to me to risk a bigger issue than the one the filtering might be solving.

So I'm not convinced this would improve the results of Phrases, except on a non-quantitative, aesthetic level – guaranteeing that certain things a person might not think of as a Multi-Word-Expression never appear – which might be deleterious in quantitative evaluations.

And, if someone has a particular corpus & set of goals where such extra filtering is proven to help, it's easy enough – and in some respects cleaner – to apply it as a separate filtering step/wrapper before passing to Phrases.

I believe your suggestion that the scoring-function see the tokens (worda, wordb), not just their counts, might enable new possibilities as well, separate from what a generic filter/token-disqualification rule could enable. But I'm not sure such possibilities would ever be necessary compared to other, simpler approaches. Still, those words could conceivably be offered to the scoring-function, and all existing scorers could ignore them fairly efficiently (so little complexity overhead would be added). So a vivid example of that working to provide a tangible benefit, with little cost to normal usage, would be welcome.
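To make that concrete, a word-aware scorer might look like the sketch below. This is purely hypothetical: as far as I recall, current scoring callables receive only count statistics, so the worda/wordb parameters are the proposed extension, not an existing API, and the punctuation set is just an example.

    from gensim.models.phrases import original_scorer

    PUNCTUATION = set("!?.:;,#|/[]{}()")

    def word_aware_scorer(worda_count, wordb_count, bigram_count,
                          len_vocab, min_count, corpus_word_count,
                          worda=None, wordb=None):
        # HYPOTHETICAL: gensim does not currently pass worda/wordb to scorers.
        if worda in PUNCTUATION or wordb in PUNCTUATION:
            return float("-inf")  # veto any bigram involving a punctuation token
        return original_scorer(worda_count, wordb_count, bigram_count,
                               len_vocab, min_count, corpus_word_count)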

(But thinking about that just made me realize another way for a user to veto unwanted bigrams: after the survey pass, iterate over the internal vocab dict, removing keys with unwanted tokens/characters/etc. Then no later operations will create such bigrams. Though this would use a bit more memory, collecting counts only to discard them, it might run faster by matching against only the unique terms in the final tally, rather than against every term, repeatedly, as it comes up over and over in the original texts.)
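A rough sketch of that post-hoc pruning, assuming gensim 4.x, where (as far as I know) the internal Phrases.vocab is a plain dict mapping unigrams and delimiter-joined bigrams to their counts; corpus again stands for any iterable of token lists:

    import re
    from gensim.models.phrases import Phrases

    unwanted = re.compile(r"[!?.:;,#|0-9/\\\]\[{}()]")

    phrases = Phrases(corpus, min_count=5)   # survey pass over the corpus as usual
    for key in list(phrases.vocab):          # keys are unigrams and "wordA_wordB" bigrams
        if unwanted.search(key):
            del phrases.vocab[key]           # later ops will never produce these bigrams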