Enable User-Defined POS Pattern, Variable-Length MWE Candidate Generation

leobeeson commented 1 year ago

Situation

Currently, mwe_utils.extract_mwes_from_sent() only generates bigram candidates, for collocations matching the following POS sequences:
- NC: [["NN", "NNS"], ["NN", "NNS"]]
- JNC: [["JJ"], ["NN", "NNS"]]
If we test the text segment New York State, we get the following results:
- With mwe_type == "NC", York State is captured (a false positive).
- With mwe_type == "JNC", New York is captured (also a false positive).
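
To make the problem concrete, here is a minimal sketch that re-creates the bigram-only behaviour described above. It is not wordview's actual code: extract_bigrams and BIGRAM_PATTERNS are hypothetical names, and the tags for New York State are hand-assigned assumptions chosen to reproduce the results listed in this issue.

```python
# Hypothetical re-creation of bigram-only candidate generation (not wordview's code).
BIGRAM_PATTERNS = {
    "NC": [["NN", "NNS"], ["NN", "NNS"]],
    "JNC": [["JJ"], ["NN", "NNS"]],
}


def extract_bigrams(tagged, mwe_type):
    """Return every adjacent word pair whose tags fit the two-slot pattern."""
    first, second = BIGRAM_PATTERNS[mwe_type]
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 in first and t2 in second
    ]


# Assumed tags: a real tagger may label these tokens differently.
tagged = [("New", "JJ"), ("York", "NN"), ("State", "NN")]

nc_candidates = extract_bigrams(tagged, "NC")
jnc_candidates = extract_bigrams(tagged, "JNC")
```

Neither call can ever return the full three-token segment, because both patterns are fixed at two slots.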

Solution

Enable mwe_utils.extract_mwes_from_sent() to extract:
- variable-length MWE candidates,
- using a user-defined POS pattern for syntactical structure.
Possibly rename method to:
- mwe_utils.generate_candidate_mwe()
- mwe_utils.generate_candidate_collocations()
- etc.

User-defined POS Pattern for Syntactical Structure

Modify the mwe_type parameter of mwe_utils.extract_mwes_from_sent(), or add an optional parameter to it, so that the method accepts a value of type list[list[str]].
The user can then pass a sequence of POS tags describing the MWE they want to extract.
- Alternatively, we can enable POS-sequence aliasing, where either an alias is passed to mwe_type or an explicit POS sequence is passed to the new parameter of type list[list[str]].
- e.g.: Currently, NC is an alias for [["NN", "NNS"], ["NN", "NNS"]] and JNC for [["JJ"], ["NN", "NNS"]].
Every element of the outer list can be a list of candidate POS tags, enabling:
- Flexibility in the POS-sequence pattern.
- Use of POS Taggers that return several probable POS tags per token, not just the most probable one.
- Use of more than one POS Tagger, pooling their POS tagging classifications.
- Compensation for POS Tagger misclassification errors.
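
A minimal sketch of how a list[list[str]] pattern could be matched when each token carries several candidate tags. The names matches_at and find_matches are hypothetical, and the tag candidates in the example are made up for illustration rather than produced by a real tagger.

```python
def matches_at(pattern, tag_candidates, start):
    """True if every pattern slot shares at least one tag with its token."""
    if start + len(pattern) > len(tag_candidates):
        return False
    return all(
        set(slot) & set(tag_candidates[start + i])
        for i, slot in enumerate(pattern)
    )


def find_matches(pattern, tokens, tag_candidates):
    """Return every token span whose candidate tags fit the pattern."""
    width = len(pattern)
    return [
        " ".join(tokens[i:i + width])
        for i in range(len(tokens) - width + 1)
        if matches_at(pattern, tag_candidates, i)
    ]


tokens = ["the", "running", "water"]
# Assumed top-2 tags per token (or tags pooled from two taggers).
tag_candidates = [["DT"], ["VBG", "JJ"], ["NN"]]

jnc_pattern = [["JJ"], ["NN", "NNS"]]
jnc_matches = find_matches(jnc_pattern, tokens, tag_candidates)
```

Here running water is still found although JJ is only the second-ranked tag for running, which is the kind of misclassification tolerance listed above.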

Variable-Length MWE

Enable the use of wildcard, quantifier, and set-negation characters in user-defined POS sequences.
- Wildcards: .
- e.g.: "[.]?" would allow any POS tag, zero or one time.
  - e.g.: "[[DT][.]?[NN]]" would catch the bus as well as the blue bus.
- Quantifiers: ?, +, *, {n}, {n,}, {n,m}
- Any outer list element without a quantifier essentially means "match one and only one occurrence", i.e. [<POS>]{1}
- Character Set Negation: [^]
- e.g. to match bigrams that do not start with a determiner: "[[^DT][.]]"
The user can set a maximum sequence length.
- A default maximum length must be defined to cap the complexity of long POS sequences (quantifier-expanded or not).
- A warning must be logged to console when the user sets a max-length above some computational threshold (TBD), or when the expanded POS sequence exceeds this threshold.
Raise errors when quantifier-expanded POS sequences surpass the max-length argument.
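
The wildcard, quantifier, negation and max-length ideas above can be sketched with a small backtracking matcher. Everything here is an assumption for illustration: Slot, match_lengths and extract_candidates are hypothetical names, slots are built by hand instead of being parsed from the "[[DT][.]?[NN]]" string syntax, and the error is raised only when the shortest possible expansion already exceeds max_length (longer expansions are simply pruned).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slot:
    tags: Optional[frozenset] = None  # None acts as the wildcard "."
    negate: bool = False              # True behaves like [^...]
    min_count: int = 1                # default quantifier is {1}
    max_count: Optional[int] = 1      # None means unbounded (+ or *)

    def accepts(self, tag):
        if self.tags is None:
            return True
        return (tag in self.tags) != self.negate


def match_lengths(slots, tags, start, max_length):
    """Return every span length from `start` that satisfies all slots."""
    lengths = set()

    def walk(slot_idx, pos, count):
        if pos - start > max_length:
            return  # prune expansions longer than the allowed maximum
        if slot_idx == len(slots):
            lengths.add(pos - start)
            return
        slot = slots[slot_idx]
        if count >= slot.min_count:
            walk(slot_idx + 1, pos, 0)  # move on to the next slot
        under_max = slot.max_count is None or count < slot.max_count
        if under_max and pos < len(tags) and slot.accepts(tags[pos]):
            walk(slot_idx, pos + 1, count + 1)  # consume one more token

    walk(0, start, 0)
    return lengths


def extract_candidates(slots, tagged, max_length=5):
    """Return all candidate spans, shortest first for each start position."""
    if sum(slot.min_count for slot in slots) > max_length:
        raise ValueError("shortest expansion of the pattern exceeds max_length")
    tags = [tag for _, tag in tagged]
    found = []
    for start in range(len(tagged)):
        for n in sorted(match_lengths(slots, tags, start, max_length)):
            if n > 0:
                found.append(" ".join(word for word, _ in tagged[start:start + n]))
    return found


# "[[DT][.]?[NN]]": determiner, optional wildcard, noun.
det_opt_noun = [
    Slot(tags=frozenset({"DT"})),
    Slot(min_count=0, max_count=1),
    Slot(tags=frozenset({"NN"})),
]
# "[[^DT][.]]": any non-determiner followed by any tag.
not_det_bigram = [Slot(tags=frozenset({"DT"}), negate=True), Slot()]

the_bus = extract_candidates(det_opt_noun, [("the", "DT"), ("bus", "NN")])
blue_bus = [("the", "DT"), ("blue", "JJ"), ("bus", "NN")]
the_blue_bus = extract_candidates(det_opt_noun, blue_bus)
no_det = extract_candidates(not_det_bigram, blue_bus)
```

With these hand-assigned tags, the first pattern yields the bus and the blue bus, and the negated pattern yields only blue bus. Whether over-long expansions should be pruned or raise is a design decision this sketch leaves open.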

Objective

Provide users with configurability of MWE patterns to extract based on their specific use cases.
Provide users with flexibility for dealing with POS Tagger misclassification errors.
Provide users with the ability to pass the top-n POS tag candidates generated by a POS Tagger instead of only the top-1 candidate (as is currently the case).
Provide users with the ability to use results from more than one POS Tagger, incorporating misalignment between POS Taggers (i.e. more than one candidate POS tag per token).
Provide users with the ability to extract MWEs while ignoring the POS tag structure of collocations (i.e. generate strictly statistical collocations).
- e.g. to generate all possible trigrams: "[[.][.][.]]"
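
For reference, an all-wildcard pattern such as "[[.][.][.]]" reduces to plain n-gram generation. A minimal sketch, with ngrams as a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every contiguous n-token span, ignoring POS tags entirely."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = ngrams(["New", "York", "State", "Senate"], 3)
```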

Justification

The current mwe_type argument options (i.e. NC and JNC) significantly restrict wordview's utility to users (e.g. NLP Engineers).
The current mwe_type argument options (i.e. NC and JNC) significantly restrict the real-world, enterprise use cases/problems wordview can solve.

Downstream Work

Enable anchors, i.e. ^ and $ in POS sequences relative to a sentence or user-defined text segment.
Enable positive and negative lookaheads (?=)/(?!) and lookbehinds (?<=)/(?<!) in POS sequences.
Enable option to select from multiple POS Taggers wrapped by the library.
Enable option to select more than one POS Tagger wrapped by the library.
Enable option for the user to provide their own (bespoke) POS-tagged tokenised text.

NB

This implementation is basically the implementation of the library tracer for tagging syntactical sequences of higher-order concepts (e.g. controlled vocabularies/mappings), but applied strictly to POS tags (i.e. lower-order syntactical concepts/mappings).

meghdadFar commented 1 year ago

For user-defined POS patterns, I suggest that we allow users to define a custom MWE pattern in the nltk.RegexpParser format so that we can directly use this NLTK method to extract those patterns.

# e.g. define only noun-noun or adj-noun compounds
mwe_pattern_nnc_bigram = 'NNC: {<NN.*><NN.*>}'
mwe_pattern_jnc_bigram = 'JNC: {<JJ><NN.*>}'

# or define a generic <adj* noun+> compound pattern
mwe_pattern_nnc_ngram = 'NC: {<JJ>*<NN.*>+}'

We can predefine a number of these patterns for standard MWE types (e.g. NCs, LVCs, VPCs), so that the user can select from a set of existing patterns.

Then we continue with NLTK to extract those patterns from sentences:

import nltk

# Requires the NLTK tokenizer and POS tagger data (installed via nltk.download()).
chunk_parser = nltk.RegexpParser(mwe_pattern_nnc_bigram)
sentence = "An example sentence"
tagged_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
parsed_sentence = chunk_parser.parse(tagged_sentence)
# Collect the surface form of every chunk labelled 'NNC'.
NNCs = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in parsed_sentence.subtrees(filter=lambda t: t.label() == 'NNC')
]
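
As a rough illustration of how predefined patterns and the generic n-gram pattern could fit together, here is a sketch that runs nltk.RegexpParser on a hand-tagged sentence, so no tagger data is needed. PREDEFINED_PATTERNS and extract_chunks are hypothetical names, not part of wordview or NLTK, and the tags are assumptions matching the New York State example from the issue.

```python
import nltk

# Hypothetical alias registry; the labels and grammars are illustrative only.
PREDEFINED_PATTERNS = {
    "NC": "NC: {<JJ>*<NN.*>+}",
    "NNC": "NNC: {<NN.*><NN.*>}",
    "JNC": "JNC: {<JJ><NN.*>}",
}


def extract_chunks(tagged_sentence, alias):
    """Chunk a (word, tag) list with the grammar registered under `alias`."""
    parser = nltk.RegexpParser(PREDEFINED_PATTERNS[alias])
    tree = parser.parse(tagged_sentence)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == alias)
    ]


# Hand-assigned tags (assumption); a real tagger may disagree.
tagged = [("New", "JJ"), ("York", "NN"), ("State", "NN"), ("is", "VBZ"), ("large", "JJ")]
nc_chunks = extract_chunks(tagged, "NC")
```

Under these assumed tags, the variable-length NC grammar captures New York State as a single candidate instead of the two bigram false positives.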

leobeeson commented 1 year ago

Sounds good. I'll check out nltk.RegexpParser and focus on it, and choose an initial set of standard MWE patterns. Thanks.
