meghdadFar / wordview

A Python package for Exploratory Data Analysis (EDA) for text-based data.
MIT License

Enable User-Defined POS Pattern, Variable-Length MWE Candidate Generation #42

Closed leobeeson closed 1 year ago

leobeeson commented 1 year ago

Situation

Solution

User-defined POS Pattern for Syntactical Structure

Variable-Length MWE

Objective

Justification

Downstream Work

NB

meghdadFar commented 1 year ago

For user-defined POS patterns, I suggest that we allow users to define a custom MWE pattern in the nltk.RegexpParser grammar format, so that we can pass it directly to this NLTK class to extract those patterns.

# e.g., define only noun-noun or adjective-noun compounds
mwe_pattern_nnc_bigram = 'NNC: {<NN.*><NN.*>}'
mwe_pattern_jnc_bigram = 'JNC: {<JJ><NN.*>}'

# or define a generic <adj* noun+> compound pattern
# (note: this also matches a single noun, so one-word chunks need filtering)
mwe_pattern_nc_ngram = 'NC: {<JJ>*<NN.*>+}'

We can predefine and offer a number of these patterns for standard MWE types, e.g. noun compounds (NCs), light verb constructions (LVCs), and verb-particle constructions (VPCs), so that users can select from a set of existing patterns.
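
One way to offer predefined patterns alongside user-defined ones is a simple mapping from a pattern name to a grammar string. The sketch below is only an illustration: `PREDEFINED_MWE_PATTERNS`, `build_mwe_parser`, and the LVC and VPC grammars are assumptions made for this example, not existing wordview code. It uses pre-tagged tokens so that no NLTK models need to be downloaded.

```python
import nltk

# Hypothetical registry of grammars in nltk.RegexpParser format.
# The LVC and VPC grammars are rough illustrations, not tuned patterns.
PREDEFINED_MWE_PATTERNS = {
    "NC": "NC: {<NN.*><NN.*>+}",        # noun compound: two or more nouns
    "JNC": "JNC: {<JJ><NN.*>}",         # adjective-noun compound
    "LVC": "LVC: {<VB.*><DT>?<NN.*>}",  # light verb construction, e.g. "take a walk"
    "VPC": "VPC: {<VB.*><RP>}",         # verb-particle construction, e.g. "give up"
}


def build_mwe_parser(name_or_grammar):
    """Return a chunk parser for a predefined pattern name or a custom grammar."""
    # Fall back to treating the argument as a user-defined grammar string.
    grammar = PREDEFINED_MWE_PATTERNS.get(name_or_grammar, name_or_grammar)
    return nltk.RegexpParser(grammar)


# Pre-tagged input (Penn Treebank tags), so nltk.pos_tag is not needed here.
tagged = [("She", "PRP"), ("gave", "VBD"), ("up", "RP"), ("smoking", "NN")]
tree = build_mwe_parser("VPC").parse(tagged)
vpcs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "VPC")
]
print(vpcs)
```

A user-defined pattern goes through the same helper, e.g. `build_mwe_parser('MY: {<JJ>+<NN.*>}')`, because any string that is not a registry key is treated as a grammar.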

Then we continue with NLTK to extract those patterns from sentences:

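import nltk  # word_tokenize and pos_tag need their NLTK models (see nltk.download())

chunk_parser = nltk.RegexpParser(mwe_pattern_nnc_bigram)
sentence = "An example sentence"
# Tokenize and POS-tag the sentence before chunking
tagged_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
parsed_sentence = chunk_parser.parse(tagged_sentence)
# Collect the surface form of every chunk labelled 'NNC'
NNCs = []
for subtree in parsed_sentence.subtrees(filter=lambda t: t.label() == 'NNC'):
    NNCs.append(' '.join(word for word, tag in subtree.leaves()))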
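
The same steps could be wrapped in a single function that accepts any grammar and returns variable-length candidates. This is a minimal sketch under assumptions: the function name `extract_mwe_candidates` and its `min_length` parameter are hypothetical, and the input is already POS-tagged so the example runs without downloading NLTK models.

```python
import nltk


def extract_mwe_candidates(tagged_tokens, grammar, label, min_length=2):
    """Extract chunks with the given label from a POS-tagged sentence.

    tagged_tokens: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    grammar: a grammar string in nltk.RegexpParser format.
    label: the chunk label used in the grammar, e.g. 'NC'.
    min_length: minimum number of tokens, so one-word chunks are dropped.
    """
    parsed = nltk.RegexpParser(grammar).parse(tagged_tokens)
    candidates = []
    for subtree in parsed.subtrees(filter=lambda t: t.label() == label):
        words = [word for word, tag in subtree.leaves()]
        # A pattern such as <JJ>*<NN.*>+ also matches a lone noun; skip those.
        if len(words) >= min_length:
            candidates.append(" ".join(words))
    return candidates


tagged = [
    ("the", "DT"), ("big", "JJ"), ("data", "NN"), ("analysis", "NN"),
    ("tool", "NN"), ("works", "VBZ"), ("well", "RB"),
]
print(extract_mwe_candidates(tagged, "NC: {<JJ>*<NN.*>+}", "NC"))
```

Filtering on `min_length` keeps the grammar itself simple while ensuring that only multiword chunks are returned as MWE candidates.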

leobeeson commented 1 year ago

Sounds good. I'll look into nltk.RegexpParser, focus on it, and choose an initial set of standard MWE patterns. Thanks.