nschneid / pysupersensetagger

AMALGrAM, an English supersense tagger written in Python
GNU General Public License v3.0

POS-sensitive lemmatization #18

Open nschneid opened 9 years ago

nschneid commented 9 years ago

The WordNet lemmatizer accepts a coarse POS but not a fine-grained POS, and the fine-grained tag is sometimes required to disambiguate the lemma. The morph.py interface to the WordNet lemmatizer special-cases several such words, for example "fell", which can be a present-tense verb (lemma: "fell") or a past-tense verb (lemma: "fall"). The current rules are as follows:

if w=='fell' and p=='VBD': return 'fall'
elif w=='found' and p in {'VBD','VBN'}: return 'find'
elif w=='lay' and p=='VBD': return 'lie'
elif w=='saw' and p=='VBD': return 'see'
elif w=='people' and p=='NNS': return 'person'

These should be updated with additional cases, such as "smelt".
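
To illustrate why the coarse POS is not enough, here is a minimal sketch of how Penn Treebank tags collapse to WordNet-style coarse categories. The helper name `coarse_pos` is hypothetical and not part of morph.py; the single-letter codes follow the usual WordNet convention (n, v, a, r).

```python
def coarse_pos(tag):
    """Collapse a Penn Treebank tag to a WordNet-style coarse POS.

    Hypothetical helper for illustration only; returns None for tags
    with no WordNet category.
    """
    if tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('NN'):   # NN, NNS, NNP, NNPS
        return 'n'
    if tag.startswith('JJ'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):   # RB, RBR, RBS
        return 'r'
    return None


# Past tense (VBD) and present tense (VBP) both collapse to 'v', so a
# lemmatizer that sees only the coarse POS cannot tell past-tense "fell"
# (lemma "fall") from present-tense "fell" (lemma "fell").
print(coarse_pos('VBD'), coarse_pos('VBP'))
```

Because the tense distinction is lost at this step, the word-specific rules have to be applied before falling back to the lemmatizer.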

nschneid commented 9 years ago

Better:

if p in {'VBD','VBN'}:
  if w=='found': return 'find'
  elif w=='ground': return 'grind'
  elif w=='rent': return 'rend'
  elif w=='smelt': return 'smell'
  elif w=='wound': return 'wind'
  elif p=='VBD':
    if w=='fell': return 'fall'
    elif w=='lay': return 'lie'
    elif w=='saw': return 'see'
elif p[0]=='V' and w=='stove': return 'stove'  # WordNet has 'stove' only as the past tense/past participle of 'stave', but apparently 'stove' can be a verb in its own right
# 'ridden' is a past participle of 'rid' and the past participle of 'ride'. The fine-grained POS is not enough to disambiguate, but 'ride' (which WordNet gives) is probably more common.
elif p=='NNS' and w=='people': return 'person'
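
One possible way to keep these rules easy to extend is a table-driven lookup. The following is only a sketch under assumptions: the names `PAST_OR_PARTICIPLE`, `PAST_ONLY`, and `lemma_override` are hypothetical and do not exist in morph.py, and returning `None` is assumed to mean "no rule applies, fall back to the WordNet lemmatizer".

```python
# Words whose lemma changes when tagged VBD or VBN (mirrors the rules above).
PAST_OR_PARTICIPLE = {
    'found': 'find',
    'ground': 'grind',
    'rent': 'rend',
    'smelt': 'smell',
    'wound': 'wind',
}

# Words whose lemma changes only when tagged VBD.
PAST_ONLY = {
    'fell': 'fall',
    'lay': 'lie',
    'saw': 'see',
}


def lemma_override(w, p):
    """Return a POS-specific lemma for word w with tag p, or None.

    Hypothetical helper: None signals that no special case applies and
    the caller should fall back to its usual lemmatizer.
    """
    if p in {'VBD', 'VBN'}:
        if w in PAST_OR_PARTICIPLE:
            return PAST_OR_PARTICIPLE[w]
        if p == 'VBD' and w in PAST_ONLY:
            return PAST_ONLY[w]
    elif p.startswith('V') and w == 'stove':
        # Non-past verbal 'stove' keeps its own form as the lemma.
        return 'stove'
    elif p == 'NNS' and w == 'people':
        return 'person'
    return None
```

With this layout, `lemma_override('smelt', 'VBD')` returns `'smell'`, while `lemma_override('smelt', 'VBP')` returns `None`, leaving the present-tense form to the default lemmatizer. Adding a new case is then a one-line change to one of the dictionaries.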