nltk / nltk

NLTK Source
https://www.nltk.org
Apache License 2.0

"attempt" classified as past verb #1381

Closed JohannesBuchner closed 8 years ago

JohannesBuchner commented 8 years ago

The word "attempt" is classified by pos_tag as VBD. According to http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html this tag means "Verb, past tense". But "attempt" is present tense (VB) and "attempted" is the past tense (VBD).

import nltk
sentence = """We first attempt to tackle the will; how exactly we are going to see. Shall we see then?"""
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)

[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBD'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.'), ('Shall', 'NNP'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]

Not sure what other words fall in this error class.

It would be nice if "going to", "will", "shall" could be used to indicate future tense in a sentence.
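As a rough illustration of that idea (this is not an NLTK feature, just a naive rule-based sketch over pos_tag-style output; the function name `future_markers` is made up):

```python
# Sketch: flag likely future-tense markers in a tagged sentence by
# looking for modal "will"/"shall" (tag MD) or "going to" + verb.

def future_markers(tagged):
    """Return indices in `tagged` where a future-tense marker starts."""
    hits = []
    for i, (word, tag) in enumerate(tagged):
        w = word.lower()
        if w in ('will', 'shall') and tag == 'MD':
            hits.append(i)
        elif (w == 'going' and tag == 'VBG'
              and i + 2 < len(tagged)
              and tagged[i + 1][0].lower() == 'to'
              and tagged[i + 2][1].startswith('VB')):
            hits.append(i)
    return hits

# Tag sequence copied from the pos_tag output above.
tagged = [('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'),
          ('to', 'TO'), ('see', 'VB')]
print(future_markers(tagged))  # [2]
```

Of course this only works when the tagger gets the underlying tags right in the first place, which is exactly the problem reported here.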

alvations commented 8 years ago

No model is perfect =(

Also, you should try to sentence tokenize your text first (this won't change the POS tags much though):

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sentences = """We first attempt to tackle the will; how exactly we are going to see. Shall we see then?"""
>>> tagged_sentences = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
>>> tagged_sentences
[[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VB'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.')], [('Shall', 'NN'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]]

But do note that even a state-of-the-art tagger like the Stanford POS Tagger makes this error too:

alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-12-09/stanford-postagger.jar
alvas@ubi:~$ export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-12-09/models
alvas@ubi:~$ python
>>> from nltk.tag import StanfordPOSTagger
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sentences = """We first attempt to tackle the will; how exactly we are going to see. Shall we see then?"""
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> [st.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[(u'We', u'PRP'), (u'first', u'JJ'), (u'attempt', u'NN'), (u'to', u'TO'), (u'tackle', u'VB'), (u'the', u'DT'), (u'will', u'NN'), (u';', u':'), (u'how', u'WRB'), (u'exactly', u'RB'), (u'we', u'PRP'), (u'are', u'VBP'), (u'going', u'VBG'), (u'to', u'TO'), (u'see', u'VB'), (u'.', u'.')], [(u'Shall', u'VB'), (u'we', u'PRP'), (u'see', u'VB'), (u'then', u'RB'), (u'?', u'.')]]

#1214 asks for better basic NLP tools, and I think this issue would fall under that too.

alvations commented 8 years ago

Interestingly, the good ol' HunPOS seems to get better tags:

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> [ht.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBP'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'NN'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.')], [('Shall', 'MD'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]]

The SENNA tagger does pretty well too:

>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> [st.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[('We', u'PRP'), ('first', u'RB'), ('attempt', u'VBP'), ('to', u'TO'), ('tackle', u'VB'), ('the', u'DT'), ('will', u'NN'), (';', u':'), ('how', u'WRB'), ('exactly', u'RB'), ('we', u'PRP'), ('are', u'VBP'), ('going', u'VBG'), ('to', u'TO'), ('see', u'VB'), ('.', u'.')], [('Shall', u'MD'), ('we', u'PRP'), ('see', u'VB'), ('then', u'RB'), ('?', u'.')]]
JohannesBuchner commented 8 years ago

With

from nltk.data import load
maxent_tagger = load("taggers/maxent_treebank_pos_tagger/english.pickle")
maxent_tagger.tag(tokens)

I get good results -- I think I will stick with that for the moment, since I don't want to install other executables (though HunPOS seems to be an alternative).

[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBD'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.'), ('Shall', 'NNP'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]

stevenbird commented 8 years ago

Thanks @JohannesBuchner and @alvations. I guess there is nothing to do at this point, so I will close this.