Closed JohannesBuchner closed 8 years ago
No model is perfect =(
Also, you should try to sentence-tokenize your text first (this won't change the POS tags much, though):
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sentences = """We first attempt to tackle the will; how exactly we are going to see. Shall we see then?"""
>>> tagged_sentences = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
>>> tagged_sentences
[[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VB'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.')], [('Shall', 'NN'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]]
But do note that even a state-of-the-art tagger like the Stanford POS Tagger makes errors too:
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-postagger-full-2015-12-09/stanford-postagger.jar
alvas@ubi:~$ export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-postagger-full-2015-12-09/models
alvas@ubi:~$ python
>>> from nltk.tag import StanfordPOSTagger
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> sentences = """We first attempt to tackle the will; how exactly we are going to see. Shall we see then?"""
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
>>> [st.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[(u'We', u'PRP'), (u'first', u'JJ'), (u'attempt', u'NN'), (u'to', u'TO'), (u'tackle', u'VB'), (u'the', u'DT'), (u'will', u'NN'), (u';', u':'), (u'how', u'WRB'), (u'exactly', u'RB'), (u'we', u'PRP'), (u'are', u'VBP'), (u'going', u'VBG'), (u'to', u'TO'), (u'see', u'VB'), (u'.', u'.')], [(u'Shall', u'VB'), (u'we', u'PRP'), (u'see', u'VB'), (u'then', u'RB'), (u'?', u'.')]]
Interestingly, the good ol' HunPOS seems to get better tags:
$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> [ht.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBP'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'NN'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.')], [('Shall', 'MD'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]]
The SENNA tagger does pretty well too:
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> [st.tag(word_tokenize(sent)) for sent in sent_tokenize(sentences)]
[[('We', u'PRP'), ('first', u'RB'), ('attempt', u'VBP'), ('to', u'TO'), ('tackle', u'VB'), ('the', u'DT'), ('will', u'NN'), (';', u':'), ('how', u'WRB'), ('exactly', u'RB'), ('we', u'PRP'), ('are', u'VBP'), ('going', u'VBG'), ('to', u'TO'), ('see', u'VB'), ('.', u'.')], [('Shall', u'MD'), ('we', u'PRP'), ('see', u'VB'), ('then', u'RB'), ('?', u'.')]]
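To make the differences easier to spot, the outputs above can be lined up token by token. This is only a sketch: the tags are copied by hand from the first sentence of each output in this thread, and `disagreements` is a made-up helper name, not part of NLTK.

```python
# Tokens of the first sentence, as produced by word_tokenize above.
tokens = "We first attempt to tackle the will ; how exactly we are going to see .".split()

# Tags copied from the outputs shown in this thread (first sentence only).
tags = {
    "pos_tag":  "PRP RB VB TO VB DT MD : WRB RB PRP VBP VBG TO VB .".split(),
    "stanford": "PRP JJ NN TO VB DT NN : WRB RB PRP VBP VBG TO VB .".split(),
    "hunpos":   "PRP RB VBP TO VB DT NN : WRB RB PRP VBP VBG TO VB .".split(),
    "senna":    "PRP RB VBP TO VB DT NN : WRB RB PRP VBP VBG TO VB .".split(),
}


def disagreements(tokens, tags):
    """Return (token, {tagger: tag}) for every position where the taggers differ."""
    rows = []
    for i, tok in enumerate(tokens):
        column = {name: seq[i] for name, seq in tags.items()}
        if len(set(column.values())) > 1:
            rows.append((tok, column))
    return rows


for tok, column in disagreements(tokens, tags):
    print(tok, column)
```

For this sentence the taggers only disagree on "first", "attempt" and "will"; everything else is tagged identically.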
With
from nltk.data import load
maxent_tagger = load("taggers/maxent_treebank_pos_tagger/english.pickle")
maxent_tagger.tag(tokens)
I get good results -- I think I will stick with that for the moment, since I don't want to install other executables (HunPOS seems to be an alternative though).
[('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBD'), ('to', 'TO'), ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'), ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.'), ('Shall', 'NNP'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]
Thanks @JohannesBuchner and @alvations. I guess there is nothing to do at this point, so I will close this.
The word "attempt" is classified by pos_tag as VBD. According to http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html, this is "Verb, past tense". But "attempt" here is present tense (VBP, or VB for the base form); "attempted" is the past tense (VBD).
Not sure what other words fall in this error class.
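One crude way to hunt for similar cases is to scan tagged output for words tagged VBD that do not end in "ed". This is only a rough filter and an assumption on my part, not anything NLTK provides: irregular past forms such as "went" or "saw" will be flagged too, so the hits need a manual look. `suspicious_past_tags` is a hypothetical helper name.

```python
def suspicious_past_tags(tagged):
    """Return words tagged VBD whose surface form does not end in 'ed'.

    Crude heuristic: genuine irregular past forms are flagged as well,
    so the result is a list of candidates to review, not a list of errors.
    """
    return [word for word, tag in tagged
            if tag == "VBD" and not word.lower().endswith("ed")]


# Tagged output taken from the maxent tagger result quoted in this thread.
tagged = [('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBD'), ('to', 'TO'),
          ('tackle', 'VB'), ('the', 'DT'), ('will', 'MD'), (';', ':'),
          ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'),
          ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.'),
          ('Shall', 'NNP'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'),
          ('?', '.')]

print(suspicious_past_tags(tagged))
```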
It would be nice if "going to", "will", "shall" could be used to indicate future tense in a sentence.
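A rough version of this can be layered on top of any tagger's output. The sketch below is a heuristic under my own assumptions, not an NLTK feature: it treats "will"/"shall" as a future marker only when tagged MD, and "going to" only when followed by a base-form verb (VB). It therefore inherits whatever mistakes the tagger makes. `has_future_marker` is a hypothetical name.

```python
def has_future_marker(tagged):
    """Guess whether a POS-tagged sentence contains a future-tense marker.

    `tagged` is a list of (word, tag) pairs with Penn Treebank tags.
    """
    words = [word.lower() for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    for i, (word, tag) in enumerate(zip(words, tags)):
        # Modal "will"/"shall"; the noun "will" (tagged NN) is ignored.
        if word in {"will", "shall"} and tag == "MD":
            return True
        # "going to" + base-form verb, e.g. "going to see".
        # Slices avoid an IndexError near the end of the sentence.
        if word == "going" and words[i + 1:i + 2] == ["to"] and tags[i + 2:i + 3] == ["VB"]:
            return True
    return False


# The two sentences as tagged by HunPOS earlier in this thread.
sent1 = [('We', 'PRP'), ('first', 'RB'), ('attempt', 'VBP'), ('to', 'TO'),
         ('tackle', 'VB'), ('the', 'DT'), ('will', 'NN'), (';', ':'),
         ('how', 'WRB'), ('exactly', 'RB'), ('we', 'PRP'), ('are', 'VBP'),
         ('going', 'VBG'), ('to', 'TO'), ('see', 'VB'), ('.', '.')]
sent2 = [('Shall', 'MD'), ('we', 'PRP'), ('see', 'VBP'), ('then', 'RB'), ('?', '.')]

print(has_future_marker(sent1), has_future_marker(sent2))
```

With tags where "will" in "the will" comes out as MD, the first check would fire on the noun, so the result is only as good as the tagging.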