paudan / opennlp_python

Python NLTK module for interfacing with the Apache OpenNLP

Chunker issues #8

Open mlpacheco opened 4 years ago

mlpacheco commented 4 years ago

I've come across a couple of issues with the chunker. First, it can't handle underscore characters, which I've worked around by replacing them.

Second, I am getting a strange error while trying to process this sentence:

Confirm L and Confirm R options complete feature negotiation and are sent in response to Change R and Change L options , respectively .

I can tag it and get:

[('Confirm', 'NNP'), ('L', 'NNP'), ('and', 'CC'), ('Confirm', 'NNP'), ('R', 'NN'), ('options', 'NNS'), ('complete', 'JJ'), ('feature', 'NN'), ('negotiation', 'NN'), ('and', 'CC'), ('are', 'VBP'), ('sent', 'VBN'), ('in', 'IN'), ('response', 'NN'), ('to', 'TO'), ('Change', 'NNP'), ('R', 'NN'), ('and', 'CC'), ('Change', 'NNP'), ('L', 'NNP'), ('options', 'NNS'), (',', ','), ('respectively', 'RB'), ('.', '.')]

But once I attempt to run the chunker I run into an issue:

File "chunk_skyline.py", line 86, in segment_chunk
    tree = cp.parse(sentence)
  File "/homes/pachecog/.local/lib/python3.6/site-packages/nltk_opennlp-1.0.2-py3.6.egg/nltk_opennlp/chunkers.py", line 98, in parse
  File "/homes/pachecog/.local/lib/python3.6/site-packages/nltk_opennlp-1.0.2-py3.6.egg/nltk_opennlp/chunkers.py", line 160, in __get_nltk_parse_tree__
  File "/homes/pachecog/.local/lib/python3.6/site-packages/nltk_opennlp-1.0.2-py3.6.egg/nltk_opennlp/chunkers.py", line 149, in move_up
AttributeError: 'NoneType' object has no attribute 'remove'

I can't figure out why such a simple sentence would fail. There seems to be no parent at this line: https://github.com/paudan/opennlp_python/blob/master/nltk_opennlp/chunkers.py#L149

paudan commented 4 years ago

I think I found the problem: NLTK trees are indexed by equality, so parent detection in the chunk tree fails when the same tagged token occurs more than once under the same noun phrase. For now I have added a temporary workaround to allow further processing, but this is a more serious issue that may be addressed later.
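The equality problem can be illustrated with a minimal sketch. It is not the code from `chunkers.py`: it uses plain nested lists as a stand-in for NLTK trees (`nltk.Tree` subclasses `list`), and the helpers `tok`, `find_parent_by_equality`, and `find_parent_by_identity` are hypothetical names introduced only for this example. It shows how a lookup based on `==` can return the wrong parent when the same tagged token, such as `('Confirm', 'NNP')`, appears twice, while a lookup based on `is` does not.

```python
def tok(word, tag):
    # Hypothetical helper: builds a fresh tuple object on every call,
    # so two equal tokens are still distinct objects.
    return (word, tag)


def find_parent_by_equality(root, target):
    """Return the first node (depth-first) holding a child that == target."""
    for child in root:
        if child == target:
            return root
        if isinstance(child, list):
            found = find_parent_by_equality(child, target)
            if found is not None:
                return found
    return None


def find_parent_by_identity(root, target):
    """Return the node holding the exact object `target` (compared with `is`)."""
    for child in root:
        if child is target:
            return root
        if isinstance(child, list):
            found = find_parent_by_identity(child, target)
            if found is not None:
                return found
    return None


# Two chunks that each start with an equal ('Confirm', 'NNP') token.
first_confirm = tok("Confirm", "NNP")
second_confirm = tok("Confirm", "NNP")
np1 = [first_confirm, tok("L", "NNP")]
np2 = [second_confirm, tok("R", "NN")]
root = [np1, tok("and", "CC"), np2]

# The tokens are equal but are not the same object.
print(first_confirm == second_confirm, first_confirm is second_confirm)

# Equality lookup stops at the first equal token, which lives in np1.
print(find_parent_by_equality(root, second_confirm) is np1)

# Identity lookup finds the chunk that really contains the token.
print(find_parent_by_identity(root, second_confirm) is np2)
```

The same caveat applies to `list.remove`, which also compares with `==` and removes the first equal element, so a more permanent fix would probably need to track nodes by identity or by tree position rather than by value.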