machinalis / iepy

Information Extraction in Python
BSD 3-Clause "New" or "Revised" License

Segmenter error #127

Open milesscherrer opened 7 years ago

milesscherrer commented 7 years ago

We are building a Swedish IEPY pipeline without a syntactic parser (none is available for Swedish). When running the active learning core, we get the following error in the hydrate function of models.py:

File "/home/ubuntu/.local/lib/python3.5/site-packages/iepy/data/models.py", line 388, in hydrate
    self.syntactic_sentences = [doc.syntactic_sentences[s] for s in self.sentences]
IndexError: list index out of range

We traced the error back to our missing parser output, which, as we understand it, the segmenter uses. Since we do not have a syntactic parser, is there a way to bypass segment-based labelling and do only document-based labelling? Or a way to bypass syntactic parsing in the segmenter? Syntactic parsing was added in version 0.9.3, so how did the segmenter work before then?
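For anyone trying to reproduce this outside IEPY, here is a minimal sketch of the failure mode. It does not use IEPY: `doc` and `segment` are hypothetical stand-ins built with `SimpleNamespace`, and it assumes the segment stores sentence indexes into the document while the document's list of parses is empty because no parser ever ran.

```python
from types import SimpleNamespace

# Hypothetical stand-ins, not IEPY classes: a document whose syntactic
# parsing step never ran, and a segment that refers to two sentences.
doc = SimpleNamespace(syntactic_sentences=[])
segment = SimpleNamespace(sentences=[0, 1])

try:
    # Same expression as line 388 of models.py, applied to the stand-ins.
    segment.syntactic_sentences = [
        doc.syntactic_sentences[s] for s in segment.sentences
    ]
except IndexError as exc:
    # Indexing an empty list with any sentence index fails here.
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

Under these assumptions, any segment that refers to at least one sentence triggers the error, which matches the traceback above.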

jmansilla commented 7 years ago

The easiest hack on your side seems to be adding a dummy SyntacticParsing step to your preprocess pipeline. It should return a sequence of parse-tree strings (in a format that nltk.tree.Tree.fromstring can parse).

So, in pseudocode, your dummy syntactic parser should do: syntactic_parsing = ["()" for sent in sentences]
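Spelled out a little further, the pseudocode above could look like the sketch below. The function name `dummy_syntactic_parsing` and the sample sentences are made up for illustration, and the sketch assumes sentences arrive as lists of tokens. How the result is registered as a step in the IEPY preprocess pipeline is not shown, since that depends on IEPY's step interface.

```python
def dummy_syntactic_parsing(sentences):
    """Return one placeholder parse-tree string per sentence.

    Hypothetical helper: "()" is the empty bracketing suggested above,
    used only so that every sentence has some parse entry.
    """
    return ["()" for _ in sentences]


# Made-up tokenized sentences standing in for a preprocessed document.
sentences = [
    ["Stockholm", "är", "Sveriges", "huvudstad", "."],
    ["Den", "ligger", "vid", "Mälaren", "."],
]

parsed = dummy_syntactic_parsing(sentences)
print(len(parsed), len(sentences))
```

The point of the placeholder is only to keep the list of parses the same length as the list of sentences, so that looking up a parse by sentence index no longer runs off the end of the list.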

milesscherrer commented 7 years ago

Thanks, I'll try that out. I also looked at version 0.9.2 (before syntactic parsing was added) and saw that it does not have line 388 of models.py:

self.syntactic_sentences = [doc.syntactic_sentences[s] for s in self.sentences]

We tried removing the line, and the core then ran. However, the entity offsets were wrong for most sentences (entities pointing to the previous token); only a couple of sentences at the beginning had correct offsets.

Not sure if the entity offset error is due to our general customisation of IEPY or if it could be due to the removal of line 388, but we are trying to backtrack the error.