smartschat / cort

A toolkit for coreference resolution and error analysis.
MIT License

cort-predict-raw runs on python2 but not python3.5 #17

Open bennytieu opened 7 years ago

bennytieu commented 7 years ago

I was trying to run cort-predict-raw with the following command:

python3.5 /usr/local/bin/cort-predict-raw -in ~/data/pilot_44_docs/*.txt -model models/model-pair-train.obj -extractor cort.coreference.approaches.mention_ranking.extract_substructures -perceptron cort.coreference.approaches.mention_ranking.RankingPerceptron -clusterer cort.coreference.clusterer.all_ante -corenlp ~/systems/stanford/stanford-corenlp-full-2016-10-31

and got the following error message:

Traceback (most recent call last):
  File "/usr/local/bin/cort-predict-raw", line 136, in <module>
    doc.system_mentions = mention_extractor.extract_system_mentions(doc)
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in extract_system_mentions
    for span in extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mention_extractor.py", line 36, in <listcomp>
    for span in extract_system_mention_spans(document)]
  File "/usr/local/lib/python3.5/dist-packages/cort/core/mentions.py", line 126, in from_document
    i, sentence_span = document.get_sentence_id_and_span(span)
TypeError: 'NoneType' object is not iterable
2017-04-27 09:17:06,058 WARNING Killing subprocess 14154
2017-04-27 09:17:06,395 INFO Subprocess seems to be stopped, exit code -9

It works without a problem with Python 2, though. I'm running this on Ubuntu 16.04.

smartschat commented 7 years ago

Can you isolate (and post) the document which causes the error message?

bennytieu commented 7 years ago

I have isolated it to this string:

Contact for company: Sven Svensson 212 584 5242 sven.svensson@email.com.

I'm guessing the sequence of numbers is at fault. Single numbers are fine; for example, years like 2017 appear in other documents without causing problems.

This example works:

Contact for company: Sven Svensson 584 5242 sven.svensson@email.com.

smartschat commented 7 years ago

I did some debugging: the first example is tokenized as ['Contact', 'for', 'company', ':', 'Sven', 'Svensson', '212Â\xa0584Â\xa05242', 'sven.svensson@email.com', '.']. I suspect that the TypeError happens because some representation I rely on handles the numbers as individual tokens. I will not be able to fix this right now. Is using Python 2 an option for you?
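One possible explanation, not confirmed in this thread, is that the phone number token contains non-breaking spaces (U+00A0), which Python 3's str.split() treats as whitespace. The sketch below is plain Python 3 and does not use any cort code; tokens_with_hidden_whitespace is a hypothetical helper introduced only for illustration. It shows that the reported token breaks into three pieces under a whitespace split, which would be consistent with the numbers being handled "as individual tokens" elsewhere.

```python
# Token list exactly as reported in the comment above.
tokens = ['Contact', 'for', 'company', ':', 'Sven', 'Svensson',
          '212Â\xa0584Â\xa05242', 'sven.svensson@email.com', '.']


def tokens_with_hidden_whitespace(tokens):
    """Hypothetical helper: return the tokens that str.split() would break apart."""
    return [t for t in tokens if len(t.split()) > 1]


phone = tokens[6]

# With no arguments, str.split() splits on any Unicode whitespace,
# and U+00A0 (non-breaking space) counts as whitespace in Python 3.
print(phone.split())      # ['212Â', '584Â', '5242']

# Splitting on an explicit ASCII space leaves the token in one piece.
print(phone.split(" "))

# Only the phone number is flagged in this token list.
print(tokens_with_hidden_whitespace(tokens))

# The stray 'Â' looks like the UTF-8 bytes of U+00A0 decoded as Latin-1.
print("\xa0".encode("utf-8").decode("latin-1") == "Â\xa0")  # True
```

If this assumption holds, a check like this could be used to find the documents to skip, as a stopgap until the underlying mismatch is fixed.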

bennytieu commented 7 years ago

I will try running on Python 2 in the meantime, or just skip this special case. I'm doing a study on efficiency, so I would prefer to run it using Python 3. Thank you for your quick reply!