jbjorne / TEES

Turku Event Extraction System

Solution to UnicodeEncodeError when input file contains Unicode character (Python 2.X) #11

Open chengkun-wu opened 10 years ago

chengkun-wu commented 10 years ago

Hi Jari,

If a text file containing Unicode characters is passed to TEES as input, you get the exception "UnicodeEncodeError: 'ascii' codec can't encode character" (tested under Python 2.7.3, 64-bit, Mac OS X Mavericks). It is caused by the following code (around line 246) in ElementTreeUtils.py, which writes the ElementTree to the output file.

# Open the output file
    if filename.endswith(".gz"):
        out=codecs.getwriter("utf-8")(GzipFile(filename,"wt"))
    else:
        out=codecs.open(filename,"wt","utf-8")
    print >> out, '<?xml version="1.0" encoding="UTF-8"?>'
    ElementTree.ElementTree(rootElement).write(out,"utf-8")

I had a look at the code; the problem can be solved as follows:

    if filename.endswith(".gz"):
        out=GzipFile(filename,"wt")
    else:
        out=codecs.open(filename,"wt")
    print >> out, '<?xml version="1.0" encoding="UTF-8"?>'
    ElementTree.ElementTree(rootElement).write(out,"utf-8")

The reason for this can be found at http://stackoverflow.com/questions/10046755/write-xml-utf-8-file-with-utf-8-data-with-elementtree
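The idea behind the change is that ElementTree already encodes its output when an encoding is passed to write(), so the file object should accept raw bytes rather than encode a second time. A minimal sketch of that idea follows. It is not the code committed to TEES: the name write_xml is hypothetical, and xml_declaration=True is used here in place of the explicit print of the XML header.

```python
import gzip
import io
import xml.etree.ElementTree as ElementTree


def write_xml(root_element, filename):
    """Write root_element to filename as UTF-8 XML (hypothetical helper)."""
    # Open in binary mode so that ElementTree is the only layer that encodes.
    opener = gzip.open if filename.endswith(".gz") else io.open
    with opener(filename, "wb") as out:
        tree = ElementTree.ElementTree(root_element)
        # ElementTree emits the XML declaration and the UTF-8 bytes itself.
        tree.write(out, encoding="utf-8", xml_declaration=True)
```

Reading the file back with ElementTree.parse (or ElementTree.fromstring on the decompressed bytes for a .gz file) should return the original non-ASCII attribute values unchanged.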

However, even with this problem solved, other problems may follow.

For instance,

Traceback (most recent call last):
  File "classify.py", line 190, in <module>
    preprocessorParams=options.preprocessorParams, bioNLPSTParams=options.bioNLPSTParams)
  File "classify.py", line 72, in classify
    classifyInput = preprocessor.process(input, preprocessorOutput, preprocessorParams, model, [], fromStep=detectorSteps["PREPROCESS"], toStep=None, omitSteps=omitDetectorSteps["PREPROCESS"])
  File "/Users/dtcuser/Documents/workspace/TEES/PWTEES/Detectors/Preprocessor.py", line 49, in process
    xml = ToolChain.process(self, source, output, parameters, model, fromStep, toStep, omitSteps)
  File "/Users/dtcuser/Documents/workspace/TEES/PWTEES/Detectors/ToolChain.py", line 142, in process
    step[1](**stepArgs) # call the tool
  File "/Users/dtcuser/Documents/workspace/TEES/PWTEES/Tools/BANNER.py", line 349, in run
    offsets[0], offsets[1] = fixWhiteSpaceLessOffset(word, sentence.get("text"), int(offsets[0]), int(offsets[1]), map)
  File "/Users/dtcuser/Documents/workspace/TEES/PWTEES/Tools/BANNER.py", line 172, in fixWhiteSpaceLessOffset
    assert entityText == sentenceText[newBegin:newEnd+1], (entityText, sentenceText, (begin, end), (newBegin, newEnd), map)

jbjorne commented 10 years ago

Hi Chengkun,

Thanks, I've added the fix and committed the changes to the repository. However, while the system now seems to process Unicode, please avoid using Unicode input if at all possible. All the machine-learning systems, including the parser and the event detection components, have been trained on ASCII text. Therefore, they cannot recognize Unicode characters and might interpret them in unexpected ways. To convert your input to ASCII, you can use a tool such as https://github.com/spyysalo/unicode2ascii.
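Before converting, it can help to see which characters in an input file are outside ASCII. The sketch below is only an illustration of such a pre-check; it is not part of TEES or unicode2ascii, and the function name find_non_ascii is hypothetical.

```python
import unicodedata


def find_non_ascii(text):
    """Return (offset, character, Unicode name) for each non-ASCII character."""
    return [
        (offset, char, unicodedata.name(char, "UNKNOWN"))
        for offset, char in enumerate(text)
        if ord(char) > 127  # anything above 127 is outside the ASCII range
    ]
```

An empty result means the text is already pure ASCII; otherwise the offsets show where the conversion tool will have to substitute something.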

Best Regards, Jari

chengkun-wu commented 10 years ago

Hi Jari,

Thanks for the reply.

Yes, I guess avoiding Unicode input is the better solution, as I just ran into other problems, such as the traceback in the updated post above.

Chengkun

chengkun-wu commented 10 years ago

Hi Jari,

I'm now working on extending TEES. I have an in-house NER tool, PathNER, which detects pathway mentions (please refer to our paper http://www.ncbi.nlm.nih.gov/pubmed/24555844 ). How can I use TEES to detect events with both BANNER and PathNER? Do you have any suggestions on best practice?

Thanks!

Chengkun