swabhs / open-sesame

A frame-semantic parsing system based on a softmax-margin SegRNN.
Apache License 2.0

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128) #30

Closed fabrahman closed 5 years ago

fabrahman commented 5 years ago

Hi,

I tried using the pretrained model to annotate a corpus. I first tried a small example where the sentences.txt file has only 5 sentences, and it worked well. Then I switched to my own dataset, which is much bigger, and I am getting this error in the first step, when running targetid prediction:

Any suggestions?

_____________________
COMMAND: /home/hannah/open-sesame/sesame/targetid.py --mode predict --model_name fn1.7-pretrained-targetid --raw_input stories.dev
MODEL FOR TEST / PREDICTION:    logs/fn1.7-pretrained-targetid/best-targetid-1.7-model
PARSING MODE:   predict
_____________________

Reading data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll ...
# examples in data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
# examples with missing arguments : 526
Combined 19391 instances in data into 3413 instances.

Reading the lexical unit index file: data/fndata-1.7/luIndex.xml
# unique targets = 9421
# total targets = 13572
# targets with multiple LUs = 4151
# max LUs per target = 5

Reading pretrained embeddings from data/glove.6B.100d.txt ...
Traceback (most recent call last):
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/hannah/open-sesame/sesame/targetid.py", line 87, in <module>
    instances = [make_data_instance(line, i) for i,line in enumerate(fin)]
  File "sesame/raw_data.py", line 18, in make_data_instance
    for i in range(len(tokenized))]
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1909, in _morphy
    forms = apply_rules([form])
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1889, in apply_rules
    if form.endswith(old)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
fabrahman commented 5 years ago

So I realized the problem only exists under Python 2, in the NLTK lemmatizer. I resolved it by adding:

import io

and changing line 86 in sesame/targetid.py to:

with io.open(options.raw_input, "r", encoding='utf8') as fin:
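For reference, here is a minimal, self-contained sketch of why this fix works (the file name and contents are made up for illustration). Byte 0xc2 starts a multi-byte UTF-8 sequence (here a non-breaking space, U+00A0), which Python 2's default ASCII codec rejects when it later tries to decode the line; io.open with an explicit encoding decodes the bytes up front and behaves the same on Python 2 and 3:

```python
import io
import os
import tempfile

# Write a line containing non-ASCII characters; in UTF-8, U+00E9 and
# U+00A0 are encoded with lead bytes 0xc3 and 0xc2 respectively.
path = os.path.join(tempfile.mkdtemp(), "sentences.txt")
with io.open(path, "w", encoding="utf8") as fout:
    fout.write(u"caf\u00e9\u00a0story\n")

# Reading with an explicit encoding yields unicode text directly,
# so no implicit ASCII decode ever happens downstream.
with io.open(path, "r", encoding="utf8") as fin:
    lines = [line.strip() for line in fin]

print(len(lines[0]))  # the non-ASCII characters survive the round trip
```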

I'm closing the issue.

edivadiranatnom commented 4 years ago

Hi there, I fixed line 86 in targetid.py as described above, but now I'm getting another encoding error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 8: ordinal not in range(128)

in raw_data.py on line 21:

File "/home/davide/open-sesame/sesame/targetid.py", line 88, in <module>
 instances = [make_data_instance(line, i) for i,line in enumerate(fin)]

File "sesame/raw_data.py", line 21, in make_data_instance
 i+1, tokenized[i], lemmatized[i], pos_tagged[i], index) for i in range(len(tokenized))]

Any idea?