smartschat / cort

A toolkit for coreference resolution and error analysis.
MIT License

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128) #14

Closed minhlab closed 6 years ago

minhlab commented 7 years ago

I was trying to load a file composed of all gold sentences in the CoNLL-2012 dev set when this error occurred. Below is the full stack trace:

In [2]: reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-57d8e778731d> in <module>()
----> 1 reference = corpora.Corpus.from_file("reference", open("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt"))

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_file(description, coref_file)
     77
     78         return Corpus(description, sorted([from_string(doc) for doc in
---> 79                                            document_as_strings]))
     80
     81

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/corpora.pyc in from_string(string)
     12
     13 def from_string(string):
---> 14     return documents.CoNLLDocument(string)
     15
     16

/Users/cumeo/anaconda/lib/python2.7/site-packages/cort/core/documents.pyc in __init__(self, document_as_string)
    399         sd = StanfordDependencies.get_instance()
    400         dep_trees = sd.convert_trees(
--> 401             [parse.replace("NOPARSE", "S") for parse in parses],
    402         )
    403         sentences = []

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in convert_trees(self, ptb_trees, representation, universal, include_punct, include_erased, **kwargs)
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.pyc in <genexpr>((ptb_tree,))
    114                       include_erased=include_erased)
    115         return Corpus(self.convert_tree(ptb_tree, **kwargs)
--> 116                       for ptb_tree in ptb_trees)
    117
    118     @abstractmethod

/Users/cumeo/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.pyc in convert_tree(self, ptb_tree, representation, include_punct, include_erased, add_lemmas, universal)
     85         self._raise_on_bad_input(ptb_tree)
     86         self._raise_on_bad_representation(representation)
---> 87         tree = self.treeReader(ptb_tree)
     88         if tree is None:
     89             raise ValueError("Invalid Penn Treebank tree: %r" % ptb_tree)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

The data looks like this:

Minhs-MacBook-Pro:EvEn cumeo$ head output/Thu-Jan-12-17-22-15-CET-2017.gold.txt
#begin document (bc/cctv/00/cctv_0000); part 000
bc/cctv/00/cctv_0000    0   0   In  IN  (TOP(S(PP*  -   -   -   Speaker#1   *   *   *   *-
bc/cctv/00/cctv_0000    0   1   the DT  (NP(NP* -   -   -   Speaker#1   (DATE*  *   *   *   -
bc/cctv/00/cctv_0000    0   2   summer  NN  *)  summer  -   1   Speaker#1   *   *   *   *   -
bc/cctv/00/cctv_0000    0   3   of  IN  (PP*    -   -   -   Speaker#1   *   *   *   *   -
bc/cctv/00/cctv_0000    0   4   2005    CD  (NP*))))    -   -   -   Speaker#1   *)  *   *   *-
bc/cctv/00/cctv_0000    0   5   ,   ,   *   -   -   -   Speaker#1   *   *   *   *   -
bc/cctv/00/cctv_0000    0   6   a   DT  (NP(NP* -   -   -   Speaker#1   *   (ARG0*  *   *   -
bc/cctv/00/cctv_0000    0   7   picture NN  *)  picture -   8   Speaker#1   *   *)  *   *   -
bc/cctv/00/cctv_0000    0   8   that    WDT (SBAR(WHNP*)    -   -   -   Speaker#1   *   (R-ARG0*)   **  -

Does anyone have an idea how to fix this?

Best regards, Minh

smartschat commented 7 years ago

The error happens in PyStanfordDependencies. It looks like some Penn Treebank non-terminals in the CoNLL gold data contain non-ASCII characters. I'll have a look at the issue, but it probably requires fixes to PyStanfordDependencies.
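
To see which lines of the input are affected, something like the following sketch could scan the file for non-ASCII bytes. The helper name `find_non_ascii_lines` is hypothetical and not part of cort or PyStanfordDependencies, and the sketch assumes the file is UTF-8 encoded.

```python
import io


def find_non_ascii_lines(path):
    """Return (line_number, text) pairs for lines that contain non-ASCII bytes."""
    hits = []
    # Read raw bytes so that no implicit decoding can fail.
    with io.open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            if any(byte > 127 for byte in bytearray(raw)):
                # Assumption: the file is UTF-8; undecodable bytes are replaced.
                text = raw.decode("utf-8", errors="replace").rstrip("\n")
                hits.append((number, text))
    return hits
```

Calling `find_non_ascii_lines("output/Thu-Jan-12-17-22-15-CET-2017.gold.txt")` would list the candidate lines, which can then be compared against the parse column that is passed to the tree reader.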

Due to different string handling, the error does not happen in Python 3. Is using Python 3 an option for you?
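
As a rough illustration of that difference: in Python 2, a byte string that reaches a Unicode API is decoded implicitly with the ASCII codec, whereas Python 3 keeps `bytes` and `str` separate and decodes text explicitly. The sketch below reproduces the same kind of failure on a made-up parse fragment (the bytes are an assumption for illustration, not taken from the actual file) and shows that decoding as UTF-8 succeeds.

```python
# Hypothetical parse fragment: 0xef starts a multi-byte UTF-8 sequence.
raw = b"(TOP(S(NP \xef\xbc\x88 ))"

# Roughly what Python 2 does implicitly when bytes meet a Unicode API.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as error:
    message = str(error)

# Decoding explicitly as UTF-8 yields a proper Unicode string.
text = raw.decode("utf-8")
```

In Python 3, `open()` in text mode already returns decoded `str` objects, so the downstream code never sees raw bytes. Passing `encoding="utf-8"` to `open()` avoids depending on the platform's default encoding.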

minhlab commented 7 years ago

After some searching, I realized that I can install cort for Python 3 using pip3 install. I can run the visualization now. Thanks!