UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

lfoppiano commented 8 years ago

Dear all, after reading the paper and finding it very interesting, I wanted to try the application, so I cloned the repository. I ran the setup (using virtualenv) and launched:

python author_disambiguation.py

But then I get the following error:

Traceback (most recent call last):
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/numpy/lib/format.py", line 638, in read_array
    array = pickle.load(fp, **pickle_kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lfoppiano/development/inria/disambiguation/paper-author-disambiguation/beard/examples/author_disambiguation.py", line 74, in <module>
    X = data["X"]
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/numpy/lib/npyio.py", line 224, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/numpy/lib/format.py", line 644, in read_array
    "to numpy.load" % (err,))
UnicodeError: Unpickling a python object failed: UnicodeDecodeError('ascii', b'Salt, Jos\xc3\xa9', 9, 10, 'ordinal not in range(128)')
You may need to pass the encoding= option to numpy.load

I've tried passing different encodings when loading the file:

data = np.load("data/author-disambiguation.npz")

but with no success.

What am I missing?

glouppe commented 8 years ago

Hi! Can you try the following:

data = np.load("data/author-disambiguation.npz", encoding="latin1")
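For context, here is a minimal sketch of why the encoding matters. It assumes the .npz file was written under Python 2, so the names inside it are pickled as byte strings. The hand-written pickle below is only an illustration built from the bytes shown in the traceback, not the actual file contents.

```python
import pickle

# Hypothetical stand-in for one pickled Python 2 `str`, using the bytes
# reported in the traceback ("Salt, José" encoded as UTF-8).
# Layout: PROTO 2, SHORT_BINSTRING of length 11, BINPUT 0, STOP.
py2_pickle = b"\x80\x02U\x0bSalt, Jos\xc3\xa9q\x00."

# Python 3 decodes Python 2 byte strings as ASCII by default, which fails
# on the first non-ASCII byte (0xc3 at position 9).
try:
    pickle.loads(py2_pickle)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# latin1 maps every byte to a code point, so loading never fails, but
# UTF-8 multi-byte characters come out garbled.
as_latin1 = pickle.loads(py2_pickle, encoding="latin1")

# Round-tripping through latin1 recovers the original bytes, which can
# then be decoded as UTF-8.
repaired = as_latin1.encode("latin1").decode("utf-8")

# encoding="bytes" skips decoding and returns the raw bytes instead.
as_bytes = pickle.loads(py2_pickle, encoding="bytes")

print(default_failed, as_latin1, repaired, as_bytes)
```

np.load passes its encoding argument through to the unpickler, so the same behaviour applies when reading the array. Note that recent NumPy releases also require allow_pickle=True to load object arrays.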

natsheh commented 8 years ago

@lfoppiano I recommend trying out the example here: https://github.com/inspirehep/beard/tree/master/examples/applications/author-disambiguation

lfoppiano commented 8 years ago

@glouppe I've also tried that, and I got another error:

Traceback (most recent call last):
  File "/Users/lfoppiano/development/inria/disambiguation/paper-author-disambiguation/beard/examples/author_disambiguation.py", line 86, in <module>
    block_clusterer.fit(X)
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/beard-0.0-py3.5.egg/beard/clustering/blocking.py", line 319, in fit
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/beard-0.0-py3.5.egg/beard/clustering/blocking.py", line 178, in _validate
  File "/Users/lfoppiano/development/inria/inria-virtualenv/lib/python3.5/site-packages/beard-0.0-py3.5.egg/beard/clustering/blocking_funcs.py", line 385, in block_last_name_first_initial
TypeError: string indices must be integers

@natsheh I haven't seen that page. I'll try it out.

I thought author_disambiguation.py was a ready-made script for quickly running the whole pipeline. Isn't it?

glouppe commented 8 years ago

author_disambiguation.py is a very simplified version of what is described in the paper. To reproduce our results, you should check examples/applications/author-disambiguation instead.

lfoppiano commented 8 years ago

@glouppe OK, thanks :)

MSusik commented 8 years ago

And an example of the input data is available here: https://github.com/inspirehep/beard/tree/master/examples/data

lfoppiano commented 8 years ago

Indeed. I've managed to run sampling.py (using Python 2). How should I generate ethnicity_estimator.pickle?

lfoppiano commented 8 years ago

I'm closing this issue, since the problem was caused by my using Python 3.

inspirehep / beard

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128) #87