mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents
http://chemdataextractor.org
MIT License

RSCHTMLReader throws bytes/string error #8

Closed chemlynx closed 7 years ago

chemlynx commented 7 years ago

I'm using Python 3.5 and trying to process an RSC article (10.1039/C6OB02074G).

I see the error:

TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'

The issue seems to be with the replace_rsc_img_chars function in rsc.py.

Looking at it, the matches obtained from parsing the entity XPath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
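For anyone wanting to reproduce this class of error outside the library, here is a minimal sketch. It is not the code from rsc.py: the function name `format_entity_bytes` and the template are hypothetical, chosen only to show a bytes template being formatted with values that may be either bytes or unicode strings.

```python
def format_entity_bytes(u1, u2):
    # Hypothetical stand-in for the failing step: a *bytes* template
    # formatted with two values taken from the parsed document.
    return b'&#x%s%s;' % (u1, u2)


def demo():
    # Bytes values format cleanly into a bytes template.
    ok = format_entity_bytes(b'20', b'ac')

    # Under Python 3, unicode (str) values cannot be inserted into a
    # bytes template, so this raises TypeError.
    try:
        format_entity_bytes('20', 'ac')
        failed = False
    except TypeError:
        failed = True
    return ok, failed


if __name__ == '__main__':
    print(demo())
```

Under Python 2 the second call would not fail, because plain string literals are byte strings there, which is probably why the problem only shows up on Python 3.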

mcs07 commented 7 years ago

Thanks. There have been a lot of these encoding bugs because I haven't properly tested under Python 3. In this case, it is because the lxml parser returns byte strings in Python 2, but unicode strings in Python 3. I've committed a fix, and will push a new version pending testing.
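The committed fix is not shown in this thread. As an illustration only, one common way to handle values that may arrive as either bytes or unicode is to normalize them to text before formatting. The helper names `to_text` and `format_entity_text` below are assumptions made for this sketch, not part of ChemDataExtractor or lxml.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_entity_text(u1, u2):
    # Hypothetical replacement step: a *text* template, with both inputs
    # normalized to text so bytes and unicode inputs behave the same.
    return u'&#x{}{};'.format(to_text(u1), to_text(u2))


if __name__ == '__main__':
    print(format_entity_text('20', 'ac'))
    print(format_entity_text(b'20', b'ac'))
```

Normalizing at the point where values leave the parser keeps the rest of the code working with a single string type, whichever one the parser returned.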