RSCHTMLReader throws bytes/string error

mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents

MIT License


I'm using Python 3.5 and trying to process an RSC article (DOI 10.1039/C6OB02074G) with RSCHTMLReader.

I see the error:

TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'

The issue seems to be with the replace_rsc_img_chars function in rsc.py.

Looking at it, the matches obtained from parsing the entity XPath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
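To illustrate, here is a minimal sketch of the failure and one possible way around it. It does not reproduce the actual code in rsc.py: the function names reproduce_error and hex_to_char are hypothetical, and it assumes u1 and u2 are four-digit hexadecimal code points held as text, which is how I read the function.

```python
def reproduce_error(u1):
    """Try to format a text value into a bytes template, as described above.

    On Python 3.5+ this raises TypeError because %s in a bytes template
    behaves like %b and only accepts bytes-like values. Returns the error
    message so the failure can be inspected.
    """
    try:
        return b'\\u%s' % u1
    except TypeError as exc:
        return str(exc)


def hex_to_char(u1, u2=None):
    """Hypothetical replacement: build the character(s) from the hex codes.

    Assumes u1 (and u2, if present) are hex code point strings such as '03b1'.
    Working with text throughout avoids mixing bytes and str.
    """
    rep = chr(int(u1, 16))
    if u2:
        rep += chr(int(u2, 16))
    return rep


if __name__ == '__main__':
    print(reproduce_error('03b1'))      # prints the TypeError message
    print(hex_to_char('03b1'))          # Greek small letter alpha
    print(hex_to_char('0041', '0301'))  # 'A' followed by a combining acute accent
```

A fix along these lines would only work on Python 3. If the function also has to run on Python 2, formatting a text template and then decoding it with the unicode-escape codec may be a better fit.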

RSCHTMLReader throws bytes/string error #8