Closed (chemlynx closed 7 years ago)
Thanks. There have been a lot of these encoding bugs because I didn't test properly under Python 3. In this case, the cause is that the lxml parser returns byte strings in Python 2 but unicode strings in Python 3. I've committed a fix and will push a new version pending testing.
I'm using Python 3.5 and trying to process an RSC article (10.1039/C6OB02074G).
I see the error:
TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'
The issue seems to be with the replace_rsc_img_chars function in rsc.py.
Looking at it, the matches obtained from parsing the entity XPath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
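
For anyone hitting the same thing, the mismatch can be reproduced outside the library. This is only a minimal Python 3 sketch, not the code in rsc.py: the `<char>` template, the sample value, and the `to_bytes`/`to_text` helpers are hypothetical stand-ins for whatever the real function builds from u1 and u2.

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is a text string."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical stand-in for a unicode match such as u1 or u2.
match = "alpha"

# Formatting a str into a bytes template raises TypeError on Python 3.5+.
raised = False
try:
    b"<char>%s</char>" % match
except TypeError:
    raised = True

# Option 1: stay in bytes by encoding the match before formatting.
fixed_bytes = b"<char>%s</char>" % to_bytes(match)

# Option 2: stay in text by decoding the template before formatting.
fixed_text = to_text(b"<char>%s</char>") % match
```

Either option works as long as both sides of the `%` are the same type; which one fits depends on whether the surrounding code expects bytes or text.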