sensiblecodeio / scraperwiki-python

ScraperWiki Python library for scraping and saving data
https://scraperwiki.com
BSD 2-Clause "Simplified" License

Always return UTF-8 strings from pdftoxml. #76

Closed by petterreinholdtsen 9 years ago

petterreinholdtsen commented 9 years ago

The pdftoxml method on the old scraperwiki site returned UTF-8 strings. Change this version to do the same.

Fixes issue #38

scraperdragon commented 9 years ago

The text of the pull request doesn't quite make sense compared to the code.

xmldata.decode('utf-8') takes a series of bytes (i.e. the UTF-8 encoded output of pdftohtml) and decodes them into Python unicode objects, e.g.:

>>> '\xc2\xa3'.decode('utf-8')
u'\xa3'
>>> print '\xc2\xa3'.decode('utf-8')
£
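
The session above is Python 2. As a rough sketch of the same distinction in Python 3 (where the byte type is bytes and the text type is str), the helper below decodes UTF-8 bytes into text and passes through values that are already text. The name to_text is hypothetical and is not part of the scraperwiki library; it only illustrates the bytes-versus-unicode difference being discussed.

```python
# Python 3 sketch of the bytes -> unicode step discussed above.
# `to_text` is a hypothetical helper, not part of the scraperwiki API.

def to_text(data, encoding="utf-8"):
    """Return `data` as a text (unicode) string.

    bytes are decoded with `encoding`; str values are returned unchanged.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


raw = b"\xc2\xa3"        # the pound sign encoded as two UTF-8 bytes
text = to_text(raw)      # one unicode code point, U+00A3

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))

# Encoding the text again gives back the original bytes.
assert text.encode("utf-8") == raw
```

Returning text rather than encoded bytes means callers do not have to know which encoding pdftohtml happened to emit.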

So I believe that the actual thing you're doing is:

"Always return unicode strings from pdftoxml; The pdftoxml method on the old scraperwiki site returned unicode strings. Change this version to do the same. Fixes issue #38"

(Which makes a lot of sense - everything should return native unicode strings)

Could you amend the commit message?

petterreinholdtsen commented 9 years ago

[Dragon Dave McKee]

> So I believe that the actual thing you're doing is:
>
> "Always return unicode strings from pdftoxml; The pdftoxml method on the old scraperwiki site returned unicode strings. Change this version to do the same. Fixes issue #38"
>
> (Which makes a lot of sense - everything should return native unicode strings)
>
> Could you amend the commit message?

I could, but I am not quite sure how to replace it in an already pushed branch without rebasing. Do you want me to rebase the patch and submit it again with a new comment? Or did I misunderstand your suggestion?

Happy hacking,
Petter Reinholdtsen

petterreinholdtsen commented 9 years ago

Without any feedback, I took a wild guess: I created a new branch with a single commit carrying a different commit message, and submitted it as pull request #78. If that is a better approach, this request can be closed.

scraperdragon commented 9 years ago

Accepted #78