scrapy / scrapely

A pure-python HTML screen-scraping library

iso-8859-1 #3

Closed ghost closed 13 years ago

ghost commented 13 years ago

Trying to scrape pages with a content-encoding of iso-8859-1 throws a unicode error:

url1 = 'http://www[DOT]getmobile[DOT]de/handy/NO68128,Nokia-C3-01-Touch-and-Type.html' #url changed to prevent backlinking
data = {'name': 'Nokia C3-01 Touch and Type', 'price': '129,00'}
s.train(url1,data)
Traceback (most recent call last):
  File "", line 1, in
  File "build/bdist.macosx-10.6-universal/egg/scrapely/init.py", line 32, in train
  File "build/bdist.macosx-10.6-universal/egg/scrapely/init.py", line 50, in _get_page
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1512-1514: invalid data

pablohoffman commented 13 years ago

Did you try passing the encoding='iso-8859-1' argument to train()?
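To illustrate why naming the encoding helps, here is a minimal standard-library sketch (Python 3, not scrapely itself; the sample string is made up). It shows that bytes encoded as ISO-8859-1 generally cannot be decoded as UTF-8, which is the kind of failure the traceback above reports, while decoding with the right codec works:

```python
# Made-up sample text containing non-ASCII characters (an assumption for
# illustration; it is not taken from the page in the report).
original = "Zubehör für Handys"

# Bytes as a server would send them for an ISO-8859-1 page.
raw = original.encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails, because 0xF6 ("ö" in ISO-8859-1)
# is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the page actually uses recovers the text.
text = raw.decode("iso-8859-1")
```

Passing the encoding explicitly to train() presumably has the same effect inside the library: the page bytes are decoded with the codec you name instead of the UTF-8 default.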

ghost commented 13 years ago

Ah sorry - missed that in the Documentation. It works :-)

Auto-detecting the encoding from the HTTP response or the HTML meta tag would be nice to have.

pablohoffman commented 13 years ago

Yes, though the scrapely library is really meant to be used with unicode.

For auto-detecting response encodings you can use other libraries that already do that job very well (like Scrapy), since it's not trivial at all to discover a web page's encoding reliably.
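For readers wondering what the requested auto-detection involves, below is a rough sketch of the two sources mentioned in this thread: the charset in the HTTP Content-Type header and the charset in an HTML meta tag. The function name guess_encoding and its parameters are hypothetical and not part of scrapely or Scrapy. As noted above, doing this reliably is not trivial; this sketch ignores byte-order marks, mislabeled pages, and content-based guessing, so treat it as an illustration only.

```python
import re

# charset=... inside a Content-Type header value (text).
CHARSET_IN_HEADER = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# charset=... inside a <meta> tag of the raw (undecoded) HTML body (bytes).
CHARSET_IN_META = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def guess_encoding(content_type, body, default="utf-8"):
    """Hypothetical helper: return a declared encoding name, or a default.

    content_type: the Content-Type header value (str) or None.
    body: the raw response body as bytes.
    """
    # 1. Prefer the charset declared in the HTTP header.
    match = CHARSET_IN_HEADER.search(content_type or "")
    if match:
        return match.group(1).lower()

    # 2. Otherwise look for a meta declaration near the start of the document.
    match = CHARSET_IN_META.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. Nothing declared: fall back to the default (an assumption).
    return default


# Example with a made-up response that declares its encoding only in the HTML.
html = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1"></head></html>'
)
detected = guess_encoding("text/html", html)
```

The detected name could then be passed as the encoding argument to train(), as suggested earlier in the thread.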