Open conradlee opened 14 years ago
Hmm, well, when it's displayed in this issue form, it is displayed correctly. However, when I open the output file in a text editor that is set to use UTF-8 as its default encoding, the name appears as "Cirl & eacute ; sio Cunha" [I inserted the spaces so GitHub doesn't correct it again].
You're right, it's actually pulling the raw HTML from the page, so you end up with entity encoding for non-ASCII characters. As a temporary workaround I actually decode these using html_entity_decode() when I'm doing further processing on the output data, but I need to put that into the crawler itself. I'll add that into the script and check it in once I've tested it.
Incidentally, my very first bug, I'm stoked! :) Thanks for reporting, it's good to know people are actually using it.
Ok, thanks for that answer. Knowing that, I know how to correctly decode the text in Python. Maybe this isn't a bug after all, more like a missing feature (UTF-8 encoding).
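For anyone else hitting this, here is a minimal sketch of the Python-side decoding, not the scraper's own fix. It assumes each output line is the profile id, a space, and then a JSON object, as in the example in this issue, and that the name arrives with HTML entities in it. `decode_line` and the `raw` sample are hypothetical names made up for illustration; `html.unescape` and `json.loads` are from the Python 3 standard library.

```python
import html
import json


def decode_line(line):
    """Split one scraper output line and decode HTML entities in its string fields.

    Assumes the format "<profile id> <JSON object>" (an assumption based on
    the sample output in this issue).
    """
    profile_id, _, payload = line.partition(" ")
    record = json.loads(payload)
    for key, value in record.items():
        # Only top-level string fields (e.g. name, location) are decoded.
        if isinstance(value, str):
            record[key] = html.unescape(value)
    return profile_id, record


# Hypothetical sample line, with the entity written the way it shows up in the output file.
raw = (
    '105058448954104555632 {"name":"Cirl&eacute;sio Cunha",'
    '"location":"Blumenau",'
    '"mentions":["105058448954104555632","105058448954104555632"]}'
)

profile_id, record = decode_line(raw)
print(profile_id, record["name"])
```

If the decoded records are written back to disk, opening the file with `open(path, "w", encoding="utf-8")` and calling `json.dumps(record, ensure_ascii=False)` should keep the accented characters as real UTF-8 rather than escapes.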
Is this scraper set up to properly encode UTF-8? Somewhere in my use of the scraper, the character encoding is getting messed up. Here's a simple example.
Let's assume we want to scrape the id 105058448954104555632
If you go to www.google.com/profiles/105058448954104555632, you will see that the name is one that can't be represented in ASCII. If I use your scraper to get the data from this page, it returns
105058448954104555632 {"name":"Cirlésio Cunha","location":"Blumenau","mentions":["105058448954104555632","105058448954104555632"]}
Note that the name is garbled.