petewarden / buzzprofilecrawl

A simple script to crawl Google Profile pages and extract their information as structured data
http://petewarden.typepad.com/

UTF-8 not correctly encoded #1

Open conradlee opened 14 years ago

conradlee commented 14 years ago

Is this scraper set up to properly encode UTF-8? Somewhere in my use of the scraper, the character encoding is getting messed up. Here's a simple example.

Let's assume we want to scrape the id 105058448954104555632

If you go to www.google.com/profiles/105058448954104555632, you will see that the name can't be represented in ASCII. If I use your scraper to get the data from this page, it returns:

105058448954104555632 {"name":"Cirlésio Cunha","location":"Blumenau","mentions":["105058448954104555632","105058448954104555632"]}

Note that the name is garbled.

conradlee commented 14 years ago

Hmm, well, when it's displayed in this issue form, it is displayed correctly. However, when I open the output file in a text editor that uses UTF-8 as its default encoding, the name appears as "Cirl & eacute ; sio Cunha" (I inserted the spaces so GitHub doesn't correct it again).

petewarden commented 14 years ago

You're right, it's actually pulling the raw HTML from the page, so you end up with entity encoding for non-ASCII characters. As a temporary workaround I actually decode these using html_entity_decode() when I'm doing further processing on the output data, but I need to put that into the crawler itself. I'll add that into the script and check it in once I've tested it.

Incidentally, my very first bug, I'm stoked! :) Thanks for reporting, it's good to know people are actually using it.

conradlee commented 14 years ago

OK, thanks for that answer. Knowing that, I know how to correctly decode the text in Python. Maybe this isn't a bug after all, more like a missing feature (UTF-8 encoding).
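
For anyone else hitting this, here is a minimal sketch of that Python-side decoding. It is not part of the crawler. It assumes each output line is a profile id, some whitespace, then a JSON object, as in the example above. The helper names `decode_record` and `unescape_all` are made up for illustration.

```python
import html
import json


def unescape_all(value):
    """Recursively replace HTML entities (e.g. &eacute;) in every string."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_all(item) for item in value]
    if isinstance(value, dict):
        return {key: unescape_all(item) for key, item in value.items()}
    return value


def decode_record(line):
    """Split one output line into (profile_id, data) with entities decoded.

    Assumption: the id and the JSON payload are separated by whitespace
    (tab or space); the actual separator used by the crawler may differ.
    """
    profile_id, raw_json = line.strip().split(None, 1)
    return profile_id, unescape_all(json.loads(raw_json))


def encode_record(profile_id, data):
    """Serialize a record back to one line, keeping non-ASCII characters as-is."""
    return profile_id + "\t" + json.dumps(data, ensure_ascii=False)
```

When writing the decoded lines back out, open the file with `encoding="utf-8"` so the result is real UTF-8 rather than entity-encoded ASCII.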