petewarden / dstk

A collection of the best open data sets and open-source tools for data science
http://www.datasciencetoolkit.org/

html2story UTF-8 issue #1

Open petewarden opened 13 years ago

petewarden commented 13 years ago

From email:

I had just tried out the html2story API and sent in a UTF-8 encoded HTML string. The results were great, except that all the accented characters (e.g. [áéíóöőúüű] - all the Hungarian vowels) were sent back as "??".

petewarden commented 13 years ago

Here's a page that reproduces the problem:

http://nol.hu/belfold/20110326-kontur_pal__a_telt_haz

petewarden commented 13 years ago

[cc-ed from email to reporter]

I wasn't able to reproduce it in the first test I tried, so I must be doing different steps. I wondered if I could get some more details from you? Here's what I'm trying:

Running OS X 10.6.6, in Terminal.app:

curl "http://nol.hu/belfold/20110326-kontur_pal__a_telt_haz" > tests/data/hungarian.html
html2story tests/data/hungarian.html

I see results like:

tasika | 2011. március 26. | 19:57:52 KOORMI001. MILYEN LÓRÓL BESZÉLSZ ? ÉN BÍZOK BENNE , HOGY LÓ ÉS SZAMÁR KEVERÉK ! ...

Which operating system and steps are you using?

petewarden commented 13 years ago

I've found what the difference was. I was running a local server on my OS X machine, but when I use the main http://www.datasciencetoolkit.org server, I see the ??'s.

petewarden commented 13 years ago

It looks like it was related to the default file encoding assumed by Java. I added a switch to the command line that runs boilerpipe so that it assumes UTF-8, and it now seems to work.

For version 0.40
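
For anyone hitting something similar, here is a minimal sketch of how a wrong default encoding can turn each accented character into "??". It is written in Python purely for illustration and is not the boilerpipe or dstk code. The thread does not say which default charset the server's Java picked up, so US-ASCII is an assumption, chosen because it reproduces the reported pattern. The `roundtrip` helper is hypothetical.

```python
# Illustration only: not the dstk/boilerpipe code path.
# Assumption: the server-side default charset was US-ASCII (not stated in the thread).

def roundtrip(text: str, assumed_charset: str) -> str:
    """Simulate a tool that reads UTF-8 input under an assumed charset and writes it back."""
    raw = text.encode("utf-8")  # bytes as sent by the client
    # Bytes the assumed charset cannot decode become U+FFFD, one per byte.
    decoded = raw.decode(assumed_charset, errors="replace")
    # On output, characters the charset cannot encode become "?".
    written = decoded.encode(assumed_charset, errors="replace")
    return written.decode(assumed_charset)


sample = "március"
print(roundtrip(sample, "ascii"))  # m??rcius
print(roundtrip(sample, "utf-8"))  # március
```

Each of the Hungarian accented vowels takes two bytes in UTF-8, which is why every one of them comes back as two question marks under this assumption.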