Open petewarden opened 13 years ago
Here's a page that reproduces the problem:
[cc-ed from email to reporter]
I wasn't able to reproduce it in the first test I tried, so I must be doing different steps. I wondered if I could get some more details from you? Here's what I'm trying:
Running OS X 10.6.6, in Terminal.app:
    curl "http://nol.hu/belfold/20110326-kontur_pal__a_telt_haz" > tests/data/hungarian.html
    html2story tests/data/hungarian.html
I see results like:
tasika | 2011. március 26. | 19:57:52 KOORMI001. MILYEN LÓRÓL BESZÉLSZ ? ÉN BÍZOK BENNE , HOGY LÓ ÉS SZAMÁR KEVERÉK ! ...
Which operating system and steps are you using?
I've found what the difference was. I was running a local server on my OS X machine, but when I use the main http://www.datasciencetoolkit.org server, I see the ??'s.
It looks like the problem was related to the default file encoding assumed by Java. I added a switch to the command line that runs boilerpipe so that it assumes UTF-8, and it now seems to work.
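For anyone hitting the same symptom, here is a rough sketch of why each accented character would come back as exactly two question marks. It is a Python model of the behaviour, not the actual boilerpipe or Java code, and it assumes the server's default encoding was ASCII-like; the function names are made up for illustration. The thread does not say which switch was added; the JVM property `-Dfile.encoding=UTF-8` is one common way to change the default, but that is a guess.

```python
def through_ascii_pipeline(text: str) -> str:
    """Hypothetical model: UTF-8 input is read and written with an ASCII default."""
    raw = text.encode("utf-8")  # bytes as sent by the client
    # Every byte >= 0x80 is invalid ASCII and becomes U+FFFD on read.
    misread = raw.decode("ascii", errors="replace")
    # U+FFFD cannot be written as ASCII either, so it becomes "?" on output.
    return misread.encode("ascii", errors="replace").decode("ascii")


def through_utf8_pipeline(text: str) -> str:
    """Same round trip, but with UTF-8 assumed on read."""
    return text.encode("utf-8").decode("utf-8")


sample = "m\u00e1rcius"  # "március"
print(through_ascii_pipeline(sample))           # m??rcius
print(through_utf8_pipeline(sample) == sample)  # True
```

Each Hungarian accented vowel takes two bytes in UTF-8, which would explain one "??" pair per character under this assumption.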
For version 0.40
From email:
I had just tried to mess with the html2story API, and sent a UTF-8 encoded HTML string in. The results were great, except all the accented characters (e.g. [áéíóöőúüű] - all the Hungarian vowels) were sent back as "??".