GoogleCodeExporter closed 8 years ago
Relying on UTF-8 as the default would be plain wrong.
According to RFC 2616 (HTTP/1.1), ISO-8859-1 is the default charset encoding.
We're already relaxing it to Windows-1252 (Cp1252).
If you need to change the default encoding for your setup, simply adjust the
following line in the HTMLFetcher class:
Charset cs = Charset.forName("Cp1252");
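If editing the class is not an option, one possible approach is to take the charset from the Content-Type header when the server sends one and fall back to Cp1252 otherwise. The sketch below is only an illustration: CharsetResolver and resolve are hypothetical names, not part of the library, and the regular expression is a simplified assumption about how the header is formatted.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library discussed in this thread.
public class CharsetResolver {
    // Simplified pattern for the "charset=..." parameter of a Content-Type header.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    public static Charset resolve(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback below.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = Charset.forName("Cp1252");
        System.out.println(resolve("text/html; charset=UTF-8", fallback));
        System.out.println(resolve("text/html", fallback));
    }
}
```

With this shape the Cp1252 default stays in place for servers that declare nothing, while pages that declare UTF-8 are decoded as UTF-8.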
Original comment by ckkohl79
on 7 Jul 2011 at 1:45
I understand that defaulting to UTF-8 could be wrong.
However, when the source of
http://www.buddymedia.com/newsroom/2011/06/hearst-magazines-digital-media-partners-with-buddy-media-to-launch-a-scalable-social-platform-on-facebook-for-thirteen-hearst-brands/#more-10378
is passed as the HTML string 'a' to ast.INSTANCE.getText(a) (where ast is an
ArticleExtractor object), the same problem occurs: the input seems to be
interpreted as Latin-1. How can that be fixed, or how can I make it default to
UTF-8?
Thanks.
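For what it's worth, a Java String is already decoded text, so if the characters are wrong by the time the string reaches getText, the damage probably happened earlier, when the raw bytes were turned into a String. The sketch below only illustrates that effect with the standard library; DecodeDemo and decode are hypothetical names, and the sample text is made up for the demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the charset used for decoding changes the result.
public class DecodeDemo {
    // Decodes raw page bytes with an explicitly chosen charset.
    public static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // "café" as a UTF-8 server would send it: the é takes two bytes.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with Cp1252 turns each of those two bytes into its own character.
        String wrong = decode(raw, Charset.forName("Cp1252"));
        // Decoding with UTF-8 restores the original text.
        String right = decode(raw, StandardCharsets.UTF_8);

        System.out.println(wrong);
        System.out.println(right);
        // A correctly decoded string like `right` is what should be handed
        // to getText(...) as described in the comment above.
    }
}
```

If you fetch the page yourself, decoding the bytes as UTF-8 before building the HTML string should avoid the Latin-1 style garbling, assuming the page really is served as UTF-8.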
Original comment by amita...@gmail.com
on 2 Aug 2011 at 3:28
Original issue reported on code.google.com by
tonio.wa...@gmail.com
on 14 Jun 2011 at 4:14