graetzer / GDataXML-HTML

HTML and XML parser for iOS and OSX, supports XPath
Apache License 2.0

HTML Encoding #1

Closed muharremozkan closed 12 years ago

muharremozkan commented 12 years ago

Hi,

I have a problem with the encoding of HTML pages. For example, Turkish characters are encoded incorrectly. How can I change the encoding according to the HTML meta tags?

Best regards.

graetzer commented 12 years ago

Well, internally libxml2 uses only UTF-8 as its encoding format. Thus every initWithHTMLData method assumes that the data is encoded in a proper format (UTF-8 or UTF-16). As far as I know, the Turkish characters are part of the Unicode character set, so this shouldn't be the problem.

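Actually, the parser treats the Turkish characters as content and should not change them. Depending on the language preferences of the computer, they should appear as either Icelandic or Turkish characters on the screen (ISO 8859-9 differs from ISO 8859-1 only in that six Icelandic letters are replaced by Turkish ones).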

But if the websites you want to parse use a different encoding, you will have to convert the data to UTF-8 with iconv or the libxml2 encoding module before parsing it. Unfortunately, you generally don't know the content of the HTML meta tags before you parse the entire page, but you could make an educated guess at the charset by searching for patterns in the raw data, for example for "ISO 8859-9".
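
To illustrate the conversion step, here is a minimal sketch using the standard iconv API. The helper name `to_utf8` is hypothetical and is not part of GDataXML-HTML or libxml2. It assumes the whole page is already in memory and that the source charset (here "ISO-8859-9") is known or guessed.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of GDataXML-HTML or libxml2.
 * Converts `len` bytes encoded in `from_charset` into a newly
 * allocated, NUL-terminated UTF-8 string.
 * Returns NULL on failure; the caller must free() the result. */
char *to_utf8(const char *in, size_t len, const char *from_charset)
{
    /* iconv_open takes the target charset first, then the source. */
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;                    /* charset unknown to iconv */

    /* Generous bound: a byte of a single-byte charset such as
     * ISO-8859-9 becomes at most 3 bytes in UTF-8. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;             /* iconv wants a non-const pointer */
    char *dst = out;
    size_t in_left = len;
    size_t out_left = cap - 1;          /* keep room for the terminator */

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {             /* invalid or truncated input */
        free(out);
        return NULL;
    }

    *dst = '\0';
    return out;
}
```

The resulting buffer (with `strlen` as its length) could then be wrapped in an NSData object and handed to one of the initWithHTMLData methods. On OS X and iOS you would typically need to link against libiconv. One possible way to make the educated guess is to try the conversion with "UTF-8" as the source charset first and fall back to "ISO-8859-9" only if that fails, although how strictly invalid input is rejected may depend on the iconv implementation.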

For more information on libxml2 internationalization support: http://xmlsoft.org/encoding.html