HTML Encoding - Githubissues

Internally, libxml2 uses only UTF-8 as its encoding. Every initWithHTMLData method therefore assumes that the data arrives in a proper Unicode encoding (UTF-8 or UTF-16). As far as I know, Turkish characters are part of the Unicode character set, so this shouldn't be the problem.

The parser treats the Turkish characters as content and should not change them. Depending on the language preferences of the computer, they should appear as either Icelandic or Turkish characters on the screen.
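One plausible reading of the Icelandic/Turkish mix-up is that ISO 8859-1 and ISO 8859-9 differ in exactly six byte values, so the same untouched bytes render differently depending on which charset is used to display them. A small Python sketch of that effect (the variable names are only illustrative):

```python
# The six byte values where ISO 8859-1 (Latin-1) and ISO 8859-9 (Latin-5) differ.
raw = bytes([0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE])

# Interpreted as ISO 8859-1, the bytes are Icelandic letters: ÐÝÞðýþ
as_latin1 = raw.decode("iso-8859-1")

# Interpreted as ISO 8859-9, the very same bytes are Turkish letters: ĞİŞğış
as_latin5 = raw.decode("iso-8859-9")
```

The bytes themselves never change here; only the interpretation does.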

But if the websites you want to parse use a different encoding, you will have to convert the data to UTF-8 with iconv or the libxml2 encoding module before parsing it. Unfortunately, you generally don't know the content of the HTML meta tags before you parse the entire page, but you could make an educated guess at the charset by searching for patterns in the raw data, for example for "ISO 8859-9".
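The guess-then-convert idea could look roughly like the following. This is a language-neutral sketch written in Python rather than code from GDataXML-HTML: `guess_charset` and `to_utf8` are hypothetical helper names, the 4096-byte scan window is an arbitrary choice, and falling back to ISO 8859-9 is an assumption that only makes sense for pages expected to be Turkish.

```python
import re

# Matches declarations such as <meta charset="iso-8859-9"> or
# content="text/html; charset=ISO-8859-9" in the raw, undecoded bytes.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def guess_charset(raw: bytes, default: str = "utf-8") -> str:
    """Guess the charset by scanning the start of the raw data."""
    head = raw[:4096]  # assumption: the declaration sits near the top
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"  # UTF-8 byte order mark
    match = _CHARSET_RE.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def to_utf8(raw: bytes, fallback: str = "iso-8859-9") -> bytes:
    """Re-encode raw page data as UTF-8 before handing it to a parser."""
    try:
        text = raw.decode(guess_charset(raw))
    except (LookupError, UnicodeDecodeError):
        # Unknown charset label, or bytes that do not fit the guess:
        # use the assumed fallback and replace anything undecodable.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

The resulting UTF-8 bytes are what would then be passed to the parser. A guess like this can be wrong, so it is worth checking the output on a few known pages.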

For more information on libxml2 internationalization support: http://xmlsoft.org/encoding.html

graetzer / GDataXML-HTML

HTML Encoding #1