Closed muharremozkan closed 12 years ago
Well, internally libxml2 uses only UTF-8 as its encoding format. Thus every initWithHTMLData method assumes that the data is encoded in a proper format (UTF-8 or UTF-16). As far as I know, the Turkish characters are part of the Unicode character set, so this shouldn't be the problem.
Actually, the parser treats the Turkish characters as content and should not change them. Depending on the language preferences of the computer, the same bytes should appear as either Icelandic or Turkish characters on the screen.
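To illustrate the Icelandic/Turkish point, here is a small Python sketch (not part of the parser; the variable names are only for illustration). It assumes the page bytes are in a single-byte Latin encoding and shows the same three bytes decoded as ISO 8859-1 and as ISO 8859-9:

```python
# The same three bytes, interpreted under two different single-byte encodings.
raw = b"\xfd\xf0\xfe"

# ISO 8859-1 (Latin-1) maps these bytes to letters used in Icelandic.
as_latin1 = raw.decode("iso-8859-1")

# ISO 8859-9 (Latin-5) maps the very same bytes to Turkish letters.
as_latin5 = raw.decode("iso-8859-9")

print(as_latin1)  # ýðþ
print(as_latin5)  # ığş
```

The bytes never change; only the interpretation does, which is why the parser has to be told (or has to guess) the right encoding.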
But if the websites you want to parse use a different encoding, you will have to convert the data to UTF-8 with iconv or the libxml2 encoding module before parsing it. Unfortunately, you generally don't know the content of the HTML meta tags before you parse the entire website, but you could make an educated guess at the charset by searching for patterns in the raw data, for example "ISO 8859-9".
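A rough sketch of that educated guess, written in Python rather than with iconv just to keep it short. The helper names `sniff_charset` and `to_utf8` are made up for this example, and it assumes the charset declaration sits in a meta tag within the first 4096 bytes of the page:

```python
import re

# Looks for charset=... inside a <meta> tag. It runs on the raw bytes,
# so the page does not have to be decoded before searching.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Guess the charset from the start of an HTML document (hypothetical helper)."""
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def to_utf8(raw: bytes) -> bytes:
    """Re-encode the page as UTF-8 so it can be handed to the parser."""
    charset = sniff_charset(raw)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to UTF-8 and
        # replace undecodable bytes instead of failing.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-9"></head>'
    "<body>ığşİ</body></html>"
).encode("iso-8859-9")

utf8_page = to_utf8(page)
```

After this step `utf8_page` holds UTF-8 bytes, which is what the initWithHTMLData methods expect. A real page can also declare its encoding in the HTTP Content-Type header, which this sketch ignores.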
For more information on libxml2 internationalization support: http://xmlsoft.org/encoding.html
Hi,
I have a problem with the encoding of HTML pages. For example, Turkish characters are encoded incorrectly. How can I change the encoding according to the HTML meta tags?
Best regards.