senjuhashirama / pugixml

Automatically exported from code.google.com/p/pugixml

Implement latin1 (ISO-8859-1) encoding autodetection #192

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

When an ISO-8859-1 input is saved to a buffer using a custom writer,
the umlauts get lost. Umlaut characters have codes in the range 128..255
in ISO-8859-1 (aka latin1), and the same codes in windows-1252.

   const std::string sInp = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n<text>Umlauts &amp; friends: ÄäÖöÜüß 123</text>\n";

   pugi::xml_document doc;
   pugi::xml_parse_result res = doc.load_buffer(sInp.c_str(), sInp.length(), pugi::parse_default, pugi::encoding_auto);

   // const std::string sText = doc.child_value("text");  // OK

   // save as latin1 to a string buffer via a custom writer:
   my_xml_string_writer sw;
   doc.save(sw, "  ", pugi::format_default, pugi::encoding_latin1);

   // THE BUG IS: the umlauts are missing!

This is the custom writer:

   struct my_xml_string_writer : pugi::xml_writer
   {
       std::string result;

       // Called by pugixml with each chunk of serialized output.
       virtual void write(const void* data, size_t size)
       {
           result.append(static_cast<const char*>(data), size);
       }
   };

What is the expected output? What do you see instead?

I expect the umlauts to be in the output. They are missing.

Which version of pugixml are you using? On what operating system/compiler?

pugixml v1.2, OS: Windows XP, Compiler: Visual Studio 2008 (C++).
It also happens under Debian Linux using the latest g++.

Please provide any additional information below.

The umlauts above were typed in using Windows codepage 1252, i.e. for these
characters it is practically ISO-8859-1.

Original issue reported on code.google.com by wernero...@googlemail.com on 3 Jan 2013 at 10:02

GoogleCodeExporter commented 9 years ago
Unfortunately, the current encoding auto-detection can't recognize latin1
inputs, so it assumes UTF-8. The likely cause is that the umlaut bytes do not
form valid UTF-8 sequences, so they get discarded during the UTF-8 -> latin1
conversion that happens on save.
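
To illustrate the explanation above, here is a minimal, library-independent sketch. It is not pugixml's code: `looks_like_utf8` and `latin1_to_utf8` are hypothetical helpers written for this example, and the validity check is simplified (it only verifies lead/continuation byte structure). It shows that the latin1 umlaut bytes do not form valid UTF-8 sequences, whereas decoding them as latin1 first yields valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of pugixml): simplified structural UTF-8 check.
// It validates lead bytes and continuation bytes only; overlong forms and
// surrogates are not rejected.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;

        if (c < 0x80) extra = 0;                  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // stray continuation or invalid lead byte

        if (extra > s.size() - i - 1) return false;  // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                     // expected a continuation byte

        i += extra + 1;
    }
    return true;
}

// Hypothetical helper (not part of pugixml): every latin1 byte maps directly
// to the Unicode code point with the same value, so bytes >= 0x80 become
// two-byte UTF-8 sequences.
std::string latin1_to_utf8(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For example, the latin1 bytes for "Ää" are `0xC4 0xE4`. `0xC4` looks like the lead byte of a two-byte UTF-8 sequence, but `0xE4` is not a continuation byte, so the input is rejected as UTF-8. After `latin1_to_utf8`, the same text passes the check.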

Can you check if explicitly specifying latin1 encoding during loading fixes the 
issue?

Original comment by arseny.k...@gmail.com on 3 Jan 2013 at 10:54

GoogleCodeExporter commented 9 years ago
> Can you check if explicitly specifying latin1 encoding during loading fixes the issue?

Yes, that did the trick!
Thank you very much!

Original comment by wernero...@googlemail.com on 4 Jan 2013 at 12:01

GoogleCodeExporter commented 9 years ago
Changing type to Enhancement to reflect the fact that Latin1 auto-detection is
not implemented (which is explicitly mentioned in the documentation).

Original comment by arseny.k...@gmail.com on 27 Jan 2014 at 12:50

GoogleCodeExporter commented 9 years ago

Original comment by arseny.k...@gmail.com on 8 Feb 2014 at 11:14

GoogleCodeExporter commented 9 years ago
This issue was moved to GitHub: https://github.com/zeux/pugixml/issues/16

Original comment by arseny.k...@gmail.com on 26 Oct 2014 at 9:08