Closed Phrogz closed 2 years ago
An alternative fix/workaround comes from the Stack Overflow question. Instead of:
xml1 = doc1.to_xml.encode('utf-8')
...use:
xml1 = doc1.to_xml(encoding:'utf-8')
This produces non-munged output.
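As a rough illustration of why the two calls are not equivalent, here is a plain-Ruby sketch (no Nokogiri involved, and the sample document string is made up). It shows that String#encode only transcodes the characters a string already holds: it cannot repair text that was damaged during serialization, and it leaves the XML declaration's encoding label as it was. Passing encoding: to to_xml instead asks the serializer itself to emit the target encoding.

```ruby
# Plain-Ruby sketch; the sample document below is hypothetical.
sample = %(<?xml version="1.0" encoding="UTF-16"?><root>h\u00E9llo</root>)

# Pretend this is what a serializer returned for a UTF-16 document.
utf16_xml = sample.encode('UTF-16LE')

# Transcoding after the fact changes the bytes and the string's encoding...
utf8_xml = utf16_xml.encode('UTF-8')

# ...but the declaration inside the text still claims UTF-16.
declaration_stale = utf8_xml.include?('encoding="UTF-16"')

puts utf8_xml.encoding
puts declaration_stale
```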
@Phrogz, thanks for opening this issue and apologies for the embarrassingly long time it's taken to respond.
This is likely a libxml2 parsing bug. It feels similar in nature to these:
I'll try to fix it and send a PR upstream; that might need a few days.
Phew, this was a tricky one to figure out, but it turns out that Nokogiri wasn't using the proper encoding after libxml2 flushed its internal buffer for the first time. As long as a UTF-16 document was longer than ~4000 code points, this bug would be triggered.
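To make the failure mode concrete, here is a small simulation in plain Ruby. It is not Nokogiri or libxml2 code. The buffer size (BUFFER_BYTES), the choice of ISO-8859-1 as the "wrong" encoding, and both helper names are assumptions picked only to show the shape of the problem: output is decoded correctly up to the first flush and incorrectly afterwards, so short documents survive and long ones do not.

```ruby
# Illustrative simulation only: this is NOT Nokogiri or libxml2 code.
# BUFFER_BYTES and both helper methods are hypothetical.
BUFFER_BYTES = 4000

# Split raw bytes into buffer-sized chunks, as a buffered writer might.
def chunks_of(raw)
  raw.bytes.each_slice(BUFFER_BYTES).map { |slice| slice.pack('C*') }
end

# Decode the first chunk with the proper encoding, then (wrongly) decode
# every later chunk as ISO-8859-1, mimicking "right encoding until the
# first flush, wrong encoding afterwards".
def buggy_decode(utf16le_bytes)
  chunks_of(utf16le_bytes).each_with_index.map do |chunk, index|
    encoding = index.zero? ? 'UTF-16LE' : 'ISO-8859-1'
    chunk.force_encoding(encoding).encode('UTF-8')
  end.join
end

short_text = 'a' * 100   # fits in one buffer
long_text  = 'a' * 3000  # 6000 bytes in UTF-16LE, so it spans two buffers

puts buggy_decode(short_text.encode('UTF-16LE')) == short_text
puts buggy_decode(long_text.encode('UTF-16LE')) == long_text
```

In this sketch the short input round-trips and the long one does not, because everything past the first chunk is decoded byte by byte instead of as two-byte UTF-16 units.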
See https://github.com/sparklemotion/nokogiri/pull/2434/commits/2e260f53e6b84b8f9c1b115b0ded85eebc8155d7 for the fix, and #2434 for the PR.
Fixed by #2434; the fix will be in the next minor release of Nokogiri (v1.14.0).
Also see related #2447
For more details see http://stackoverflow.com/q/12162548/405017
Given a file on disk with UTF-16LE encoding and the contents:
The output of reading in this file and calling to_xml is broken:
<Bar>
CDATA, the output is fixed.
I can query and serialize elements that are munged in the output just fine:
If I remove the XML declaration from the input before parsing the document, the output is fixed:
Nokogiri 1.5.5 on Ruby 1.9.3p194 (2012-04-20) [i386-mingw32] on Windows 7