sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

Output of #to_xml munged beyond certain file size using UTF-16 declaration #752

Phrogz closed 2 years ago

Phrogz commented 12 years ago

For more details see http://stackoverflow.com/q/12162548/405017

Given a file on disk with UTF-16LE encoding and the contents:

<?xml version="1.0" encoding="UTF-16" ?>
<Foo>
  <Bar><![CDATA[ (...3906 characters...) ]]></Bar>
  <Jim>Oh! Hello there.</Jim>
</Foo>

Reading this file and calling to_xml produces broken output. Everything after "Oh! Hello there." comes back as garbage instead of the closing tags:

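require 'nokogiri'
# Read the file in binary mode, tagging the resulting string as UTF-16.
xml = File.open('Simplified.xml','rb:utf-16',&:read)
# Parse, discarding whitespace-only text nodes.
doc1 = Nokogiri.XML(xml,&:noblanks)
# Serialize in the document's declared encoding, then transcode for display.
xml1 = doc1.to_xml.encode('utf-8')
p xml1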
#=> "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<Foo>\n  <Bar><![CDATA[ ... ]]></Bar>\n  <Jim>Oh! Hello there.\uFFFE\u3C00\u0000\u2F00\u0000\u4A00\u0000\u6900\u0000\u6D00\u0000\u3E00\u0000\u0A00\u0000\u3C00\u0000\u2F00\u0000\u4600\u0000\u6F00\u0000\u6F00\u0000\u3E00\u0000\u0A00\u0000"

Nokogiri 1.5.5 on Ruby 1.9.3p194 (2012-04-20) [i386-mingw32] on Windows 7
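
The garbage at the end of that output is not random. A minimal sketch, using only the Ruby standard library and made-up variable names, shows that swapping the two bytes of each escaped code unit and dropping the U+FFFE and U+0000 units recovers the missing closing tags. It only decodes the output quoted above and says nothing about how Nokogiri produced it.

```ruby
# Code units copied from the munged tail of the output above.
munged_units = [
  0xFFFE,
  0x3C00, 0x0000, 0x2F00, 0x0000, 0x4A00, 0x0000, 0x6900, 0x0000,
  0x6D00, 0x0000, 0x3E00, 0x0000, 0x0A00, 0x0000,
  0x3C00, 0x0000, 0x2F00, 0x0000, 0x4600, 0x0000, 0x6F00, 0x0000,
  0x6F00, 0x0000, 0x3E00, 0x0000, 0x0A00, 0x0000
]

recovered = munged_units
  .reject { |unit| unit == 0xFFFE || unit.zero? }     # drop the byte-swapped BOM and the NUL padding
  .map { |unit| ((unit & 0xFF) << 8) | (unit >> 8) }  # swap the high and low bytes of each unit
  .pack('U*')                                         # build a UTF-8 string from the code points

p recovered
#=> "</Jim>\n</Foo>\n"
```

So the data is all still present; the tail has simply been written with the wrong byte order and extra padding.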

Phrogz commented 12 years ago

An alternative fix/workaround comes from the Stack Overflow question. Instead of:

xml1 = doc1.to_xml.encode('utf-8')

...use:

xml1 = doc1.to_xml(encoding:'utf-8')

This produces non-munged output.

flavorjones commented 2 years ago

@Phrogz, thanks for opening this issue and apologies for the embarrassingly long time it's taken to respond.

This is likely a libxml2 parsing bug. It feels similar in nature to these:

and I'll try to fix and send a PR upstream ... might need a few days.

flavorjones commented 2 years ago

Phew, this was a tricky one to figure out, but it turns out that Nokogiri wasn't using the proper encoding after libxml2 flushed its internal buffer for the first time. As long as a UTF-16 document was longer than ~4000 code points, this bug would be triggered.

See https://github.com/sparklemotion/nokogiri/pull/2434/commits/2e260f53e6b84b8f9c1b115b0ded85eebc8155d7 for the fix, and #2434 for the PR.
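
The shape of the garbage in the original report is consistent with a chunk of already-encoded UTF-16LE output being transcoded a second time. The following is a toy model in plain Ruby, not Nokogiri's or libxml2's actual code. It assumes, hypothetically, that the bytes of the tail chunk were treated as one-byte ISO-8859-1 characters and encoded to UTF-16 again, then read back as little-endian code units. The choice of ISO-8859-1 and all variable names are assumptions made for the sketch.

```ruby
# Toy model only: the mislabeling step below is an assumption for illustration,
# not a description of Nokogiri's internals.
tail = "</Jim>\n</Foo>\n"

# The bytes a serializer would emit for the tail of a UTF-16LE document.
utf16le_chunk = tail.encode('UTF-16LE')

# Hypothetical mishap: treat each byte as a separate character and encode to
# UTF-16 again. Ruby's "UTF-16" encoder prepends a big-endian BOM (FE FF).
reencoded = utf16le_chunk.dup.force_encoding('ISO-8859-1').encode('UTF-16')

# Read the result back as little-endian 16-bit code units, as a consumer of
# the surrounding UTF-16LE document would.
units = reencoded.unpack('v*')

# Print the units in the same \uXXXX style as the original report.
puts units.map { |unit| format('\\u%04X', unit) }.join
```

Under these assumptions the printed units start with \uFFFE\u3C00\u0000\u2F00\u0000, the same pattern as the munged tail in the report.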

flavorjones commented 2 years ago

Fixed by #2434. The fix will be in the next minor release of Nokogiri (v1.14.0).

flavorjones commented 2 years ago

Also see related #2447