sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

[bug] Invalid handling of ISO-8859-1 encoded documents on Nokogiri >= 1.16.0 #3310

KirtashW17 closed this issue 2 weeks ago

KirtashW17 commented 2 weeks ago

Please describe the bug

With Nokogiri >= 1.16.0 I noticed strange behavior when handling ISO-8859-1 XML documents: the content is treated as UTF-8, and invalid characters are replaced with valid UTF-8 bytes, so there is no way to recover the original content.

Help us reproduce what you're seeing

I am attaching a simple ISO-8859-1 XML example (packed in a ZIP, since GitHub doesn't allow XML attachments): test.xml.zip

payload = File.read('test.xml')
=> "<?xml version='1.0' encoding='iso-8859-1'?>\n<root>\n  <foo>\xE1\xE9\xED\xF3\xFA \xE0\xE8\xEC\xF2\xF9 \xE4\xEB\xEF\xF6\xFC</foo>\n</root>"
payload.valid_encoding?
=> false
payload.force_encoding('ISO-8859-1').valid_encoding?
=> true
payload.force_encoding('ISO-8859-1').encode('UTF-8')
=> "<?xml version='1.0' encoding='iso-8859-1'?>\n<root>\n  <foo>áéíóú àèìòù äëïöü</foo>\n</root>"
Nokogiri.XML(payload).to_xml
=> "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>\n  <foo>\xE1\xE9\xED\xF3\xFA \xE0\xE8\xEC\xF2\xF9 \xE4\xEB\xEF\xF6\xFC</foo>\n</root>\n"
Nokogiri.XML(payload).to_xml.valid_encoding?
=> true
Nokogiri.XML(payload, nil, 'ISO-8859-1').to_xml
=> "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<root>\n  <foo>\xE1\xE9\xED\xF3\xFA \xE0\xE8\xEC\xF2\xF9 \xE4\xEB\xEF\xF6\xFC</foo>\n</root>\n"

Expected behavior

I expect to get the original content of the XML file. The XML content should be interpreted as ISO-8859-1 and then converted to UTF-8, or I should at least get a UTF-8 string with invalid bytes that I can later reinterpret as ISO-8859-1.
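For reference, the "interpret as ISO-8859-1, then convert to UTF-8" step can be sketched in plain Ruby, without Nokogiri. This is only an illustration of the string handling involved; the helper name `latin1_bytes_to_utf8` is made up for this example and is not part of any library.

```ruby
# Hypothetical helper (name is an assumption, not a library API):
# relabel raw bytes as ISO-8859-1, then transcode them to UTF-8.
def latin1_bytes_to_utf8(raw)
  # dup first: force_encoding changes the encoding tag of the receiver in place
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "áéí" as ISO-8859-1 bytes. A string literal (like File.read) tags them as
# UTF-8 by default, which is why valid_encoding? is false.
raw = "\xE1\xE9\xED"
puts raw.valid_encoding?   # => false

utf8 = latin1_bytes_to_utf8(raw)
puts utf8.encoding         # => UTF-8
puts utf8                  # => áéí
```

Note that `force_encoding` only relabels the bytes and mutates the string it is called on, while `encode` returns a new, transcoded string. In the transcript above, `payload.force_encoding('ISO-8859-1')` therefore leaves `payload` tagged as ISO-8859-1 for the calls that follow it.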

Environment

Debian GNU/Linux 12 (bookworm)
1.16.0 <= Nokogiri version <= 1.16.7
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
Linux Kernel 6.1.55+
# Nokogiri (1.16.7)
    ---
    warnings: []
    nokogiri:
      version: 1.16.7
      cppflags:
      - "-I/home/tscalise/git/ws/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.16.7-x86_64-linux/ext/nokogiri"
      - "-I/home/tscalise/git/ws/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.16.7-x86_64-linux/ext/nokogiri/include"
      - "-I/home/tscalise/git/ws/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.16.7-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.2
      platform: x86_64-linux-gnu
      gem_platform: x86_64-linux
      description: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.12.9
      loaded: 2.12.9
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.39
      loaded: 1.1.39
    other_libraries:
      zlib: 1.3.1
      libgumbo: 1.0.0-nokogiri
KirtashW17 commented 2 weeks ago

My bad. I didn't expect to get an ISO-8859-1 string back. I understand now that Nokogiri returns strings in the encoding of the XML document (or the one explicitly passed as the third positional parameter), and that it will replace all characters that are invalid for the given encoding.