sparklemotion / mechanize

Mechanize is a ruby library that makes automated web interaction easy.
https://www.rubydoc.info/gems/mechanize/
MIT License

libxml2 2.11.x emitting error "FATAL: input conversion failed due to input error" on encoding errors #613

Closed by flavorjones 1 year ago

flavorjones commented 1 year ago

Nokogiri v1.15.x updated libxml2 to 2.11.x. The test suite is now failing on some bad-encoding-related tests:

  1) Error:
TestMechanizePageLink#test_encoding_charset_after_title_bad:
Nokogiri::XML::SyntaxError: Parser without recover option encountered error or warning: FATAL: input conversion failed due to input error, bytes 0x86 0xE3 0x82 0xB9
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4/document.rb:209:in `read_memory'
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4/document.rb:209:in `parse'
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4.rb:24:in `parse'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:121:in `block in parser'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:120:in `reverse_each'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:120:in `parser'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:100:in `encoding_error?'
    /home/flavorjones/code/oss/mechanize/test/test_mechanize_page_link.rb:116:in `test_encoding_charset_after_title_bad'

  2) Error:
TestMechanizePageLink#test_encoding_charset_bad:
Nokogiri::XML::SyntaxError: Parser without recover option encountered error or warning: FATAL: input conversion failed due to input error, bytes 0x86 0xE3 0x82 0xB9
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4/document.rb:209:in `read_memory'
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4/document.rb:209:in `parse'
    /home/flavorjones/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/nokogiri-1.15.1-x86_64-linux/lib/nokogiri/html4.rb:24:in `parse'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:121:in `block in parser'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:120:in `reverse_each'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:120:in `parser'
    /home/flavorjones/code/oss/mechanize/lib/mechanize/page.rb:100:in `encoding_error?'
    /home/flavorjones/code/oss/mechanize/test/test_mechanize_page_link.rb:140:in `test_encoding_charset_bad'

Need to investigate.

flavorjones commented 1 year ago

Here's the test case:

#! /usr/bin/env ruby
# mechanize #613

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

puts Nokogiri::VERSION_INFO

UTF8_TITLE = 'テスト'
UTF8 = <<-HTML
<title>#{UTF8_TITLE}</title>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
HTML

doc = Nokogiri::HTML4.parse(UTF8.force_encoding(Encoding::BINARY))
pp doc
pp doc.to_html
pp doc.errors

In the success case the document parses; in the failure case an exception is raised because htmlReadMemory returns NULL.
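For anyone wondering where the bytes in the error message come from, here is a minimal sketch using only Ruby's own transcoder (not libxml2 or iconv, so it is an analogy rather than the actual code path). It assumes the title from the test case above and checks that the reported bytes 0x86 0xE3 0x82 0xB9 sit inside the title's UTF-8 encoding, and that the same bytes fail to convert when treated as Shift_JIS, which is what the meta tag declares. The variable names are illustrative only.

```ruby
# The title from the test case, written with escapes so the snippet does not
# depend on the source file's encoding.
title = "\u30C6\u30B9\u30C8"
bytes = title.bytes # UTF-8 bytes of the title

# The four bytes quoted in the libxml2 error message.
reported = [0x86, 0xE3, 0x82, 0xB9]
offset = bytes.each_cons(reported.size).to_a.index(reported)

# Reinterpret the same bytes as Shift_JIS, as the <meta> tag declares,
# then try to convert them to UTF-8 with Ruby's transcoder.
as_sjis = title.b.force_encoding(Encoding::Shift_JIS)
conversion_error =
  begin
    as_sjis.encode(Encoding::UTF_8)
    nil
  rescue EncodingError => e
    e.class
  end

puts bytes.map { |b| format("0x%02X", b) }.join(" ")
puts "reported bytes found at offset #{offset.inspect}"
puts "Shift_JIS -> UTF-8 conversion raised: #{conversion_error.inspect}"
```

The first two bytes of the title happen to form a convertible Shift_JIS pair, so the conversion only trips on the third byte, which is consistent with where the libxml2 message starts.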

Git bisect says:

b167c7314497b6cb0d9a587a31874ae0d273ffaa is the first bad commit
commit b167c7314497b6cb0d9a587a31874ae0d273ffaa
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Tue Mar 14 14:42:36 2023 +0100

    parser: Fix short-lived regression causing infinite loops

    Fix 3eb6bf03. We really have to halt the parser, so the input buffer
    gets reset.

 include/private/parser.h |  2 ++
 parser.c                 | 37 ------------------------------------
 parserInternals.c        | 49 +++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 42 insertions(+), 46 deletions(-)

flavorjones commented 1 year ago

Bug report filed upstream here: https://gitlab.gnome.org/GNOME/libxml2/-/issues/543