rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0
186 stars 114 forks source link

Crash on BOM signature since v2.0.3 #157

Closed rafbm closed 3 years ago

rafbm commented 3 years ago

Given the following test file:

# encoding: utf-8
require 'nokogumbo'
require 'minitest/autorun'

class TestNokogumbo < Minitest::Test
  def test_bom
    html = "\xEF\xBB\xBF"
    p Nokogiri::HTML5.fragment(html).to_html

    html = "\xEF\xBB\xBFb"
    p Nokogiri::HTML5.fragment(html).to_html

    html = "\xEF\xBB\xBF<b"
    p Nokogiri::HTML5.fragment(html).to_html

    html = "\xEF\xBB\xBF<b>"
    p Nokogiri::HTML5.fragment(html).to_html
  end
end

The last call to .fragment crashes:

""
"b"
""
Assertion failed: (output->original_text.data[0] == '<'), function emit_current_tag, file ../../../../ext/nokogumbo/../../gumbo-parser/src/tokenizer.c, line 495.
stevecheckoway commented 3 years ago

Well that's annoying! I'll take a look!

stevecheckoway commented 3 years ago

The fix turned out to be quite simple, but tracking it down was harder than it should have been. Everything looked correct in the debugger before I realized that lldb wasn't printing the BOM when printing the strings.

Thanks for providing the test case! I adapted it for testing the underlying gumbo parser as well as testing in Ruby.

rafbm commented 3 years ago

Thanks so much for acting on this so quickly! Are you planning on releasing a gem update? These C-level crashes are so bad… I’m nervous for anyone who would update to 2.0.3 at the moment. 😓

stevecheckoway commented 3 years ago

2.0.2 misbehaved in the presence of a UTF-8 BOM as well (although it did it silently, without any assertion failures).

The work around for both versions is quite simple.

irb(main):001:0> require 'nokogumbo'
=> true
irb(main):002:0> Nokogumbo::VERSION
=> "2.0.3"
irb(main):003:0> html = "\uFEFF<b>BOM</b>"
=> "<b>BOM</b>"
irb(main):004:0> Nokogiri::HTML5.fragment(html.delete_prefix("\uFEFF"))
=> #<Nokogiri::HTML5::DocumentFragment:0x438 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x424 name="b" children=[#<Nokogiri::XML::Text:0x410 "BOM">]>]>

But yes, I'll release a new update soon. I want to take a look at another issue that arose first.

stevecheckoway commented 3 years ago

Once the CI tests pass for https://github.com/rubys/nokogumbo/pull/161, I'll release a new gem.