rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0

Doesn’t handle Unicode signature/byte-order-mark #133

Closed: da2x closed this issue 4 years ago

da2x commented 4 years ago

Parsing a document that begins with a Unicode signature/byte-order mark (BOM) returns the following unexpected errors:

1:1: ERROR: Expected a doctype token
<!DOCTYPE html>
^
1:2: ERROR: This is not a legal doctype
<!DOCTYPE html>
 ^

Expected behaviour: check the first bytes of the document for a BOM byte sequence. If one is found, set the document encoding to the encoding it indicates (e.g. UTF-8 or UTF-16 LE), strip the BOM sequence, and proceed with parsing the document as normal.

https://encoding.spec.whatwg.org/#decode
https://html.spec.whatwg.org/#writing
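
The expected behaviour described above can be sketched in plain Ruby. This is only an illustration of the detect/set-encoding/strip steps, not nokogumbo's actual fix: `BOMS` and `strip_bom` are hypothetical names introduced here, and the sketch assumes only the three BOMs mentioned in this report (UTF-8, UTF-16 BE, UTF-16 LE) need to be recognised.

```ruby
# Hypothetical helper, not part of nokogumbo or Nokogiri.
# Maps each BOM byte sequence to the encoding it indicates.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
  "\xFF\xFE".b     => Encoding::UTF_16LE
}.freeze

# Returns [document, encoding]. If the input starts with a known BOM, the
# BOM is stripped and the rest is tagged with the indicated encoding.
# Otherwise the input is returned unchanged with a nil encoding.
def strip_bom(input)
  bytes = input.b # binary copy, so comparisons are byte-wise
  BOMS.each do |bom, encoding|
    next unless bytes.start_with?(bom)

    rest = bytes.byteslice(bom.bytesize..-1)
    return [rest.force_encoding(encoding), encoding]
  end
  [input, nil]
end

# Usage: strip the signature, then transcode to UTF-8 before parsing.
html, encoding = strip_bom("\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>")
html = html.encode(Encoding::UTF_8) if encoding
puts html
```

With the signature removed, the resulting string could be passed to `Nokogiri::HTML5.parse` as in the test cases below. A real fix would more likely live inside the parser, so that callers do not need a pre-processing step.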

Some test cases:

UTF-8 signature mark:

require 'nokogumbo'

# The input starts with the UTF-8 signature bytes EF BB BF.
Nokogiri::HTML5.parse(
  "\xEF\xBB\xBF<!DOCTYPE html>\n<html></html>".
  force_encoding('UTF-8'),
  max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (BE) byte-order-mark:

# The input starts with the UTF-16BE byte-order mark FE FF.
Nokogiri::HTML5.parse(
    "\xFE\xFF".force_encoding('UTF-16BE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16BE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

UTF-16 (LE) byte-order-mark:

# The input starts with the UTF-16LE byte-order mark FF FE.
Nokogiri::HTML5.parse(
    "\xFF\xFE".force_encoding('UTF-16LE') +
    "<!DOCTYPE html>\n<html></html>".
    encode('UTF-16LE', 'UTF-8'),
    max_errors: 10).
errors.each { |err| puts(err) }

stevecheckoway commented 4 years ago

Thank you for the bug report. I've got a fix that should land soon (assuming all the tests pass).