explore: removing or overhauling the EncodingReader

flavorjones commented 2 years ago

The Nokogiri::HTML4::EncodingReader class tries to detect the encoding of HTML4 documents whose encoding is ambiguous.

Recently, a ReDoS vulnerability was found in this code. The other regular expressions in the class should be vetted, and we should explore replacing some of them with simpler calls like String#include?.
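
To illustrate the kind of replacement meant here, below is a minimal sketch, not Nokogiri's actual code. The helper name `mentions_charset?` and the `"charset="` marker are assumptions made up for this example. It shows a plain substring check doing the job a case-insensitive regex might otherwise do, with no backtracking involved.

```ruby
# Hypothetical helper for illustration only; not part of Nokogiri.
# Answers "does this chunk contain a charset declaration marker?"
# without using a regular expression.
def mentions_charset?(chunk)
  # String#b returns a binary (ASCII-8BIT) copy, so the check does not
  # depend on whether the input is valid in its declared encoding.
  # String#downcase on a binary string folds only ASCII letters.
  chunk.b.downcase.include?("charset=")
end

puts mentions_charset?('<meta charset="utf-8">')   # marker present
puts mentions_charset?("<META CHARSET=UTF-8>")     # case is folded first
puts mentions_charset?("<p>hello</p>")             # no marker
```

String#include? runs a straightforward substring search, so its cost does not blow up on adversarial input the way a backtracking regex can.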

This class was written at a time (Ruby 1.9) when Ruby strings were encoded as ASCII-8BIT by default. That hasn't been true since (I think) Ruby 2.0, so this complexity may exist only for an edge case that we no longer need to support. If so, maybe we can remove the entire class, simplifying both the CRuby and JRuby implementations.

flavorjones commented 1 year ago

Perhaps more specifically: let's consider unifying the encoding detection algorithm from the HTML5 parser and the HTML4 parser.

flavorjones commented 1 year ago

@stevecheckoway notes that the HTML5 encoding detection is incomplete with respect to the WHATWG "prescan a byte stream to determine its encoding" algorithm: https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
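
For readers unfamiliar with the idea, here is a deliberately simplified sketch of byte-level encoding sniffing in Ruby. It is not the WHATWG prescan algorithm and not what either Nokogiri parser does. The names `sniff_encoding` and `PRESCAN_LIMIT` and the 1024-byte window are assumptions for this example. It checks for a byte order mark, then looks at only the first `charset` occurrence, and it skips the tag and attribute parsing that the spec requires.

```ruby
# Simplified illustration only; NOT the full WHATWG prescan algorithm.
PRESCAN_LIMIT = 1024 # assumed window size for this sketch

# Returns an upcased encoding label found in the input, or nil.
def sniff_encoding(bytes)
  head = bytes.b[0, PRESCAN_LIMIT] # work on raw bytes only

  # A byte order mark, if present, decides the encoding.
  return "UTF-8"    if head.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16BE" if head.start_with?("\xFE\xFF".b)
  return "UTF-16LE" if head.start_with?("\xFF\xFE".b)

  # Naive scan: only the first "charset" occurrence is considered.
  lower = head.downcase
  pos = lower.index("charset")
  return nil unless pos

  rest = lower[(pos + "charset".length)..].lstrip
  return nil unless rest.start_with?("=")

  value = rest[1..].lstrip
  quote = value[0]
  if quote == '"' || quote == "'"
    # Quoted value: read up to the matching closing quote.
    closing = value.index(quote, 1)
    return nil unless closing
    name = value[1...closing]
  else
    # Unquoted value: read up to whitespace, ';', '>' or a quote.
    name = value[/\A[^\s;>"']+/]
  end

  name.nil? || name.empty? ? nil : name.upcase
end

puts sniff_encoding('<meta charset="ISO-8859-1">').inspect
puts sniff_encoding("<p>no declaration</p>").inspect
```

A shared implementation for the HTML4 and HTML5 parsers would need the full set of prescan steps from the spec; this sketch only shows the general shape of such a routine.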

sparklemotion / nokogiri

explore: removing or overhauling the EncodingReader #2513