Open flavorjones opened 2 years ago
Perhaps more specifically: let's consider unifying the encoding detection algorithm from the HTML5 parser and the HTML4 parser.
@stevecheckoway notes that the HTML5 encoding detection is incomplete with respect to https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
The Nokogiri::HTML4::EncodingReader class is used to try to detect encoding of HTML4 documents when they have ambiguous encoding.
Recently, a REDOS vulnerability was found in this code. There are other regular expressions which should be vetted; and we should explore replacing some of those regexes with simpler calls like
String#include?
.This class was written during a time (Ruby 1.9) when Ruby strings were encoded as ASCII-8BIT by default. This hasn't been true since (I think) Ruby 2.0, and so this complexity may only be for an edge case that we no longer need to support; and so maybe we can remove the entire class thereby simplifying both CRuby and JRuby implementations.