sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

[help] Nokogiri stops parsing @ ASCII-8BIT \xA0 #2511

Closed aaronstillwell closed 2 years ago

aaronstillwell commented 2 years ago

What problem are you trying to solve?

I'm trying to parse an HTML document, encoded as ASCII-8BIT, that contains the \xA0 character. Nokogiri seems to quietly stop parsing the document when it encounters this character, as shown in the examples below.

Is this expected behaviour? If so, what is the recommended approach to parsing documents correctly encoded with this character?

Please show your code!

#! /usr/bin/env ruby

require 'nokogiri'

html = <<-EOF
<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n

<head>\r\n
  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=us-ascii\" />\r\n<title></title>\r\n
  <meta name=\"generator\" content=\"foo\" />\r\n
</head>\r\n

<body>
  <div style=\"text-align:left;font-family:Arial, Helvetica, sans-serif; font-size: 10pt;\">
    <div>foo</div>
    <div>bar</div>
    <div>foo</div>
    <div>\xA0</div>
  </div>\r\n
  <div style=\"text-align:left;font-family:Arial, Helvetica, sans-serif; font-size: 10pt;\">
    <div>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque eget aliquet dui. Integer consectetur
      ante et
      libero semper, et egestas leo lacinia. In pretium lorem nec varius accumsan. Maecenas ornare bibendum tempus. Ut
      ultricies viverra tellus, et facilisis metus posuere eget. Fusce nec est magna. Curabitur ultricies turpis urna,
      sagittis efficitur eros aliquet vitae. Fusce commodo dictum turpis. Integer a ex laoreet, vehicula arcu ac,
      posuere
      eros. Aliquam ut quam fermentum, porttitor libero in, sagittis tellus. Donec vel ante turpis. Quisque scelerisque
      enim
      at enim ultricies porttitor.</div>
    <div>\xA0</div>
  </div>
  <div>\xA0</div>
  <div>\xA0</div>
</body>\r\n

</html>
EOF

doc = Nokogiri::HTML(html.force_encoding('ASCII-8BIT'))
pp doc
# #(Document:0x3fd3d380aa20 {
#   name = "document",
#   children = [
#     #(DTD:0x3fd3d38e3528 { name = "html" }),
#     #(Element:0x3fd3d38e3514 {
#       name = "html",
#       attributes = [
#         #(Attr:0x3fd3d38e26b4 {
#           name = "xmlns",
#           value = "http://www.w3.org/1999/xhtml"
#           })],
#       children = [
#         #(Text "\r\n" + "\n" + "\n"),
#         #(Element:0x3fd3d38e2650 {
#           name = "head",
#           children = [
#             #(Text "\r\n" + "\n" + "  "),
#             #(Element:0x3fd3d38ea56c {
#               name = "meta",
#               attributes = [
#                 #(Attr:0x3fd3d38ee9b4 {
#                   name = "http-equiv",
#                   value = "Content-Type"
#                   }),
#                 #(Attr:0x3fd3d38ee9a0 {
#                   name = "content",
#                   value = "text/html; charset=us-ascii"
#                   })]
#               }),
#             #(Text "\r\n"),
#             #(Element:0x3fd3d38ea4e0 { name = "title" }),
#             #(Text "\r\n" + "\n" + "  "),
#             #(Element:0x3fd3d38ea3dc {
#               name = "meta",
#               attributes = [
#                 #(Attr:0x3fd3d38f7460 { name = "name", value = "generator" }),
#                 #(Attr:0x3fd3d38f744c { name = "content", value = "foo" })]
#               }),
#             #(Text "\r\n" + "\n")]
#           }),
#         #(Text "\r\n" + "\n" + "\n"),
#         #(Element:0x3fd3d38e2614 {
#           name = "body",
#           children = [
#             #(Text "\n" + "  "),
#             #(Element:0x3fd3d38faed0 {
#               name = "div",
#               attributes = [
#                 #(Attr:0x3fd3d38fa278 {
#                   name = "style",
#                   value = "text-align:left;font-family:Arial, Helvetica, sans-serif; font-size: 10pt;"
#                   })],
#               children = [
#                 #(Text "\n" + "    "),
#                 #(Element:0x3fd3d38fa228 {
#                   name = "div",
#                   children = [ #(Text "foo")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3fd3d38fa200 {
#                   name = "div",
#                   children = [ #(Text "bar")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3fd3d38fa1d8 {
#                   name = "div",
#                   children = [ #(Text "foo")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3fd3d38fa1b0 { name = "div" })]
#               })]
#           })]
#       })]
#   })

doc = Nokogiri::HTML(html.force_encoding('UTF-8'))
pp doc
# #(Document:0x3ffa56092b2c {
#   name = "document",
#   children = [
#     #(DTD:0x3ffa56092460 { name = "html" }),
#     #(Element:0x3ffa5609244c {
#       name = "html",
#       attributes = [
#         #(Attr:0x3ffa5741733c {
#           name = "xmlns",
#           value = "http://www.w3.org/1999/xhtml"
#           })],
#       children = [
#         #(Text "\r\n" + "\n" + "\n"),
#         #(Element:0x3ffa574172c4 {
#           name = "head",
#           children = [
#             #(Text "\r\n" + "\n" + "  "),
#             #(Element:0x3ffa5741f7a8 {
#               name = "meta",
#               attributes = [
#                 #(Attr:0x3ffa5741e664 {
#                   name = "http-equiv",
#                   value = "Content-Type"
#                   }),
#                 #(Attr:0x3ffa5741e600 {
#                   name = "content",
#                   value = "text/html; charset=us-ascii"
#                   })]
#               }),
#             #(Text "\r\n"),
#             #(Element:0x3ffa5741f76c { name = "title" }),
#             #(Text "\r\n" + "\n" + "  "),
#             #(Element:0x3ffa5741f744 {
#               name = "meta",
#               attributes = [
#                 #(Attr:0x3ffa5609b63c { name = "name", value = "generator" }),
#                 #(Attr:0x3ffa5609b628 { name = "content", value = "foo" })]
#               }),
#             #(Text "\r\n" + "\n")]
#           }),
#         #(Text "\r\n" + "\n" + "\n"),
#         #(Element:0x3ffa57417288 {
#           name = "body",
#           children = [
#             #(Text "\n" + "  "),
#             #(Element:0x3ffa5609ebe8 {
#               name = "div",
#               attributes = [
#                 #(Attr:0x3ffa560a34b8 {
#                   name = "style",
#                   value = "text-align:left;font-family:Arial, Helvetica, sans-serif; font-size: 10pt;"
#                   })],
#               children = [
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560a3468 {
#                   name = "div",
#                   children = [ #(Text "foo")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560a3418 {
#                   name = "div",
#                   children = [ #(Text "bar")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560a33f0 {
#                   name = "div",
#                   children = [ #(Text "foo")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560a33c8 {
#                   name = "div",
#                   children = [ #(Text "\xA0")]
#                   }),
#                 #(Text "\n" + "  ")]
#               }),
#             #(Text "\r\n" + "\n" + "  "),
#             #(Element:0x3ffa5609ebac {
#               name = "div",
#               attributes = [
#                 #(Attr:0x3ffa560b34f8 {
#                   name = "style",
#                   value = "text-align:left;font-family:Arial, Helvetica, sans-serif; font-size: 10pt;"
#                   })],
#               children = [
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560b3430 {
#                   name = "div",
#                   children = [
#                     #(Text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque eget aliquet dui. Integer consectetur\n" +
#                       "      ante et\n" +
#                       "      libero semper, et egestas leo lacinia. In pretium lorem nec varius accumsan. Maecenas ornare bibendum tempus. Ut\n" +
#                       "      ultricies viverra tellus, et facilisis metus posuere eget. Fusce nec est magna. Curabitur ultricies turpis urna,\n" +
#                       "      sagittis efficitur eros aliquet vitae. Fusce commodo dictum turpis. Integer a ex laoreet, vehicula arcu ac,\n" +
#                       "      posuere\n" +
#                       "      eros. Aliquam ut quam fermentum, porttitor libero in, sagittis tellus. Donec vel ante turpis. Quisque scelerisque\n" +
#                       "      enim\n" +
#                       "      at enim ultricies porttitor.")]
#                   }),
#                 #(Text "\n" + "    "),
#                 #(Element:0x3ffa560b33cc {
#                   name = "div",
#                   children = [ #(Text "\xA0")]
#                   }),
#                 #(Text "\n" + "  ")]
#               }),
#             #(Text "\n" + "  "),
#             #(Element:0x3ffa5609eb84 {
#               name = "div",
#               children = [ #(Text "\xA0")]
#               }),
#             #(Text "\n" + "  "),
#             #(Element:0x3ffa5609eb48 {
#               name = "div",
#               children = [ #(Text "\xA0")]
#               }),
#             #(Text "\n")]
#           }),
#         #(Text "\r\n" + "\n" + "\n")]
#       })]
#   })

Environment

# Nokogiri (1.13.3)
    ---
    warnings: []
    nokogiri:
      version: 1.13.3
      cppflags:
      - "-I/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/ext/nokogiri"
      - "-I/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/ext/nokogiri/include"
      - "-I/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.1
      platform: x86_64-linux
      gem_platform: x86_64-linux
      description: ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-linux]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - '0008-htmlParseComment-handle-abruptly-closed-comments.patch'
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-Revert-Different-approach-to-fix-quadratic-behavior.patch
      libxml2_path: "/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.13
      loaded: 2.9.13
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.35
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.11
      libgumbo: 1.0.0-nokogiri

Additional context

The way the HTML is presented in my use case is not within my control.

flavorjones commented 2 years ago

@aaronstillwell Hi! Thanks for asking this question, and sorry you're having trouble. I'll try to help.

First, it's important to note that the document encoding may be different from the string encoding. We see this here, where \xA0 is neither ASCII (which confuses libxml2 into terminating parsing) nor valid UTF-8 (which generates warnings for us).

The right thing to do here is to tell Nokogiri and libxml2 what the document's encoding is. To be fair, this document is encoded incorrectly -- I believe the author means to include a non-breaking space, but has done so in an incorrect way, likely trying to inject unicode bytes. The portable way to communicate a non-breaking space in an HTML document is to use an entity, one of &nbsp;, &#xA0;, or &#160;.

Anyway. If you want to avoid parsing errors or string encoding errors, try this:

doc = Nokogiri::HTML(html, nil, "ISO-8859-1")

which tells Nokogiri to treat the document as if it were in Latin-1 encoding, which allows the full 256 values of an 8-bit character (ASCII only allows the lower 128 values).
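
As a rough illustration of the encoding point above, using only Ruby's standard String API (no Nokogiri involved), the single byte 0xA0 is invalid when the string is labelled US-ASCII or UTF-8 but perfectly valid as ISO-8859-1. The names `RAW_HTML` and `relabel` are made up for this sketch.

```ruby
# Bytes as they appear in the report: a binary (ASCII-8BIT) string containing 0xA0.
RAW_HTML = "<div>\xA0</div>".b.freeze

# Hypothetical helper: relabel the same bytes with another encoding (no bytes change).
def relabel(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

puts relabel(RAW_HTML, Encoding::US_ASCII).valid_encoding?    # false: 0xA0 is above 0x7F
puts relabel(RAW_HTML, Encoding::UTF_8).valid_encoding?       # false: a lone continuation byte
puts relabel(RAW_HTML, Encoding::ISO_8859_1).valid_encoding?  # true: every byte value is defined

# Transcoding from Latin-1 turns 0xA0 into U+00A0 (NO-BREAK SPACE).
utf8 = relabel(RAW_HTML, Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts utf8 == "<div>\u00A0</div>"                              # true
```

This only shows how Ruby itself judges the bytes; how libxml2 reacts to them is a separate matter.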

Does this all make sense? Happy to answer questions, encoding can be challenging particularly when things are going wrong.

stevecheckoway commented 2 years ago

I'm not sure what the correct behavior for XHTML is since it depends on the rules for XML which I don't know in great detail.

For HTML, I think the document is correct. The us-ascii charset is mapped to Windows 1252 which uses \xA0 for NBSP. But I don't know how the string encoding interacts with this.

flavorjones commented 2 years ago

Steve, you're right, I think, but there are some interesting behaviors interacting here.

Looking at libxml2's encoding.c, it's not obvious to me why it doesn't handle this character. But ...

@aaronstillwell So another approach is to move to the HTML5 parser which handles this document fine.

doc = Nokogiri::HTML5(html)

stevecheckoway commented 2 years ago

Another option may be to convert \xA0 to \u00A0 which (if I got my Ruby syntax correct) should give the correct Unicode character, but then the charset would need to be changed to utf-8. (Edit: Or I guess should be changed. Otherwise, you might have issues serializing the DOM and then reading it in a browser which will do the charset detection, decide on Windows 1252, and then fail to parse the multi-byte NBSP character.)
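
A minimal sketch of that byte substitution in plain Ruby follows. It assumes 0xA0 is the only non-ASCII byte in the input, which holds for the sample document but not for Windows-1252 or UTF-8 input in general. `nbsp_bytes_to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: swap the raw 0xA0 byte for the UTF-8 bytes of U+00A0 and
# update the declared charset to match. Assumes no other bytes above 0x7F.
def nbsp_bytes_to_utf8(bytes)
  bytes.b                                                # work on a binary copy
       .sub(/charset\s*=\s*us-ascii/i, "charset=utf-8")  # first declaration only
       .gsub("\xA0".b, "\u00A0".b)                       # 0xA0 -> 0xC2 0xA0
       .force_encoding(Encoding::UTF_8)
end

input = "<meta content=\"text/html; charset=us-ascii\" /><div>\xA0</div>".b
fixed = nbsp_bytes_to_utf8(input)
puts fixed.valid_encoding?                               # true
puts fixed.include?("charset=utf-8")                     # true
```

On input that is already UTF-8 this substitution would corrupt any character whose encoding contains a 0xA0 byte, so it is only reasonable for documents known to be ASCII plus stray NBSP bytes.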

aaronstillwell commented 2 years ago

Thanks both for your input here.

@flavorjones is there any trade-off to using the HTML5 parser? I understand from the docs that this uses an alternative parser under the hood, however I assume given its recommendation here that despite the name, it can be used for all sorts of (X)HTML? If so that looks to me like the easiest fix for my use case.

Additionally, if I understand correctly that this is not expected behavior, and this stems from libxml, should I be raising an issue there?

aaronstillwell commented 2 years ago

@flavorjones just attempting your recommendation, but I see the following error:

/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/lib/nokogiri/html5.rb:389:in `encode': "\xA0" on US-ASCII (Encoding::InvalidByteSequenceError)

Is this expected?

stevecheckoway commented 2 years ago

@aaronstillwell One trade-off is that the modern HTML parser (the one not built using libxml2) does not support XHTML.

Ah, I think I see what the problem is (although my line numbers are a ways off).

          # change the encoding to match the detected or inferred encoding
          body = body.dup
          begin
            body.force_encoding(encoding)
          rescue ArgumentError
            body.force_encoding(Encoding::ISO_8859_1)
          end

I think it has detected the us-ascii charset and then tried to force encode the string using US-ASCII which is not what browsers do.
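
The failure can be reproduced without Nokogiri. In this sketch (where `transcode_to_utf8` and `BODY` are names made up for illustration), `force_encoding` with US-ASCII succeeds, so the `rescue ArgumentError` branch above never runs, and the error only surfaces at the later `encode` step. Labelling the bytes as Windows-1252 instead, as browsers do for the us-ascii label, transcodes cleanly.

```ruby
BODY = "<div>\xA0</div>".b.freeze

# Hypothetical helper: label the bytes with `encoding`, then transcode to UTF-8.
def transcode_to_utf8(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

begin
  transcode_to_utf8(BODY, Encoding::US_ASCII)
rescue Encoding::InvalidByteSequenceError => e
  # Same error class as in the report above.
  puts "US-ASCII failed: #{e.message}"
end

# Windows-1252 defines 0xA0 as NO-BREAK SPACE, so this succeeds.
puts transcode_to_utf8(BODY, Encoding::Windows_1252) == "<div>\u00A0</div>"  # true
```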

I think a better solution would be to map the encoding labels to their encodings and then use force_encoding() with that map. I actually wrote some code to do that; however, it's in Rust (included below). I can't imagine anyone is relying on the current behavior, but this would be a breaking change and I've been surprised before.

    pub fn lookup(label: &str) -> Option<Encoding> {
        match label.trim().to_ascii_lowercase().as_str() {
            "unicode-1-1-utf-8"
            | "unicode11utf8"
            | "unicode20utf8"
            | "utf-8"
            | "utf8"
            | "x-unicode20utf8" => Some(Encoding::UTF_8),
            "866"
            | "cp866"
            | "csibm866"
            | "ibm866" => Some(Encoding::IBM866),
            "csisolatin2"
            | "iso-8859-2"
            | "iso-ir-101"
            | "iso8859-2"
            | "iso88592"
            | "iso_8859-2"
            | "iso_8859-2:1987"
            | "l2"
            | "latin2" => Some(Encoding::ISO_8859_2),
            "csisolatin3"
            | "iso-8859-3"
            | "iso-ir-109"
            | "iso8859-3"
            | "iso88593"
            | "iso_8859-3"
            | "iso_8859-3:1988"
            | "l3"
            | "latin3" => Some(Encoding::ISO_8859_3),
            "csisolatin4"
            | "iso-8859-4"
            | "iso-ir-110"
            | "iso8859-4"
            | "iso88594"
            | "iso_8859-4"
            | "iso_8859-4:1988"
            | "l4"
            | "latin4" => Some(Encoding::ISO_8859_4),
            "csisolatincyrillic"
            | "cyrillic"
            | "iso-8859-5"
            | "iso-ir-144"
            | "iso8859-5"
            | "iso88595"
            | "iso_8859-5"
            | "iso_8859-5:1988" => Some(Encoding::ISO_8859_5),
            "arabic"
            | "asmo-708"
            | "csiso88596e"
            | "csiso88596i"
            | "csisolatinarabic"
            | "ecma-114"
            | "iso-8859-6"
            | "iso-8859-6-e"
            | "iso-8859-6-i"
            | "iso-ir-127"
            | "iso8859-6"
            | "iso88596"
            | "iso_8859-6"
            | "iso_8859-6:1987" => Some(Encoding::ISO_8859_6),
            "csisolatingreek"
            | "ecma-118"
            | "elot_928"
            | "greek"
            | "greek8"
            | "iso-8859-7"
            | "iso-ir-126"
            | "iso8859-7"
            | "iso88597"
            | "iso_8859-7"
            | "iso_8859-7:1987"
            | "sun_eu_greek" => Some(Encoding::ISO_8859_7),
            "csiso88598e"
            | "csisolatinhebrew"
            | "hebrew"
            | "iso-8859-8"
            | "iso-8859-8-e"
            | "iso-ir-138"
            | "iso8859-8"
            | "iso88598"
            | "iso_8859-8"
            | "iso_8859-8:1988"
            | "visual" => Some(Encoding::ISO_8859_8),
            "csiso88598i"
            | "iso-8859-8-i"
            | "logical" => Some(Encoding::ISO_8859_8_I),
            "csisolatin6"
            | "iso-8859-10"
            | "iso-ir-157"
            | "iso8859-10"
            | "iso885910"
            | "l6"
            | "latin6" => Some(Encoding::ISO_8859_10),
            "iso-8859-13"
            | "iso8859-13"
            | "iso885913" => Some(Encoding::ISO_8859_13),
            "iso-8859-14"
            | "iso8859-14"
            | "iso885914" => Some(Encoding::ISO_8859_14),
            "csisolatin9"
            | "iso-8859-15"
            | "iso8859-15"
            | "iso885915"
            | "iso_8859-15"
            | "l9" => Some(Encoding::ISO_8859_15),
            "iso-8859-16" => Some(Encoding::ISO_8859_16),
            "cskoi8r"
            | "koi"
            | "koi8"
            | "koi8-r"
            | "koi8_r" => Some(Encoding::KOI8_R),
            "koi8-ru"
            | "koi8-u" => Some(Encoding::KOI8_U),
            "csmacintosh"
            | "mac"
            | "macintosh"
            | "x-mac-roman" => Some(Encoding::Macintosh),
            "dos-874"
            | "iso-8859-11"
            | "iso8859-11"
            | "iso885911"
            | "tis-620"
            | "windows-874" => Some(Encoding::Windows_874),
            "cp1250"
            | "windows-1250"
            | "x-cp1250" => Some(Encoding::Windows_1250),
            "cp1251"
            | "windows-1251"
            | "x-cp1251" => Some(Encoding::Windows_1251),
            "ansi_x3.4-1968"
            | "ascii"
            | "cp1252"
            | "cp819"
            | "csisolatin1"
            | "ibm819"
            | "iso-8859-1"
            | "iso-ir-100"
            | "iso8859-1"
            | "iso88591"
            | "iso_8859-1"
            | "iso_8859-1:1987"
            | "l1"
            | "latin1"
            | "us-ascii"
            | "windows-1252"
            | "x-cp1252" => Some(Encoding::Windows_1252),
            "cp1253"
            | "windows-1253"
            | "x-cp1253" => Some(Encoding::Windows_1253),
            "cp1254"
            | "csisolatin5"
            | "iso-8859-9"
            | "iso-ir-148"
            | "iso8859-9"
            | "iso88599"
            | "iso_8859-9"
            | "iso_8859-9:1989"
            | "l5"
            | "latin5"
            | "windows-1254"
            | "x-cp1254" => Some(Encoding::Windows_1254),
            "cp1255"
            | "windows-1255"
            | "x-cp1255" => Some(Encoding::Windows_1255),
            "cp1256"
            | "windows-1256"
            | "x-cp1256" => Some(Encoding::Windows_1256),
            "cp1257"
            | "windows-1257"
            | "x-cp1257" => Some(Encoding::Windows_1257),
            "cp1258"
            | "windows-1258"
            | "x-cp1258" => Some(Encoding::Windows_1258),
            "x-mac-cyrillic"
            | "x-mac-ukrainian" => Some(Encoding::X_Mac_Cyrillic),
            "chinese"
            | "csgb2312"
            | "csiso58gb231280"
            | "gb2312"
            | "gb_2312"
            | "gb_2312-80"
            | "gbk"
            | "iso-ir-58"
            | "x-gbk" => Some(Encoding::GBK),
            "gb18030" => Some(Encoding::Gb18030),
            "big5"
            | "big5-hkscs"
            | "cn-big5"
            | "csbig5"
            | "x-x-big5" => Some(Encoding::Big5),
            "cseucpkdfmtjapanese"
            | "euc-jp"
            | "x-euc-jp" => Some(Encoding::EUC_JP),
            "csiso2022jp"
            | "iso-2022-jp" => Some(Encoding::ISO_2022_JP),
            "csshiftjis"
            | "ms932"
            | "ms_kanji"
            | "shift-jis"
            | "shift_jis"
            | "sjis"
            | "windows-31j"
            | "x-sjis" => Some(Encoding::Shift_JIS),
            "cseuckr"
            | "csksc56011987"
            | "euc-kr"
            | "iso-ir-149"
            | "korean"
            | "ks_c_5601-1987"
            | "ks_c_5601-1989"
            | "ksc5601"
            | "ksc_5601"
            | "windows-949" => Some(Encoding::EUC_KR),
            "csiso2022kr"
            | "hz-gb-2312"
            | "iso-2022-cn"
            | "iso-2022-cn-ext"
            | "iso-2022-kr"
            | "replacement" => Some(Encoding::Replacement),
            "unicodefffe"
            | "utf-16be" => Some(Encoding::UTF_16BE),
            "csunicode"
            | "iso-10646-ucs-2"
            | "ucs-2"
            | "unicode"
            | "unicodefeff"
            | "utf-16"
            | "utf-16le" => Some(Encoding::UTF_16LE),
            "x-user-defined" => Some(Encoding::X_User_Defined),
            _ => None,
        }
    }

flavorjones commented 2 years ago

@aaronstillwell Are you unblocked at this point? It seems like there are a few ways offered above to try to parse the document. But probably your best bet is to continue using the HTML4 parser and pass in a compatible encoding to the Document.parse() method call.

aaronstillwell commented 2 years ago

@flavorjones @stevecheckoway thanks both for your thorough follow-ups. One last question if I may before we close this out - is there an encoding map like the one @stevecheckoway just produced in Rust for use with Nokogiri?

I can hard-code support this way for ASCII-8BIT to use ISO-8859-1 but I imagine being able to ensure this is properly enforced for all encodings is the right way to ensure no other problems arise (I cannot control what documents may be sent my way 😄 )

aaronstillwell commented 2 years ago

Hey folks, just circling back here. Having been up and running with the hard-coded support for ASCII-8BIT to ISO-8859-1, I'm frequently running into issues where characters don't appear correctly in the transformed document, resulting in characters like Ä—...

Looking at a couple of problematic examples, I can see that the Ruby string is encoded using ASCII-8BIT, but the HTML document contains a meta tag suggesting something different, e.g.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> <meta charset="UTF-8">

Would I be right in concluding that, in my case:

I fully appreciate we might be getting a little bit further into the nuances of my own implementation rather than issues core to Nokogiri - happy to close this out upon clarifying some of the above.

edit: Or just force ISO-8859-1 when I find us-ascii encoded docs to circumvent my original issue, and leave everything else to default nokogiri? I may have overcomplicated things with my original workaround!

stevecheckoway commented 2 years ago

@aaronstillwell The HTML standard specifies how browsers should determine the character set. Unfortunately it's a bit complicated and takes in information that's not present in the HTML alone. Nevertheless, if you ignore the other inputs to the algorithm, it boils down to:

  1. If the document starts with a Unicode byte order mark, then that's the encoding;
  2. Otherwise, scan the first 1024 bytes of the document looking for a meta tag (or other indicators) that specify the character set.

The HTML5 parser performs both of these steps (and re-encodes the input to UTF-8 after detection).
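
The two steps above can be sketched roughly as follows. This is a heavy simplification of the real algorithm, which handles many more cases than one regular expression can. `sniff_encoding_label` is a hypothetical helper, not part of Nokogiri.

```ruby
# Hypothetical helper: returns a lowercase encoding label, or nil if none is found.
def sniff_encoding_label(bytes)
  data = bytes.b  # work on raw bytes so invalid sequences cannot raise

  # Step 1: a byte order mark wins outright.
  return "utf-8"    if data.start_with?("\xEF\xBB\xBF".b)
  return "utf-16be" if data.start_with?("\xFE\xFF".b)
  return "utf-16le" if data.start_with?("\xFF\xFE".b)

  # Step 2: look for charset=... inside a <meta> tag in the first 1024 bytes.
  match = data[0, 1024].match(/<meta[^>]*charset\s*=\s*["']?([a-z0-9_:.-]+)/i)
  match && match[1].downcase
end

puts sniff_encoding_label('<meta http-equiv="Content-Type" content="text/html; charset=us-ascii" />')
puts sniff_encoding_label('<meta charset="UTF-8">')
```

The returned label is only a label. It still has to be mapped to an actual encoding before the bytes are decoded.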

If you want Ruby code to perform the encoding normalization, then this is what I just put together based on my Rust code.

# frozen_string_literal: true

def normalize_encoding(encoding)
  case encoding.strip.downcase
  when 'unicode-1-1-utf-8', 'unicode11utf8', 'unicode20utf8', 'utf-8', 'utf8', 'x-unicode20utf8'
    Encoding::UTF_8
  when '866', 'cp866', 'csibm866', 'ibm866'
    Encoding::IBM866
  when 'csisolatin2', 'iso-8859-2', 'iso-ir-101', 'iso8859-2', 'iso88592', 'iso_8859-2', 'iso_8859-2:1987', 'l2', 'latin2'
    Encoding::ISO_8859_2
  when 'csisolatin3', 'iso-8859-3', 'iso-ir-109', 'iso8859-3', 'iso88593', 'iso_8859-3', 'iso_8859-3:1988', 'l3', 'latin3'
    Encoding::ISO_8859_3
  when 'csisolatin4', 'iso-8859-4', 'iso-ir-110', 'iso8859-4', 'iso88594', 'iso_8859-4', 'iso_8859-4:1988', 'l4', 'latin4'
    Encoding::ISO_8859_4
  when 'csisolatincyrillic', 'cyrillic', 'iso-8859-5', 'iso-ir-144', 'iso8859-5', 'iso88595', 'iso_8859-5', 'iso_8859-5:1988'
    Encoding::ISO_8859_5
  when 'arabic', 'asmo-708', 'csiso88596e', 'csiso88596i', 'csisolatinarabic', 'ecma-114', 'iso-8859-6', 'iso-8859-6-e', 'iso-8859-6-i', 'iso-ir-127', 'iso8859-6', 'iso88596', 'iso_8859-6', 'iso_8859-6:1987'
    Encoding::ISO_8859_6
  when 'csisolatingreek', 'ecma-118', 'elot_928', 'greek', 'greek8', 'iso-8859-7', 'iso-ir-126', 'iso8859-7', 'iso88597', 'iso_8859-7', 'iso_8859-7:1987', 'sun_eu_greek'
    Encoding::ISO_8859_7
  when 'csiso88598e', 'csisolatinhebrew', 'hebrew', 'iso-8859-8', 'iso-8859-8-e', 'iso-ir-138', 'iso8859-8', 'iso88598', 'iso_8859-8', 'iso_8859-8:1988', 'visual'
    Encoding::ISO_8859_8
  #when 'csiso88598i', 'iso-8859-8-i', 'logical'
  #  Encoding::ISO_8859_8_I
  when 'csisolatin6', 'iso-8859-10', 'iso-ir-157', 'iso8859-10', 'iso885910', 'l6', 'latin6'
    Encoding::ISO_8859_10
  when 'iso-8859-13', 'iso8859-13', 'iso885913'
    Encoding::ISO_8859_13
  when 'iso-8859-14', 'iso8859-14', 'iso885914'
    Encoding::ISO_8859_14
  when 'csisolatin9', 'iso-8859-15', 'iso8859-15', 'iso885915', 'iso_8859-15', 'l9'
    Encoding::ISO_8859_15
  when 'iso-8859-16'
    Encoding::ISO_8859_16
  when 'cskoi8r', 'koi', 'koi8', 'koi8-r', 'koi8_r'
    Encoding::KOI8_R
  when 'koi8-ru', 'koi8-u'
    Encoding::KOI8_U
  when 'csmacintosh', 'mac', 'macintosh', 'x-mac-roman'
    Encoding::MacRoman
  when 'dos-874', 'iso-8859-11', 'iso8859-11', 'iso885911', 'tis-620', 'windows-874'
    Encoding::Windows_874
  when 'cp1250', 'windows-1250', 'x-cp1250'
    Encoding::Windows_1250
  when 'cp1251', 'windows-1251', 'x-cp1251'
    Encoding::Windows_1251
  when 'ansi_x3.4-1968', 'ascii', 'cp1252', 'cp819', 'csisolatin1', 'ibm819', 'iso-8859-1', 'iso-ir-100', 'iso8859-1', 'iso88591', 'iso_8859-1', 'iso_8859-1:1987', 'l1', 'latin1', 'us-ascii', 'windows-1252', 'x-cp1252'
    Encoding::Windows_1252
  when 'cp1253', 'windows-1253', 'x-cp1253'
    Encoding::Windows_1253
  when 'cp1254', 'csisolatin5', 'iso-8859-9', 'iso-ir-148', 'iso8859-9', 'iso88599', 'iso_8859-9', 'iso_8859-9:1989', 'l5', 'latin5', 'windows-1254', 'x-cp1254'
    Encoding::Windows_1254
  when 'cp1255', 'windows-1255', 'x-cp1255'
    Encoding::Windows_1255
  when 'cp1256', 'windows-1256', 'x-cp1256'
    Encoding::Windows_1256
  when 'cp1257', 'windows-1257', 'x-cp1257'
    Encoding::Windows_1257
  when 'cp1258', 'windows-1258', 'x-cp1258'
    Encoding::Windows_1258
  when 'x-mac-cyrillic', 'x-mac-ukrainian'
    Encoding::MacCyrillic
  when 'chinese', 'csgb2312', 'csiso58gb231280', 'gb2312', 'gb_2312', 'gb_2312-80', 'gbk', 'iso-ir-58', 'x-gbk'
    Encoding::GBK
  when 'gb18030'
    Encoding::GB18030
  when 'big5', 'big5-hkscs', 'cn-big5', 'csbig5', 'x-x-big5'
    Encoding::Big5
  when 'cseucpkdfmtjapanese', 'euc-jp', 'x-euc-jp'
    Encoding::EUC_JP
  when 'csiso2022jp', 'iso-2022-jp'
    Encoding::ISO_2022_JP
  when 'csshiftjis', 'ms932', 'ms_kanji', 'shift-jis', 'shift_jis', 'sjis', 'windows-31j', 'x-sjis'
    Encoding::Shift_JIS
  when 'cseuckr', 'csksc56011987', 'euc-kr', 'iso-ir-149', 'korean', 'ks_c_5601-1987', 'ks_c_5601-1989', 'ksc5601', 'ksc_5601', 'windows-949'
    Encoding::EUC_KR
  #when 'csiso2022kr', 'hz-gb-2312', 'iso-2022-cn', 'iso-2022-cn-ext', 'iso-2022-kr', 'replacement'
  #  Encoding::Replacement
  when 'unicodefffe', 'utf-16be'
    Encoding::UTF_16BE
  when 'csunicode', 'iso-10646-ucs-2', 'ucs-2', 'unicode', 'unicodefeff', 'utf-16', 'utf-16le'
    Encoding::UTF_16LE
  #when 'x-user-defined'
  #  Encoding::X_User_Defined
  else
    nil
  end
end

ENCODINGS = [
    Encoding::UTF_8,
    Encoding::IBM866,
    Encoding::ISO_8859_2,
    Encoding::ISO_8859_3,
    Encoding::ISO_8859_4,
    Encoding::ISO_8859_5,
    Encoding::ISO_8859_6,
    Encoding::ISO_8859_7,
    Encoding::ISO_8859_8,
    #Encoding::ISO_8859_8_I,
    Encoding::ISO_8859_10,
    Encoding::ISO_8859_13,
    Encoding::ISO_8859_14,
    Encoding::ISO_8859_15,
    Encoding::ISO_8859_16,
    Encoding::KOI8_R,
    Encoding::KOI8_U,
    Encoding::MacRoman,
    Encoding::Windows_874,
    Encoding::Windows_1250,
    Encoding::Windows_1251,
    Encoding::Windows_1252,
    Encoding::Windows_1253,
    Encoding::Windows_1254,
    Encoding::Windows_1255,
    Encoding::Windows_1256,
    Encoding::Windows_1257,
    Encoding::Windows_1258,
    Encoding::MacCyrillic,
    Encoding::GBK,
    Encoding::GB18030,
    Encoding::Big5,
    Encoding::EUC_JP,
    Encoding::ISO_2022_JP,
    Encoding::Shift_JIS,
    Encoding::EUC_KR,
    #Encoding::Replacement,
    Encoding::UTF_16BE,
    Encoding::UTF_16LE,
    #Encoding::X_User_Defined,
]

And I included a list of the supported encodings at the end. Note that ISO-8859-8-I doesn't seem to be supported. X_User_Defined is a simple encoding that might have another name I don't know (or might not be supported by Ruby). The Replacement encoding isn't a real encoding. (Decoding the empty string produces the empty string. Decoding anything else produces a 1-character string containing the Unicode replacement character.)

If you want, you could take the reencode code, add that normalization step and then ask Nokogiri to parse using the normalized encoding.
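
As a sketch of what that combination might look like, the following decodes raw bytes according to a declared label and returns UTF-8. It uses a deliberately tiny stand-in for the full table above so that the example is self-contained. `LABEL_TO_ENCODING` and `html_bytes_to_utf8` are hypothetical names, and the fallback to Windows-1252 for unknown labels is an assumption of this sketch.

```ruby
# Deliberately tiny, illustrative subset of the label table above.
LABEL_TO_ENCODING = {
  "us-ascii"     => Encoding::Windows_1252,
  "ascii"        => Encoding::Windows_1252,
  "iso-8859-1"   => Encoding::Windows_1252,
  "latin1"       => Encoding::Windows_1252,
  "windows-1252" => Encoding::Windows_1252,
  "utf-8"        => Encoding::UTF_8,
  "utf8"         => Encoding::UTF_8
}.freeze

# Hypothetical helper: decode raw bytes per the declared label and return UTF-8.
# Unknown labels fall back to Windows-1252 (an assumption of this sketch).
def html_bytes_to_utf8(bytes, label)
  encoding = LABEL_TO_ENCODING.fetch(label.to_s.strip.downcase, Encoding::Windows_1252)
  tagged = bytes.dup.force_encoding(encoding)
  # Encoding UTF-8 to UTF-8 is a no-op, so scrub invalid bytes explicitly in that case.
  return tagged.scrub if encoding == Encoding::UTF_8

  tagged.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

puts html_bytes_to_utf8("<div>\xA0</div>".b, "us-ascii") == "<div>\u00A0</div>"  # true
puts html_bytes_to_utf8("caf\xC3\xA9".b, "UTF-8") == "caf\u00E9"                 # true
```

The resulting string could then be handed to the parser with an explicit "UTF-8" encoding argument, in the same style as the `Nokogiri::HTML(html, nil, "ISO-8859-1")` call earlier in the thread. Documents that declare utf-8 are left as UTF-8 rather than being decoded as Latin-1, which is the kind of mismatch that tends to produce mojibake.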

flavorjones commented 2 years ago

One more note: Ruby's ASCII-8BIT encoding is not actually an encoding -- it's what Ruby uses to signify "this is binary data". It's an alias for BINARY. As such, it says nothing about the actual encoding of the bytes representing the document.
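
A quick check in plain Ruby shows the alias. The helper name `binary_string?` is made up for this sketch.

```ruby
# ASCII-8BIT and BINARY name the same Encoding object.
puts Encoding::ASCII_8BIT.equal?(Encoding::BINARY)  # true

# Hypothetical helper: is this string tagged as raw bytes?
def binary_string?(str)
  str.encoding == Encoding::BINARY
end

bytes = "\xA0".b
puts binary_string?(bytes)   # true
puts bytes.bytes.inspect     # [160]

# The tag says nothing about what the bytes mean; relabelling changes the
# interpretation but leaves the bytes untouched.
puts bytes.dup.force_encoding(Encoding::ISO_8859_1).bytes.inspect  # [160]
```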

aaronstillwell commented 2 years ago

Thanks for your combined input @stevecheckoway @flavorjones! Happy to close this out from my end.