Closed aaronstillwell closed 2 years ago
@aaronstillwell Hi! Thanks for asking this question, and sorry you're having trouble. I'll try to help.
First, it's important to note that the document encoding may be different from the string encoding. We see this here, where \xA0 is neither ASCII (which confuses libxml2 into terminating parsing) nor valid UTF-8 (which generates warnings for us).
The right thing to do here is to tell Nokogiri and libxml2 what the document's encoding is. To be fair, this document is encoded incorrectly -- I believe the author means to include a non-breaking space, but has done so in an incorrect way, likely trying to inject Unicode bytes. The portable way to communicate a non-breaking space in an HTML document is to use an entity, one of &nbsp;, &#160;, or &#xA0;.
Anyway. If you want to avoid parsing errors or string encoding errors, try this:
doc = Nokogiri::HTML(html, nil, "ISO-8859-1")
which tells Nokogiri to treat the document as if it were in Latin-1 encoding, which allows the full 256 values of an 8-bit character (ASCII only allows the lower 128 values).
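To see why this byte causes trouble in the first place, here is a minimal pure-Ruby sketch (no Nokogiri involved; the sample string and variable names are made up for illustration) that checks the same byte under the three encodings discussed above.

```ruby
# A made-up three-byte sample: "a", the raw byte 0xA0, "b".
bytes = "a\xA0b".b

# Tag the same bytes with each candidate encoding (no bytes are changed).
as_ascii  = bytes.dup.force_encoding(Encoding::US_ASCII)
as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

ascii_ok  = as_ascii.valid_encoding?   # false: 0xA0 is outside the 7-bit range
utf8_ok   = as_utf8.valid_encoding?    # false: 0xA0 is a lone continuation byte
latin1_ok = as_latin1.valid_encoding?  # true: every byte value is defined in Latin-1

# Transcoding from Latin-1 turns the single byte into U+00A0 (two bytes in UTF-8).
utf8_text = as_latin1.encode(Encoding::UTF_8)

puts "US-ASCII valid?   #{ascii_ok}"
puts "UTF-8 valid?      #{utf8_ok}"
puts "ISO-8859-1 valid? #{latin1_ok}"
puts "As UTF-8:         #{utf8_text.inspect}"
```

Only the Latin-1 interpretation is valid, which is why passing "ISO-8859-1" as the document encoding avoids both the truncated parse and the string encoding warnings.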
Does this all make sense? Happy to answer questions; encoding can be challenging, particularly when things are going wrong.
I'm not sure what the correct behavior for XHTML is since it depends on the rules for XML which I don't know in great detail.
For HTML, I think the document is correct. The us-ascii
charset is mapped to Windows 1252 which uses \xA0
for NBSP. But I don't know how the string encoding interacts with this.
Steve, you're right, I think, but there are some interesting behaviors interacting here.
- [...] us-ascii) ... but then libxml2 does not handle this character
- [...] us-ascii in the call to .HTML(), libxml2 doesn't handle this character (regardless of string encoding).
- Looking at libxml2's encoding.c, it's not obvious to me why it doesn't handle this character. But ...
@aaronstillwell So another approach is to move to the HTML5 parser which handles this document fine.
doc = Nokogiri::HTML5(html)
Another option may be to convert \xA0 to \u00A0, which (if I got my Ruby syntax correct) should give the correct Unicode character, but then the charset would need to be changed to utf-8. (Edit: Or I guess should be changed. Otherwise, you might have issues serializing the DOM and then reading it in a browser, which will do the charset detection, decide on Windows 1252, and then fail to parse the multi-byte NBSP character.)
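A rough sketch of that conversion in plain Ruby, assuming the document really is Windows-1252 as the browser rules imply. The sample markup and variable names are invented, and the regex only handles this simple form of the meta tag; it is not a general charset rewriter.

```ruby
# Made-up sample document, held as raw bytes.
raw = "<meta charset=\"us-ascii\"><p>a\xA0b</p>".b

# Interpret the bytes as Windows-1252, then transcode to UTF-8.
# The byte 0xA0 becomes the character U+00A0 (NBSP).
utf8 = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

# Update the declared charset so the label matches the new bytes.
fixed = utf8.sub(/(charset\s*=\s*["']?)us-ascii/i) do
  "#{Regexp.last_match(1)}utf-8"
end

puts fixed.inspect
```

Rewriting the label matters for the reason given in the edit note: a consumer that trusts a stale us-ascii label would misread the two-byte NBSP.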
Thanks both for your input here.
@flavorjones is there any trade-off to using the HTML5 parser? I understand from the docs that this uses an alternative parser under the hood, however I assume given its recommendation here that despite the name, it can be used for all sorts of (X)HTML? If so that looks to me like the easiest fix for my use case.
Additionally, if I understand correctly that this is not expected behavior, and this stems from libxml, should I be raising an issue there?
@flavorjones just attempting your recommendation, but I see the following error:
/home/app/vendor/bundle/ruby/3.1.0/gems/nokogiri-1.13.3-x86_64-linux/lib/nokogiri/html5.rb:389:in `encode': "\xA0" on US-ASCII (Encoding::InvalidByteSequenceError)
Is this expected?
@aaronstillwell One trade-off is that the modern HTML parser (the one not built using libxml2) does not support XHTML.
Ah, I think I see what the problem is (although my line numbers are a ways off).
# change the encoding to match the detected or inferred encoding
body = body.dup
begin
  body.force_encoding(encoding)
rescue ArgumentError
  body.force_encoding(Encoding::ISO_8859_1)
end
I think it has detected the us-ascii
charset and then tried to force encode the string using US-ASCII which is not what browsers do.
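The following standalone sketch (not Nokogiri's code; the sample string and variable names are hypothetical) illustrates why the rescue in that snippet would not help here: force_encoding only raises ArgumentError for an unknown encoding name, while a known but ill-fitting one such as us-ascii is accepted and fails later, when the string is transcoded.

```ruby
# Made-up sample body containing the problematic byte.
body = "a\xA0b".b

# An unknown label raises ArgumentError, which is the case the rescue covers.
unknown_label_error =
  begin
    body.dup.force_encoding("no-such-encoding")
    nil
  rescue ArgumentError => e
    e
  end

# "us-ascii" is a known encoding name, so this succeeds and no rescue runs.
tagged = body.dup.force_encoding("us-ascii")

# The failure only surfaces when the string is transcoded.
encode_error =
  begin
    tagged.encode(Encoding::UTF_8)
    nil
  rescue Encoding::InvalidByteSequenceError => e
    e
  end

puts "unknown label: #{unknown_label_error.class}"
puts "tagged as:     #{tagged.encoding}"
puts "on encode:     #{encode_error.class}"
```

This is consistent with the InvalidByteSequenceError reported above, which is raised from a call to encode rather than from force_encoding.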
I think a better solution would be to map the encoding labels to their encodings and then use force_encoding()
with that map. I actually wrote some code to do that; however, it's in Rust (included below). I can't imagine anyone is relying on the current behavior, but this would be a breaking change and I've been surprised before.
pub fn lookup(label: &str) -> Option<Encoding> {
    match label.trim().to_ascii_lowercase().as_str() {
        "unicode-1-1-utf-8"
        | "unicode11utf8"
        | "unicode20utf8"
        | "utf-8"
        | "utf8"
        | "x-unicode20utf8" => Some(Encoding::UTF_8),
        "866"
        | "cp866"
        | "csibm866"
        | "ibm866" => Some(Encoding::IBM866),
        "csisolatin2"
        | "iso-8859-2"
        | "iso-ir-101"
        | "iso8859-2"
        | "iso88592"
        | "iso_8859-2"
        | "iso_8859-2:1987"
        | "l2"
        | "latin2" => Some(Encoding::ISO_8859_2),
        "csisolatin3"
        | "iso-8859-3"
        | "iso-ir-109"
        | "iso8859-3"
        | "iso88593"
        | "iso_8859-3"
        | "iso_8859-3:1988"
        | "l3"
        | "latin3" => Some(Encoding::ISO_8859_3),
        "csisolatin4"
        | "iso-8859-4"
        | "iso-ir-110"
        | "iso8859-4"
        | "iso88594"
        | "iso_8859-4"
        | "iso_8859-4:1988"
        | "l4"
        | "latin4" => Some(Encoding::ISO_8859_4),
        "csisolatincyrillic"
        | "cyrillic"
        | "iso-8859-5"
        | "iso-ir-144"
        | "iso8859-5"
        | "iso88595"
        | "iso_8859-5"
        | "iso_8859-5:1988" => Some(Encoding::ISO_8859_5),
        "arabic"
        | "asmo-708"
        | "csiso88596e"
        | "csiso88596i"
        | "csisolatinarabic"
        | "ecma-114"
        | "iso-8859-6"
        | "iso-8859-6-e"
        | "iso-8859-6-i"
        | "iso-ir-127"
        | "iso8859-6"
        | "iso88596"
        | "iso_8859-6"
        | "iso_8859-6:1987" => Some(Encoding::ISO_8859_6),
        "csisolatingreek"
        | "ecma-118"
        | "elot_928"
        | "greek"
        | "greek8"
        | "iso-8859-7"
        | "iso-ir-126"
        | "iso8859-7"
        | "iso88597"
        | "iso_8859-7"
        | "iso_8859-7:1987"
        | "sun_eu_greek" => Some(Encoding::ISO_8859_7),
        "csiso88598e"
        | "csisolatinhebrew"
        | "hebrew"
        | "iso-8859-8"
        | "iso-8859-8-e"
        | "iso-ir-138"
        | "iso8859-8"
        | "iso88598"
        | "iso_8859-8"
        | "iso_8859-8:1988"
        | "visual" => Some(Encoding::ISO_8859_8),
        "csiso88598i"
        | "iso-8859-8-i"
        | "logical" => Some(Encoding::ISO_8859_8_I),
        "csisolatin6"
        | "iso-8859-10"
        | "iso-ir-157"
        | "iso8859-10"
        | "iso885910"
        | "l6"
        | "latin6" => Some(Encoding::ISO_8859_10),
        "iso-8859-13"
        | "iso8859-13"
        | "iso885913" => Some(Encoding::ISO_8859_13),
        "iso-8859-14"
        | "iso8859-14"
        | "iso885914" => Some(Encoding::ISO_8859_14),
        "csisolatin9"
        | "iso-8859-15"
        | "iso8859-15"
        | "iso885915"
        | "iso_8859-15"
        | "l9" => Some(Encoding::ISO_8859_15),
        "iso-8859-16" => Some(Encoding::ISO_8859_16),
        "cskoi8r"
        | "koi"
        | "koi8"
        | "koi8-r"
        | "koi8_r" => Some(Encoding::KOI8_R),
        "koi8-ru"
        | "koi8-u" => Some(Encoding::KOI8_U),
        "csmacintosh"
        | "mac"
        | "macintosh"
        | "x-mac-roman" => Some(Encoding::Macintosh),
        "dos-874"
        | "iso-8859-11"
        | "iso8859-11"
        | "iso885911"
        | "tis-620"
        | "windows-874" => Some(Encoding::Windows_874),
        "cp1250"
        | "windows-1250"
        | "x-cp1250" => Some(Encoding::Windows_1250),
        "cp1251"
        | "windows-1251"
        | "x-cp1251" => Some(Encoding::Windows_1251),
        "ansi_x3.4-1968"
        | "ascii"
        | "cp1252"
        | "cp819"
        | "csisolatin1"
        | "ibm819"
        | "iso-8859-1"
        | "iso-ir-100"
        | "iso8859-1"
        | "iso88591"
        | "iso_8859-1"
        | "iso_8859-1:1987"
        | "l1"
        | "latin1"
        | "us-ascii"
        | "windows-1252"
        | "x-cp1252" => Some(Encoding::Windows_1252),
        "cp1253"
        | "windows-1253"
        | "x-cp1253" => Some(Encoding::Windows_1253),
        "cp1254"
        | "csisolatin5"
        | "iso-8859-9"
        | "iso-ir-148"
        | "iso8859-9"
        | "iso88599"
        | "iso_8859-9"
        | "iso_8859-9:1989"
        | "l5"
        | "latin5"
        | "windows-1254"
        | "x-cp1254" => Some(Encoding::Windows_1254),
        "cp1255"
        | "windows-1255"
        | "x-cp1255" => Some(Encoding::Windows_1255),
        "cp1256"
        | "windows-1256"
        | "x-cp1256" => Some(Encoding::Windows_1256),
        "cp1257"
        | "windows-1257"
        | "x-cp1257" => Some(Encoding::Windows_1257),
        "cp1258"
        | "windows-1258"
        | "x-cp1258" => Some(Encoding::Windows_1258),
        "x-mac-cyrillic"
        | "x-mac-ukrainian" => Some(Encoding::X_Mac_Cyrillic),
        "chinese"
        | "csgb2312"
        | "csiso58gb231280"
        | "gb2312"
        | "gb_2312"
        | "gb_2312-80"
        | "gbk"
        | "iso-ir-58"
        | "x-gbk" => Some(Encoding::GBK),
        "gb18030" => Some(Encoding::Gb18030),
        "big5"
        | "big5-hkscs"
        | "cn-big5"
        | "csbig5"
        | "x-x-big5" => Some(Encoding::Big5),
        "cseucpkdfmtjapanese"
        | "euc-jp"
        | "x-euc-jp" => Some(Encoding::EUC_JP),
        "csiso2022jp"
        | "iso-2022-jp" => Some(Encoding::ISO_2022_JP),
        "csshiftjis"
        | "ms932"
        | "ms_kanji"
        | "shift-jis"
        | "shift_jis"
        | "sjis"
        | "windows-31j"
        | "x-sjis" => Some(Encoding::Shift_JIS),
        "cseuckr"
        | "csksc56011987"
        | "euc-kr"
        | "iso-ir-149"
        | "korean"
        | "ks_c_5601-1987"
        | "ks_c_5601-1989"
        | "ksc5601"
        | "ksc_5601"
        | "windows-949" => Some(Encoding::EUC_KR),
        "csiso2022kr"
        | "hz-gb-2312"
        | "iso-2022-cn"
        | "iso-2022-cn-ext"
        | "iso-2022-kr"
        | "replacement" => Some(Encoding::Replacement),
        "unicodefffe"
        | "utf-16be" => Some(Encoding::UTF_16BE),
        "csunicode"
        | "iso-10646-ucs-2"
        | "ucs-2"
        | "unicode"
        | "unicodefeff"
        | "utf-16"
        | "utf-16le" => Some(Encoding::UTF_16LE),
        "x-user-defined" => Some(Encoding::X_User_Defined),
        _ => None,
    }
}
@aaronstillwell Are you unblocked at this point? It seems like there are a few ways offered above to try to parse the document. But probably your best bet is to continue using the HTML4 parser and pass in a compatible encoding to the Document#parse()
method call.
@flavorjones @stevecheckoway thanks both for your thorough follow-ups. One last question if I may before we close this out - is there an encoding map like the one @stevecheckoway just produced in rust for use w/ nokogiri?
I can hard-code support this way for ASCII-8BIT
to use ISO-8859-1
but I imagine being able to ensure this is properly enforced for all encodings is the right way to ensure no other problems arise (I cannot control what documents may be sent my way 😄 )
Hey folks, just circling back here. Having been up and running with the hard coded support for ASCII-8BIT
to ISO-8859-1
, I'm frequently running into issues where characters aren't correctly appearing in the transformed document, resulting in characters like Ä...
Looking at a couple of problematic examples, I can see that the Ruby string is encoded using ASCII-8BIT, but the HTML document contains a meta tag suggesting something different, e.g.
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta charset="UTF-8">
Would I be right in concluding that, in my case:
- [...] head section then terminate?)
- [...] ISO-8859-1 on us-ascii encoded docs as per the original post
I fully appreciate we might be getting a little bit further into the nuances of my own implementation rather than issues core to Nokogiri - happy to close this out upon clarifying some of the above.
edit:
Or just force ISO-8859-1
when I find us-ascii
encoded docs to circumvent my original issue, and leave everything else to default nokogiri? I may have overcomplicated things with my original workaround!
@aaronstillwell The HTML standard specifies how browsers should determine the character set. Unfortunately it's a bit complicated and takes in information that's not present in the HTML alone. Nevertheless, if you ignore the other inputs to the algorithm, it boils down to
- [...] meta tag (or other indicators) that specify the character set.
The HTML 5 parser performs both of these steps (and re-encodes the input to UTF-8 after detection).
If you want Ruby code to perform the encoding normalization, this is what I just put together based on my Rust code.
# frozen_string_literal: true

def normalize_encoding(encoding)
  case encoding.strip.downcase
  when 'unicode-1-1-utf-8', 'unicode11utf8', 'unicode20utf8', 'utf-8', 'utf8', 'x-unicode20utf8'
    Encoding::UTF_8
  when '866', 'cp866', 'csibm866', 'ibm866'
    Encoding::IBM866
  when 'csisolatin2', 'iso-8859-2', 'iso-ir-101', 'iso8859-2', 'iso88592', 'iso_8859-2', 'iso_8859-2:1987', 'l2', 'latin2'
    Encoding::ISO_8859_2
  when 'csisolatin3', 'iso-8859-3', 'iso-ir-109', 'iso8859-3', 'iso88593', 'iso_8859-3', 'iso_8859-3:1988', 'l3', 'latin3'
    Encoding::ISO_8859_3
  when 'csisolatin4', 'iso-8859-4', 'iso-ir-110', 'iso8859-4', 'iso88594', 'iso_8859-4', 'iso_8859-4:1988', 'l4', 'latin4'
    Encoding::ISO_8859_4
  when 'csisolatincyrillic', 'cyrillic', 'iso-8859-5', 'iso-ir-144', 'iso8859-5', 'iso88595', 'iso_8859-5', 'iso_8859-5:1988'
    Encoding::ISO_8859_5
  when 'arabic', 'asmo-708', 'csiso88596e', 'csiso88596i', 'csisolatinarabic', 'ecma-114', 'iso-8859-6', 'iso-8859-6-e', 'iso-8859-6-i', 'iso-ir-127', 'iso8859-6', 'iso88596', 'iso_8859-6', 'iso_8859-6:1987'
    Encoding::ISO_8859_6
  when 'csisolatingreek', 'ecma-118', 'elot_928', 'greek', 'greek8', 'iso-8859-7', 'iso-ir-126', 'iso8859-7', 'iso88597', 'iso_8859-7', 'iso_8859-7:1987', 'sun_eu_greek'
    Encoding::ISO_8859_7
  when 'csiso88598e', 'csisolatinhebrew', 'hebrew', 'iso-8859-8', 'iso-8859-8-e', 'iso-ir-138', 'iso8859-8', 'iso88598', 'iso_8859-8', 'iso_8859-8:1988', 'visual'
    Encoding::ISO_8859_8
  #when 'csiso88598i', 'iso-8859-8-i', 'logical'
  #  Encoding::ISO_8859_8_I
  when 'csisolatin6', 'iso-8859-10', 'iso-ir-157', 'iso8859-10', 'iso885910', 'l6', 'latin6'
    Encoding::ISO_8859_10
  when 'iso-8859-13', 'iso8859-13', 'iso885913'
    Encoding::ISO_8859_13
  when 'iso-8859-14', 'iso8859-14', 'iso885914'
    Encoding::ISO_8859_14
  when 'csisolatin9', 'iso-8859-15', 'iso8859-15', 'iso885915', 'iso_8859-15', 'l9'
    Encoding::ISO_8859_15
  when 'iso-8859-16'
    Encoding::ISO_8859_16
  when 'cskoi8r', 'koi', 'koi8', 'koi8-r', 'koi8_r'
    Encoding::KOI8_R
  when 'koi8-ru', 'koi8-u'
    Encoding::KOI8_U
  when 'csmacintosh', 'mac', 'macintosh', 'x-mac-roman'
    Encoding::MacRoman
  when 'dos-874', 'iso-8859-11', 'iso8859-11', 'iso885911', 'tis-620', 'windows-874'
    Encoding::Windows_874
  when 'cp1250', 'windows-1250', 'x-cp1250'
    Encoding::Windows_1250
  when 'cp1251', 'windows-1251', 'x-cp1251'
    Encoding::Windows_1251
  when 'ansi_x3.4-1968', 'ascii', 'cp1252', 'cp819', 'csisolatin1', 'ibm819', 'iso-8859-1', 'iso-ir-100', 'iso8859-1', 'iso88591', 'iso_8859-1', 'iso_8859-1:1987', 'l1', 'latin1', 'us-ascii', 'windows-1252', 'x-cp1252'
    Encoding::Windows_1252
  when 'cp1253', 'windows-1253', 'x-cp1253'
    Encoding::Windows_1253
  when 'cp1254', 'csisolatin5', 'iso-8859-9', 'iso-ir-148', 'iso8859-9', 'iso88599', 'iso_8859-9', 'iso_8859-9:1989', 'l5', 'latin5', 'windows-1254', 'x-cp1254'
    Encoding::Windows_1254
  when 'cp1255', 'windows-1255', 'x-cp1255'
    Encoding::Windows_1255
  when 'cp1256', 'windows-1256', 'x-cp1256'
    Encoding::Windows_1256
  when 'cp1257', 'windows-1257', 'x-cp1257'
    Encoding::Windows_1257
  when 'cp1258', 'windows-1258', 'x-cp1258'
    Encoding::Windows_1258
  when 'x-mac-cyrillic', 'x-mac-ukrainian'
    Encoding::MacCyrillic
  when 'chinese', 'csgb2312', 'csiso58gb231280', 'gb2312', 'gb_2312', 'gb_2312-80', 'gbk', 'iso-ir-58', 'x-gbk'
    Encoding::GBK
  when 'gb18030'
    Encoding::GB18030
  when 'big5', 'big5-hkscs', 'cn-big5', 'csbig5', 'x-x-big5'
    Encoding::Big5
  when 'cseucpkdfmtjapanese', 'euc-jp', 'x-euc-jp'
    Encoding::EUC_JP
  when 'csiso2022jp', 'iso-2022-jp'
    Encoding::ISO_2022_JP
  when 'csshiftjis', 'ms932', 'ms_kanji', 'shift-jis', 'shift_jis', 'sjis', 'windows-31j', 'x-sjis'
    Encoding::Shift_JIS
  when 'cseuckr', 'csksc56011987', 'euc-kr', 'iso-ir-149', 'korean', 'ks_c_5601-1987', 'ks_c_5601-1989', 'ksc5601', 'ksc_5601', 'windows-949'
    Encoding::EUC_KR
  #when 'csiso2022kr', 'hz-gb-2312', 'iso-2022-cn', 'iso-2022-cn-ext', 'iso-2022-kr', 'replacement'
  #  Encoding::Replacement
  when 'unicodefffe', 'utf-16be'
    Encoding::UTF_16BE
  when 'csunicode', 'iso-10646-ucs-2', 'ucs-2', 'unicode', 'unicodefeff', 'utf-16', 'utf-16le'
    Encoding::UTF_16LE
  #when 'x-user-defined'
  #  Encoding::X_User_Defined
  else
    nil
  end
end

ENCODINGS = [
  Encoding::UTF_8,
  Encoding::IBM866,
  Encoding::ISO_8859_2,
  Encoding::ISO_8859_3,
  Encoding::ISO_8859_4,
  Encoding::ISO_8859_5,
  Encoding::ISO_8859_6,
  Encoding::ISO_8859_7,
  Encoding::ISO_8859_8,
  #Encoding::ISO_8859_8_I,
  Encoding::ISO_8859_10,
  Encoding::ISO_8859_13,
  Encoding::ISO_8859_14,
  Encoding::ISO_8859_15,
  Encoding::ISO_8859_16,
  Encoding::KOI8_R,
  Encoding::KOI8_U,
  Encoding::MacRoman,
  Encoding::Windows_874,
  Encoding::Windows_1250,
  Encoding::Windows_1251,
  Encoding::Windows_1252,
  Encoding::Windows_1253,
  Encoding::Windows_1254,
  Encoding::Windows_1255,
  Encoding::Windows_1256,
  Encoding::Windows_1257,
  Encoding::Windows_1258,
  Encoding::MacCyrillic,
  Encoding::GBK,
  Encoding::GB18030,
  Encoding::Big5,
  Encoding::EUC_JP,
  Encoding::ISO_2022_JP,
  Encoding::Shift_JIS,
  Encoding::EUC_KR,
  #Encoding::Replacement,
  Encoding::UTF_16BE,
  Encoding::UTF_16LE,
  #Encoding::X_User_Defined,
]
And I included a list of the supported encodings at the end. Note that ISO-8859-8-I doesn't seem to be supported. X_User_Defined is a simple encoding that might have another name I don't know (or might not be supported by Ruby). The Replacement encoding isn't a real encoding. (Decoding the empty string produces the empty string. Decoding anything else produces a 1-character string containing the Unicode replacement character.)
If you want, you could take the reencode code, add that normalization step and then ask Nokogiri to parse using the normalized encoding.
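As a rough illustration of how such a pipeline might fit together, here is a self-contained sketch. Everything in it is hypothetical: LABELS is a heavily abbreviated stand-in for the full label table, sniff_label is a naive regex rather than the HTML standard's prescan algorithm, and the Windows-1252 fallback is an assumption, not something Nokogiri does.

```ruby
# Hypothetical, heavily abbreviated label table (the real list is much longer).
LABELS = {
  "utf-8" => Encoding::UTF_8,
  "utf8" => Encoding::UTF_8,
  "us-ascii" => Encoding::Windows_1252,
  "ascii" => Encoding::Windows_1252,
  "iso-8859-1" => Encoding::Windows_1252,
  "latin1" => Encoding::Windows_1252,
  "windows-1252" => Encoding::Windows_1252
}.freeze

# Naive sniffing: look for charset=... in the first 1024 bytes only.
def sniff_label(bytes)
  head = bytes.b.byteslice(0, 1024)
  match = head.match(/charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i)
  match && match[1]
end

# Tag the bytes with the normalized encoding, then transcode to UTF-8.
def to_utf8(bytes, fallback: Encoding::Windows_1252)
  label = sniff_label(bytes)
  encoding = (label && LABELS[label.downcase]) || fallback
  bytes.b.force_encoding(encoding)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

ascii_doc = "<meta charset=\"us-ascii\"><p>a\xA0b</p>".b
utf8_doc  = "<meta charset=\"UTF-8\"><p>\xC3\x84</p>".b

puts to_utf8(ascii_doc).inspect
puts to_utf8(utf8_doc).inspect
```

The resulting string could then be handed to the parser with an explicit encoding, along the lines of Nokogiri::HTML(utf8, nil, "UTF-8"). Note that the meta tag inside the document would still carry the old label unless it is rewritten as well.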
One more note: Ruby's ASCII-8BIT
encoding is not actually an encoding -- it's what Ruby uses to signify "this is binary data". It's an alias for BINARY
. As such, it says nothing about the actual encoding of the bytes representing the document.
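A small sketch of what that means in practice, with a made-up two-byte sample. The same binary data yields different text depending on which encoding is assumed, which would also explain stray characters when UTF-8 input is treated as ISO-8859-1.

```ruby
# ASCII-8BIT and BINARY are the same Encoding object.
same_object = Encoding::ASCII_8BIT == Encoding::BINARY

# Made-up sample: the two bytes that encode U+00C4 in UTF-8.
data = "\xC3\x84".b

# Interpreted as UTF-8, the two bytes form one character.
as_utf8 = data.dup.force_encoding(Encoding::UTF_8)

# Interpreted as ISO-8859-1, each byte becomes its own character (mojibake).
as_latin1 = data.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts "same object:   #{same_object}"
puts "as UTF-8:      #{as_utf8.inspect} (#{as_utf8.length} character)"
puts "as ISO-8859-1: #{as_latin1.inspect} (#{as_latin1.length} characters)"
```

So a blanket ASCII-8BIT to ISO-8859-1 rule is only safe when the bytes really are Latin-1; documents whose meta tag declares UTF-8 need to be decoded as UTF-8.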
Thanks for your combined input @stevecheckoway @flavorjones! Happy to close this out from my end.
What problem are you trying to solve?
I'm trying to parse an HTML document, encoded with ASCII-8BIT, which features the \xA0 character. Nokogiri seems to quietly stop parsing the document when encountering such a character, as per the examples below.
Is this expected behaviour? If so, what is the recommended approach to parsing documents correctly encoded with this character?
Please show your code!
Environment
Additional context
The way the HTML is presented in my use case is not within my control.