samuelchen opened this issue 4 years ago
Hi @samuelchen @Gallaecio,
The source of the encoding detection problem seems to be the invalid input HTML itself, not w3lib. The page contains an invalid meta tag: `<meta httpequiv="ContentType" ...`, but the valid (per W3C) form is `<meta http-equiv="Content-Type" ...` (the dash characters are missing). Because of that, w3lib does not detect the declared encoding.
beautifulsoup4 does detect the 'gbk' encoding, because it uses a naive regex as a fallback for encoding detection (lib: beautifulsoup4, file: bs4/dammit.py, line: `html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'`).
For @samuelchen's problem, w3lib could be made more forgiving/lenient by updating (lib: w3lib, file: w3lib/encoding.py):

From: `_HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')`
To: `_HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type')`

After this fix, w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;)
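As a quick sanity check of the proposed change, here is a minimal sketch using a simplified stand-in for w3lib's real `_TEMPLATE` (the actual pattern in w3lib/encoding.py is more elaborate); it shows the lenient variant matching both the malformed and the valid attribute spellings:

```python
import re

# Simplified stand-in for w3lib's _TEMPLATE; just enough to show the
# missing-dash problem, not the real pattern from w3lib/encoding.py.
_TEMPLATE = r'%s\s*=\s*["\']?\s*%s'

strict = re.compile(_TEMPLATE % ('http-equiv', 'Content-Type'), re.I)
lenient = re.compile(_TEMPLATE % (r'http-?equiv', r'Content-?Type'), re.I)

broken = 'httpequiv="ContentType" content="text/html; charset=gbk"'
valid = 'http-equiv="Content-Type" content="text/html; charset=gbk"'

print(strict.search(broken))   # None: the strict pattern misses the malformed tag
print(lenient.search(broken))  # matches: optional dashes cover both spellings
print(lenient.search(valid))   # valid HTML still matches
```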
What do you think, @Gallaecio?
More details below.
Details
I was able to reproduce the issue with the provided input:
Test python script:

```python
from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'

enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))
```
Script output:

```
html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk
```
Beautiful Soup detects the encoding only with the 'lxml' parser, via its fallback encoding detection:
lib: beautifulsoup4
file: bs4/dammit.py
line: `html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'`
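To illustrate why that fallback succeeds here, the quoted regex can be run directly against the malformed meta tag from the test script; it keys on `charset=` alone and never inspects the http-equiv attribute:

```python
import re

# The bs4/dammit.py fallback charset regex quoted above, compiled for bytes.
html_meta = b'<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
pattern = re.compile(html_meta, re.I)

tag = b'<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
m = pattern.search(tag)
print(m.group(1))  # b'gbk' -- found despite the malformed http-equiv attribute
```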
@kostalski Thank you for the feedback. I am not able to recall why that HTML contained `httpequiv="ContentType"`. I am not sure whether it was converted by other parts of scrapy or is the original markup. Sorry, it was too long ago to remember.
By the way, GB18030 is compatible with GBK.
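That compatibility is easy to verify: GB18030 is a superset of GBK, so any valid GBK byte sequence (such as the `<TITLE>` bytes from the test script above) decodes identically under both codecs:

```python
# GB18030 is a superset of GBK: valid GBK bytes decode the same in both.
# These are the <TITLE> bytes from the test script above.
raw = b'\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc'
assert raw.decode('gbk') == raw.decode('gb18030')
print(raw.decode('gb18030'))  # the four-character Chinese title
```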
Ok @samuelchen, no problem :+1:
Hi,
Thank you guys for the great framework.
I am using scrapy to crawl multiple sites with different encodings. One site is encoded as 'gbk', and the encoding is declared in an HTML meta tag, but scrapy cannot auto-detect it.
I tried Beautiful Soup, and it parses the page correctly. So I dug into w3lib and found that the pattern `_BODY_ENCODING_BYTES_RE` cannot find the encoding declared in the meta tag. HTML snippet as below:
my test :