scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License

Scrapy can not auto detect GBK html encoding #155

Open samuelchen opened 4 years ago

samuelchen commented 4 years ago

Hi,

Thank you for the great framework.

I am using Scrapy to crawl multiple sites with different encodings. One site is encoded as 'gbk', and the encoding is declared in an HTML meta tag, but Scrapy cannot auto-detect it.

I tried Beautiful Soup, which parses the page correctly. So I dug into w3lib and found that the pattern _BODY_ENCODING_BYTES_RE cannot find the encoding declared in the meta tag.

HTML snippet as below:

b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
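As a sanity check, the TITLE bytes in that snippet do decode cleanly as GBK (a quick standalone check, independent of Scrapy/w3lib):

```python
# The TITLE bytes from the snippet above, decoded as GBK.
# A clean decode confirms the page content really is GBK-encoded.
title = b'\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc'
print(title.decode('gbk'))  # 网站地图 ("site map")
```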

my test :

>>> from w3lib.encoding import html_body_declared_encoding
>>> b
b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
>>> html_body_declared_encoding(b)
>>> enc = html_body_declared_encoding(b)
>>> enc
>>> print('"%s"' % enc)
"None"
>>> soup = BeautifulSoup(b)
>>> soup.title
<title>网站地图</title>
>>> soup.original_encoding
'gbk'
>>>
kostalski commented 4 years ago

Hi @samuelchen @Gallaecio ,

The source of the encoding-detection problem seems to be the invalid input HTML itself, not w3lib. The meta tag is invalid: the page has <meta httpequiv="ContentType" ..., but to be valid per the W3C it should be <meta http-equiv="Content-Type" ... (the dash characters are missing). Because of that, w3lib does not detect the declared encoding.

beautifulsoup4 detects the 'gbk' encoding because it uses a naive regex as fallback encoding detection (lib: beautifulsoup4, file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').

For @samuelchen's problem, w3lib could be updated to be more forgiving/lenient. In w3lib/encoding.py, change: _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type') to: _HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type') (the "-?" makes each dash optional).
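The effect of the optional dashes can be illustrated with a minimal standalone regex (a simplified stand-in for w3lib's _TEMPLATE-based pattern, not the actual w3lib code):

```python
import re

# Simplified sketch of a lenient http-equiv pattern: "-?" makes the
# dashes optional, so the malformed httpequiv/ContentType variant
# from the reported page also matches.
_LENIENT_HTTPEQUIV_RE = re.compile(
    rb'<meta[^>]+http-?equiv\s*=\s*["\']?Content-?Type["\']?'
    rb'[^>]+charset\s*=\s*["\']?([\w-]+)',
    re.IGNORECASE,
)

tag = b'<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
m = _LENIENT_HTTPEQUIV_RE.search(tag)
print(m.group(1))  # b'gbk'
```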

With this change, w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;) What do you think, @Gallaecio?

More details below.


Details

I was able to reproduce the issue with the provided input:

Test python script:

from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))

Script output:

html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk

BeautifulSoup detects the encoding only with the 'lxml' parser, via its fallback encoding detection (lib: beautifulsoup4, file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').
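That fallback can be reproduced in isolation. The pattern string below is the one quoted from bs4/dammit.py, applied here outside BeautifulSoup as a sketch; it only looks for "charset=" and therefore does not care that http-equiv/Content-Type are malformed:

```python
import re

# Pattern string as quoted from bs4/dammit.py; it keys on "charset="
# alone, so the malformed http-equiv attribute is irrelevant to it.
html_meta = re.compile('<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]',
                       re.IGNORECASE)

tag = '<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
print(html_meta.search(tag).group(1))  # gbk
```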

samuelchen commented 3 years ago

@kostalski Thank you for the feedback. I cannot recall why that HTML had httpequiv="ContentType". I am not sure whether it was converted by some other part of Scrapy or was like that originally; sorry, it was too long ago to remember. By the way, GB18030 is backward compatible with GBK.

kostalski commented 3 years ago

Ok @samuelchen, no problem :+1: