scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License

Python's gb18030 decoder is not the same as w3c's #76

Open HyperHCl opened 7 years ago

HyperHCl commented 7 years ago

https://www.w3.org/TR/encoding/#gb18030-decoder specifies a single-byte special case 0x80 → U+20AC for gbk compatibility, but Python's decoder does not perform this translation.
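
The difference can be reproduced, and worked around locally, with a custom error handler. This is a minimal sketch, not part of w3lib or the standard library; the handler name `whatwg_gb18030_0x80` and the helper `decode_gb18030_web` are hypothetical. It substitutes U+20AC only where Python's decoder rejects a 0x80 byte, so 0x80 used as a valid trail byte of a two-byte sequence is left alone. All other errors are still raised, whereas the spec's decoder would emit U+FFFD for them.

```python
import codecs


def _gb18030_0x80_handler(exc):
    """Map a stray 0x80 byte to U+20AC; re-raise every other error."""
    if isinstance(exc, UnicodeDecodeError) and exc.object[exc.start] == 0x80:
        # Consume exactly one byte and resume decoding right after it.
        return "\u20ac", exc.start + 1
    raise exc


# Hypothetical handler name, chosen for this sketch only.
codecs.register_error("whatwg_gb18030_0x80", _gb18030_0x80_handler)


def decode_gb18030_web(data: bytes) -> str:
    """Decode gb18030 with the single-byte 0x80 special case applied."""
    return data.decode("gb18030", errors="whatwg_gb18030_0x80")


# Python's strict decoder rejects a lone 0x80 byte.
try:
    b"\x80".decode("gb18030")
except UnicodeDecodeError as err:
    print("strict decoder:", err.reason)

print(decode_gb18030_web(b"a\x80b"))  # a€b
```

The handler is only consulted when the built-in decoder fails, so well-formed gb18030 input decodes exactly as before.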

redapple commented 7 years ago

@HyperHCl, I'm not sure this is the right place to report this decoding issue. Have you submitted it to the Python core developers?

HyperHCl commented 7 years ago

Well, it's fairly clear that Python upstream will not accept this issue: they usually try to support the original national standard, not a W3C/WHATWG web standard. Python's codecs are quite pedantic; cf. ftfy's "sloppy" encodings. To Python this problem is just the world doing things The Wrong Way, but for the codecs to be useful on web content, they have to be as wrong as the rest of the world.

redapple commented 7 years ago

@HyperHCl, I see. But where does this fit into w3lib?

HyperHCl commented 7 years ago

By Googling for "whatwg encoding python" I found an implementation of that standard called webencodings. I haven't actually verified how well it works (or whether it works at all), though. Uh, oops... It only provides a table of aliases that still point to Python's own windows-1252 and gb18030 codecs. Sounds like it's time to invent a wheel: say, w3lib.codecs, or just a separate w3codecs.

Implementations for each codec in question:

openandclose commented 4 years ago

Since this thread is labeled as discussion...

I think many Python web applications face this problem.

That is, since Python's codecs follow the unicode.org specs, each developer has to invent their own way of supporting the web's 'sloppy' encodings.
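
windows-1252 shows the same gap: Python's `cp1252` leaves five bytes undefined, while the WHATWG index maps each of them to the C1 control character with the same number. A rough sketch of the kind of per-project workaround this leads to follows; the handler name `web_cp1252_gaps` and the helper `decode_windows_1252_web` are made up for illustration and do not come from any existing library.

```python
import codecs

# Bytes that Python's cp1252 codec treats as undefined.
_CP1252_GAPS = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def _cp1252_gap_handler(exc):
    """Map an undefined cp1252 byte to the code point of the same value."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if byte in _CP1252_GAPS:
            return chr(byte), exc.start + 1
    raise exc


# Hypothetical handler name, chosen for this sketch only.
codecs.register_error("web_cp1252_gaps", _cp1252_gap_handler)


def decode_windows_1252_web(data: bytes) -> str:
    """Decode windows-1252 the way web content expects, filling the gaps."""
    return data.decode("cp1252", errors="web_cp1252_gaps")


print(repr(decode_windows_1252_web(b"caf\xe9 \x81")))
```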

ftfy solves part of the problem, but simply creating codecs that follow encoding.spec.whatwg.org seems the obvious solution, and ftfy's author, @rspeer, has himself proposed including them in the stdlib. https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

But aside from the stdlib discussion, I couldn't find any other third-party libraries or popular solutions, nor any document or evidence saying it isn't worth doing, if that is the case. (At least w3lib doesn't do anything about it.)

What are people thinking and doing?