python / cpython

The Python programming language
https://www.python.org

Support arbitrary code page encodings on Windows #123803

Open serhiy-storchaka opened 2 weeks ago

serhiy-storchaka commented 2 weeks ago

Feature or enhancement

Python supports encodings that correspond to some Windows code pages, such as cp437 or cp1252, but each such encoding has to be implemented separately, and there are code pages for which Python has no corresponding codec.

However, there are functions that allow encoding and decoding with an arbitrary code page: codecs.code_page_encode() and codecs.code_page_decode(). The only step left is to expose them as encodings, so that they can be used in str.encode() and bytes.decode().
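For reference, these two helpers exist only on Windows; they take the code page number as their first argument and, like other stateful codec APIs, return a (result, length consumed) tuple. A minimal sketch (the cp437 sample text is my own, not from the issue):

```python
import codecs
import sys

if sys.platform == "win32":
    # Encode with an arbitrary code page number rather than a codec name.
    data, consumed = codecs.code_page_encode(437, "café")
    # final=True tells the decoder not to expect trailing multibyte data.
    text, used = codecs.code_page_decode(437, data, "strict", True)
    print(text)
else:
    # These functions simply do not exist on other platforms.
    print(hasattr(codecs, "code_page_encode"))  # False
```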

Currently this mechanism is already used for the current Windows (ANSI) code page: if the cpXXX encoding is not implemented in Python and XXX matches the value returned by GetACP(), "cpXXX" is made an alias of the "mbcs" codec.
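To make the current behavior concrete: codecs.lookup() resolves the pure-Python code page codecs on every platform, while a cpXXX name with no Python implementation only resolves on a Windows machine whose ANSI code page happens to be XXX. A quick illustration, reusing the cp1234 name from later in this thread (cp1234 is not a real Windows code page, so its lookup fails everywhere today):

```python
import codecs

# cp1252 has a pure-Python codec, so the lookup succeeds on every platform.
print(codecs.lookup("cp1252").name)  # cp1252

# A cpXXX name with no Python codec fails unless it matches GetACP()
# on Windows; cp1234 does not exist as a code page at all.
try:
    codecs.lookup("cp1234")
except LookupError as exc:
    print(exc)  # unknown encoding: cp1234
```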

I propose to add support for arbitrary cpXXX encodings on Windows: if such an encoding is not implemented directly, fall back to the Windows-specific API.

Linked PRs

malemburg commented 2 weeks ago

FWIW, I don't think this is a good idea, since we'd lose cross-platform compatibility if codecs are only available on Windows and not on other platforms.

Overall, and as already stated in https://github.com/python/cpython/issues/123489, I don't think we should add more encodings to the stdlib set. Support for more esoteric encodings can easily be added via PyPI packages, if needed.

The stdlib already has good support for many encodings, and we don't really need more. We should only add new ones if new Unicode-related codecs get standardized, e.g. new transfer encodings. At the moment the world is moving towards UTF-8 as the one and only encoding, and that's good. In the future, other, more efficient transfer encodings may emerge, and we should be open to adding those; but for classic encodings we currently do not support, I think people can use either external tools such as iconv or one of the available PyPI packages.

serhiy-storchaka commented 1 week ago

This cat is already out of the bag: if cpXXX is not defined in Python but matches the current Windows code page, it is currently mapped as an alias of the "mbcs" codec. So the set of supported codecs already depends on the environment. I propose to make it less environment-dependent -- to provide the same set of encodings on all Windows machines (or almost the same, as it may depend on the Windows version).

There are many more encodings in the world than Python supports, and it is not realistic to include codecs for all of them. But some codecs are already here, provided by the OS. I think we should use codecs provided by the OS for better interoperability within the platform, and also provide a set of codecs for inter-platform interoperability. We could even remove some codecs implemented in Python if they are well supported on all maintained platforms (though this is unlikely, because our own implementation may be more efficient). At the least, it will help us to reject requests for adding new codecs.

zooba commented 1 week ago

Yeah, I understand MAL's concern, but I'm inclined to agree with Serhiy on this one. Specifically:

I think we should use codecs provided by the OS for better interoperability within the platform

My request would be (and I haven't checked the PR to see if it's there yet) that we have an error message which clearly suggests the encoding is not available on this platform (it doesn't have to name the actual platform), as opposed to merely saying that the encoding doesn't exist. If it's easy, having it be different from the generic error would be great, e.g. (on POSIX):

>>> s.encode('cp1234')
LookupError: encoding cp1234 is not available on this platform
>>> s.encode('spamalot')
LookupError: unknown encoding: spamalot
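The distinction requested above could be made in a codec search function. The sketch below is hypothetical (the function name and error text are mine, not taken from the linked PR): codecs.register() lets a search function claim cpXXX names after the built-in lookup has failed, and an exception raised there propagates out of str.encode(), so it can carry a platform-specific message while non-cpXXX names still get the generic one.

```python
import codecs
import re
import sys

def cp_search(encoding):
    """Hypothetical search function: resolve cpXXX via the OS on Windows,
    and raise a platform-specific LookupError everywhere else."""
    if re.fullmatch(r"cp\d+", encoding) is None:
        return None  # not a cpXXX name; let other search functions handle it
    if sys.platform != "win32":
        raise LookupError(
            f"encoding {encoding} is not available on this platform")
    # On Windows, one would build a codecs.CodecInfo here around
    # codecs.code_page_encode()/code_page_decode() for int(encoding[2:]).
    return None

codecs.register(cp_search)

try:
    "spam".encode("cp1234")
except LookupError as exc:
    print(exc)  # on POSIX: encoding cp1234 is not available on this platform

try:
    "spam".encode("spamalot")
except LookupError as exc:
    print(exc)  # unknown encoding: spamalot
```

Implemented encodings such as cp1252 are unaffected, because the built-in encodings search function is registered first and resolves them before this one runs.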