python / cpython

The Python programming language
https://www.python.org

codecs missing: base64 bz2 hex zlib hex_codec ... #51724

Closed eda57068-96ad-4b33-8431-9c528f59a6a6 closed 10 years ago

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago
BPO 7475
Nosy @malemburg, @loewis, @warsaw, @birkenfeld, @gpshead, @jcea, @cben, @ncoghlan, @abalkin, @vstinner, @benjaminp, @jwilk, @ezio-melotti, @merwok, @bitdancer, @ssbarnea, @florentx, @akheron, @serhiy-storchaka, @phmc
Dependencies
  • bpo-17828: More informative error handling when encoding and decoding
  • bpo-17839: base64 module should use memoryview
  • bpo-17844: Add link to alternatives for bytes-to-bytes codecs
  • Files
  • issue7475_warning.diff: Patch for documentation and warnings in 2.7
  • issue7475_missing_codecs_py3k.diff: Patch, apply to trunk
  • issue7475_restore_codec_aliases_in_py34.diff: Patch to restore the transform aliases.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/ncoghlan'
    closed_at = 
    created_at = 
    labels = ['type-feature', 'library', 'expert-unicode']
    title = 'codecs missing: base64 bz2 hex zlib hex_codec ...'
    updated_at = 
    user = 'https://github.com/florentx'
    ```

    bugs.python.org fields:

    ```python
    activity = 
    actor = 'python-dev'
    assignee = 'ncoghlan'
    closed = True
    closed_date = 
    closer = 'python-dev'
    components = ['Library (Lib)', 'Unicode']
    creation = 
    creator = 'flox'
    dependencies = ['17828', '17839', '17844']
    files = ['15523', '15526', '32663']
    hgrepos = []
    issue_num = 7475
    keywords = ['patch']
    message_count = 95.0
    messages = ['96218', '96223', '96226', '96227', '96228', '96232', '96236', '96237', '96240', '96242', '96243', '96251', '96253', '96265', '96277', '96295', '96296', '96301', '96374', '96632', '106669', '106670', '106674', '107057', '107794', '109872', '109876', '109879', '109894', '109904', '109905', '123090', '123154', '123206', '123435', '123436', '123462', '123693', '125073', '145246', '145656', '145693', '145897', '145900', '145979', '145980', '145982', '145986', '145991', '145998', '149439', '153304', '153317', '164224', '164226', '164237', '165435', '170414', '187630', '187631', '187634', '187636', '187638', '187644', '187649', '187651', '187652', '187653', '187660', '187668', '187670', '187673', '187676', '187695', '187696', '187698', '187701', '187702', '187705', '187707', '187764', '187770', '198845', '198846', '202130', '202264', '202515', '203124', '203378', '203751', '203936', '203942', '203944', '207283', '213502']
    nosy_count = 22.0
    nosy_names = ['lemburg', 'loewis', 'barry', 'georg.brandl', 'gregory.p.smith', 'jcea', 'cben', 'ncoghlan', 'belopolsky', 'vstinner', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'ssbarnea', 'flox', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka', 'pconnell', 'isoschiz']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue7475'
    versions = ['Python 3.4']
    ```

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    AFAIK these codecs were not ported to Python 3.

    1. I found no hint in documentation on this matter.

    2. Is it possible to contribute some of them, or there's a good reason to look elsewhere?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    These are not encodings, in that they don't convert characters to bytes. It was a mistake that they were integrated into the codecs interfaces in Python 2.x; this mistake is corrected in 3.x.

    malemburg commented 14 years ago

    Martin v. Löwis wrote:

    > These are not encodings, in that they don't convert characters to bytes. It was a mistake that they were integrated into the codecs interfaces in Python 2.x; this mistake is corrected in 3.x.

    Martin, I beg your pardon, but these codecs indeed implement valid encodings and the fact that these codecs were removed was a mistake.

    They should be readded to Python 3.x.

    Note that just because a codec doesn't convert only between bytes and characters doesn't make it wrong in any way. The codec architecture in Python is designed to support same-type encodings just as well as conversions between bytes and characters.

    malemburg commented 14 years ago

    Reopening the ticket.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    It's not possible to add these codecs back. Bytes objects (correctly) don't have an encode method, and string objects (correctly) don't have a decode method. The codec architecture of Python 3.x just doesn't support this kind of application; the codec architecture of 2.x was flawed.

    benjaminp commented 14 years ago

    I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal strictly with unicode -> bytes.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    «Everything you thought you knew about binary data and Unicode has changed.»

    Reopening for the documentation part.

    This "mistake" deserves some words in the documentation: docs.python.org/dev/py3k/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

    And perhaps the conversion could be automated with 2to3.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Is it possible to add "DeprecationWarning" for these codecs when using "python -3" ?

    >>> {}.has_key('a')
    __main__:1: DeprecationWarning: dict.has_key() not supported in 3.x;
                use the in operator
    False
    >>> print `123`
    <stdin>:1: SyntaxWarning: backquote not supported in 3.x; use repr()
    123
    >>> 'abc'.encode('base64')
    'YWJj\n'
    malemburg commented 14 years ago

    Martin v. Löwis wrote:

    > It's not possible to add these codecs back. Bytes objects (correctly) don't have an encode method, and string objects (correctly) don't have a decode method. The codec architecture of Python 3.x just doesn't support this kind of application; the codec architecture of 2.x was flawed.

    Of course it does support these kinds of codecs. The codec architecture hasn't changed between 2.x and 3.x, just the way a few methods work.

    All we agreed to is that unicode.encode() will only return bytes, while bytes.decode() will only return unicode. So the methods won't support same type conversions, because Guido didn't want to have methods that return different types based on the chosen parameter (the codec name in this case).

    However, you can still use codecs.encode() and codecs.decode() to work with codecs that return different combinations of types. I explicitly added that support back to 3.0.

    You can't argue that just because two methods don't support a certain type combination, the whole architecture doesn't support this anymore.

    Also note that codecs allow a much more far-reaching use than just through the unicode and bytes methods: you can use them as seamless wrappers for streams, subclass from them, use their methods directly, etc. etc.

    So the argument that these two methods no longer support these codecs is just not good enough to warrant their removal.
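MAL's point that codecs can be used well beyond the two convenience methods can be illustrated with the standard utf-8 codec; a minimal sketch using the incremental-encoder interface that `codecs.lookup()` exposes:

```python
import codecs

# CodecInfo carries stream and incremental interfaces,
# not just the plain encode/decode pair.
info = codecs.lookup("utf-8")
enc = info.incrementalencoder()
chunks = [enc.encode("Py"), enc.encode("thon", final=True)]
print(b"".join(chunks))  # b'Python'
```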

    malemburg commented 14 years ago

    Benjamin Peterson wrote:

    > I agree with Martin. gzip and bz2 convert bytes to bytes. Encodings deal strictly with unicode -> bytes.

    Sorry, Benjamin, but that's simply not true.

    Codecs can work with arbitrary types, it's just that the helper methods on unicode and bytes objects only support one combination of types in Python 3.x.

    codecs.encode()/.decode() provide access to all codecs, regardless of their supported type combinations and of course, you can use them directly via the codec registry, subclass from them, etc.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Thinking about it, I am +1 to reimplement the codecs.

    We could implement new methods to replace the old ones (similar to base64.encodebytes and base64.decodebytes):

    >>> b'abc'.encodebytes('base64')
    b'YWJj\n'
    >>> b'abc'.encodebytes('zlib').encodebytes('base64')
    b'eJxLTEoGAAJNASc=\n'
    >>> b'UHl0aG9u'.decodebytes('base64').decode('utf-8')
    'Python'
    benjaminp commented 14 years ago

    Marc-Andre Lemburg wrote:

    > codecs.encode()/.decode() provide access to all codecs, regardless of their supported type combinations and of course, you can use them directly via the codec registry, subclass from them, etc.

    Didn't you have a proposal for bytes.transform/untransform for operations like this?

    malemburg commented 14 years ago

    Benjamin Peterson wrote:

    > Didn't you have a proposal for bytes.transform/untransform for operations like this?

    Yes. At the time it was postponed, since I brought it up late in the 3.0 release process. Perhaps I should bring it up again.

    Note that those methods are just convenient helpers to access the codecs and as such only provide limited functionality.

    The full machinery itself is accessible via the codecs module and the code in the encodings package. Any decision to include a codec or not needs to be based on whether it fits the framework in those modules/packages, not the functionality we expose on unicode and bytes objects.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    I've ported the codecs from Py2: base64, bytes_escape, bz2, hex, quopri, rot13, uu and zlib

    It's not a big deal. Basically:

    Will add documentation if we agree on the feature.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    > codecs.encode()/.decode() provide access to all codecs, regardless of their supported type combinations and of course, you can use them directly via the codec registry, subclass from them, etc.

    I presume that the OP didn't talk about codecs.encode, but about the methods on string objects. flox, can you clarify what precisely it is that you miss?

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Martin,

    actually, I was trying to convert some piece of code from python2 to python3. And this statement was not converted by 2to3: "x.decode('base64').decode('zlib')"

    So, I read the official documentation, and found no hint about the removal of these codecs. For my specific use case, I can use "zlib.decompress" and "base64.decodebytes", but I find that the ".encode()" and ".decode()" helpers were useful in Python 2.
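    For reference, the Py2 chain `x.decode('base64').decode('zlib')` maps onto those two modules like this (a sketch):

```python
import base64
import zlib

# Python 2: data.encode('zlib').encode('base64')
x = base64.encodebytes(zlib.compress(b"Python"))

# Python 2: x.decode('base64').decode('zlib')
result = zlib.decompress(base64.decodebytes(x))
print(result)  # b'Python'
```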

    I don't know all the background of the removal of these codecs. But I try to contribute to Python, and help Python 3 become at least as featureful, and useful, as Python 2.

    So, after reading the above comments, I think we may end up with the following changes:

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    > And this statement was not converted

    s/this statement/this method call/

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    > So, after reading the above comments, I think we may end up with following changes:
    >
    > • restore the "bytes-to-bytes" codecs in the "encodings" package
    > • then create new helpers on bytes objects (either ".transform()/.untransform()" or ".encodebytes()/.decodebytes()")

    I would still be opposed to such a change, and I think it needs a PEP. If the codecs are restored, one half of them becomes available to .encode/.decode methods, since the codec registry cannot tell which ones implement real character encodings, and which ones are other conversion methods. So adding them would be really confusing.

    I also wonder why you are opposed to the import statement. My recommendation is indeed that you use the official API for these libraries (and indeed, there is an official API for each of them, unlike real codecs, which don't have any other documented API).

    malemburg commented 14 years ago

    Martin v. Löwis wrote:

    > So, after reading the above comments, I think we may end up with following changes:
    >
    > * restore the "bytes-to-bytes" codecs in the "encodings" package

    +1

    > * then create new helpers on bytes objects (either ".transform()/.untransform()" or ".encodebytes()/.decodebytes()")

    +1 - the names are still up for debate, IIRC.

    > I would still be opposed to such a change, and I think it needs a PEP.

    All this has already been discussed and the only reason it didn't go in earlier was timing. No need for a PEP.

    > If the codecs are restored, one half of them becomes available to .encode/.decode methods, since the codec registry cannot tell which ones implement real character encodings, and which ones are other conversion methods. So adding them would be really confusing.

    Not at all. The helper methods check the return types and raise an exception if the types don't match the expected types.

    The codecs registry itself doesn't need to know about the possible input/output types of codecs, since this information is not required to match a name to an implementation.

    What we could do is add that information to the CodecInfo object used for registering the codec. codecs.lookup() would then return the information to the application.

    E.g.

    .encode_input_types = (str,)
    .encode_output_types = (bytes,)
    .decode_input_types = (bytes,)
    .decode_output_types = (str,)

    Codecs not supporting these CodecInfo attributes would simply return None.

    > I also wonder why you are opposed to the import statement. My recommendation is indeed that you use the official API for these libraries (and indeed, there is an official API for each of them, unlike real codecs, which don't have any other documented API).

    That's not the point. The codec API provides a standardized API for all these encodings. The hex, zlib, bz2, etc. codecs are just adapters of the different pre-existing APIs to the codec API.

    birkenfeld commented 14 years ago

    I also seem to recall that adding .transform()/.untransform() was already accepted at some point.

    vstinner commented 14 years ago

    I agree with Martin: codecs took the wrong direction in Python 2, and it's fixed in Python 3. The codecs module is about character sets (encodings): it should encode str to bytes, and decode bytes (or any read buffer) to str.

    E.g. rot13 "encodes" str to str.

    "base64 bz2 hex zlib ...": use base64, bz2, binascii and zlib modules for that.

    The documentation should be fixed (explain how to port code from Python2 to Python3).

    It may be possible to write 2to3 fixers for the following patterns:

        "...".encode("base64") => base64.b64encode("...")
        "...".encode("rot13")  => do nothing (but display a warning?)
        "...".encode("zlib")   => zlib.compress("...")
        "...".encode("hex")    => base64.b16encode("...")
        "...".encode("bz2")    => bz2.compress("...")

        "...".decode("base64") => base64.b64decode("...")
        "...".decode("rot13")  => do nothing (but display a warning?)
        "...".decode("zlib")   => zlib.decompress("...")
        "...".decode("hex")    => base64.b16decode("...")
        "...".decode("bz2")    => bz2.decompress("...")

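The mappings above can be checked against the stdlib directly; a sketch (rot13 has no module equivalent, but it survives in Python 3 as a str-to-str codec):

```python
import base64
import bz2
import codecs
import zlib

data = b"hello"
# base64 codec -> base64 module
assert base64.b64decode(base64.b64encode(data)) == data
# zlib codec -> zlib module
assert zlib.decompress(zlib.compress(data)) == data
# bz2 codec -> bz2 module
assert bz2.decompress(bz2.compress(data)) == data
# hex codec -> base64.b16encode/b16decode (uppercase hex)
assert base64.b16decode(base64.b16encode(data)) == data
# rot13 stays a text transform
assert codecs.decode(codecs.encode("abc", "rot_13"), "rot_13") == "abc"
```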
    vstinner commented 14 years ago

    Guido explained the change in Python 3:

    "We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."

    http://www.artima.com/weblogs/viewpost.jsp?thread=208549

    --

    See also issue bpo-8838.

    malemburg commented 14 years ago

    STINNER Victor wrote:

    > I agree with Martin: codecs choosed the wrong direction in Python2, and it's fixed in Python3. The codecs module is related to charsets (encodings), should encode str to bytes, and should decode bytes (or any read buffer) to str.

    No, that's just not right: the codec system in Python does not mandate the types used or accepted by the codecs.

    The only change that was applied in Python3 was to make sure that the str.encode() and bytes.decode() methods always return the same type to assure type-safety.

    Python2 does not apply that check, but instead provides a direct interface to codecs.encode() and codecs.decode().

    Please don't mix the helper methods on those objects with what the codec system was designed for. The helper methods apply a strategy that's more constrained than the codec system.

    The addition of .transform() and .untransform() for same type conversions was discussed in 2008, but didn't make it into 3.0 since I hadn't had time to add the methods:

    http://mail.python.org/pipermail/python-3000/2008-August/014533.html
    http://mail.python.org/pipermail/python-3000/2008-August/014534.html

    The removed codecs don't rely on the helper methods in any way. They are easily usable via codecs.encode() and codecs.decode() even without .transform() and .untransform().

    Esp. the hex codec is very handy and at least in our eGenix code base in wide-spread use. Using a single well-defined interface to such encodings is just much more user friendly than having to research the different APIs for each of them.
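In today's Python 3 (the bytes-to-bytes codecs ended up being restored, reachable through codecs.encode()/codecs.decode() under their `*_codec` names), this works as MAL describes; a sketch:

```python
import codecs

# bytes -> bytes through the codec machinery, no str involved
assert codecs.encode(b"abc", "hex_codec") == b"616263"
assert codecs.decode(b"616263", "hex_codec") == b"abc"

# same for compression codecs
packed = codecs.encode(b"abc" * 100, "zlib_codec")
assert codecs.decode(packed, "zlib_codec") == b"abc" * 100
```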

    merwok commented 14 years ago

    Related: bytes vs. str for base64 encoding in email, bpo-8896

    0d272f2d-ac69-44ce-900d-8b7d0114cb9d commented 14 years ago

    I would like to know what happened with hex_codec and what the new Python 3 way to do this is.

    Also, it would be really helpful to see DeprecationWarnings for all these codecs in Python 2.x, and to include a note in the Python 3 changelog.

    The official python documentation from http://docs.python.org/library/codecs.html lists them as valid without any signs of them as being dropped or replaced.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    > I would like to know what happened with hex_codec and what is the new py3 for this.

    If you had read this bug report, you'd know that the codec was removed in Python 3. Use binascii.hexlify/binascii.unhexlify instead (as you should in 2.x, also).
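The suggested replacement, shown for completeness:

```python
import binascii

# binascii.hexlify/unhexlify replace the Py2 hex codec
assert binascii.hexlify(b"abc") == b"616263"
assert binascii.unhexlify(b"616263") == b"abc"
```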

    malemburg commented 14 years ago

    Martin v. Löwis wrote:

    > > I would like to know what happened with hex_codec and what is the new py3 for this.
    >
    > If you had read this bug report, you'd know that the codec was removed in Python 3. Use binascii.hexlify/binascii.unhexlify instead (as you should in 2.x, also).

    ... or wait for Python 3.2 which will readd them :-)

    birkenfeld commented 14 years ago

    ... but don't wait too long to add them!

    malemburg commented 14 years ago

    Georg Brandl wrote:

    > ... but don't wait too long to add them!

    I plan to work on that after EuroPython. Florent already provided the patch for the codecs, so what's left is adding the .transform()/ .untransform() methods, and perhaps tweak the codec input/output types in a couple of cases.

    merwok commented 14 years ago

    I am confused by MvL’s reply. From the first paragraph documentation for binascii: “Normally, you will not use these functions directly but use wrapper modules like uu, base64, or binhex instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.”

    Is the doc not accurate?

    Also, can someone who is sure about the status of this report edit the type, stage, component and resolution? It would be helpful.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 14 years ago

    > I am confused by MvL’s reply. From the first paragraph documentation for binascii: “Normally, you will not use these functions directly but use wrapper modules like uu, base64, or binhex instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.”
    >
    > Is the doc not accurate?

    It is correct. So use base64.b16encode/b16decode then. It's just that I personally prefer hexlify/unhexlify, because I can memorize the function name better.
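The two spellings differ mainly in case handling, which may explain the preference; a sketch:

```python
import base64
import binascii

assert binascii.hexlify(b"\xff\x01") == b"ff01"    # lowercase
assert base64.b16encode(b"\xff\x01") == b"FF01"    # uppercase
assert base64.b16decode(b"FF01") == b"\xff\x01"
# b16decode is strict about case unless told otherwise
assert base64.b16decode(b"ff01", casefold=True) == b"\xff\x01"
```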

    birkenfeld commented 13 years ago

    Codecs brought back and (un)transform implemented in r86934.

    abalkin commented 13 years ago

    I am probably a bit late to this discussion, but why these things should be called "codecs" and why should they share the registry with the encodings? It looks like the proper term would be "transformations" or "transforms".

    malemburg commented 13 years ago

    Alexander Belopolsky wrote:

    > I am probably a bit late to this discussion, but why should these things be called "codecs", and why should they share the registry with the encodings? It looks like the proper term would be "transformations" or "transforms".

    .transform() is just the name of the method. The codecs are still just that: codecs, i.e. objects that encode and decode data. The types they support are defined by the codecs, not by the helper methods.

    In Python3, the str and bytes methods .encode() and .decode() will only support str->bytes->str conversions. The new str and bytes .transform() method adds back str->str and bytes->bytes.

    The codec subsystem does not impose restrictions on the type combinations a codec can support, and that's per design.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 13 years ago

    As per

    http://mail.python.org/pipermail/python-dev/2010-December/106374.html

    I think this checkin should be reverted, as it's breaking the language moratorium.

    birkenfeld commented 13 years ago

    I leave this to MAL, on whose behalf I finished this to be in time for beta.

    malemburg commented 13 years ago

    Martin v. Löwis wrote:

    > As per http://mail.python.org/pipermail/python-dev/2010-December/106374.html I think this checkin should be reverted, as it's breaking the language moratorium.

    I've asked Guido. We may have to revert the addition of the new methods and then readd them for 3.3, but I don't really see them as difficult to implement for the other Python implementations, since they are just interfaces to the codec sub-system.

    The readdition of the codecs and changes to support them in the codec system do not fall under the moratorium, since they are stdlib changes.

    abalkin commented 13 years ago

    With Georg's approval, I am reopening this issue until a decision is made on whether {str,bytes,bytearray}.{transform,untransform} methods should go into 3.2.

    I am adding Guido to "nosy" because the decision may turn on the interpretation of his post. [1]

    I also started a python-dev thread on this issue. [2]

    [1] http://mail.python.org/pipermail/python-dev/2010-December/106374.html [2] http://mail.python.org/pipermail/python-dev/2010-December/106617.html

    vstinner commented 13 years ago

    See issue bpo-10807: 'base64' can be used with bytes.decode() (and str.encode()), but it raises a confusing exception (TypeError: expected bytes, not memoryview).

    merwok commented 13 years ago

    So. This was reverted before 3.2 was out, right? What is the status for 3.3?

    vstinner commented 13 years ago

    What is the status of this issue?

    rot13 codecs & friends were added back to Python 3.2 with {bytes,str}.(un)transform() methods: commit 7e4833764c88. Codecs were disabled because of surprising error messages before the release of Python 3.2 final: issue bpo-10807, commit ff1261a14573. transform() and untransform() methods were also removed, I don't remember why/how exactly, maybe because new codecs were disabled.

    So we have rot13 & friends in Python 3.2 and 3.3, but they cannot be used with the regular str.encode('rot13'), you have to write (for example):

    >>> codecs.getdecoder('rot_13')('rot13')
    ('ebg13', 5)
    >>> codecs.getencoder('rot_13')('ebg13')
    ('rot13', 5)

    The major issue with {bytes,str}.(un)transform() is that we have only one registry for all codecs, and the registry was changed in Python 3 to ensure:

    To implement str.transform(), we need another registry. Marc-Andre suggested (msg96374) to add tags to codecs:

        .encode_input_types = (str,)
        .encode_output_types = (bytes,)
        .decode_input_types = (bytes,)
        .decode_output_types = (str,)

    I'm still opposed to str->str (rot13) and bytes->bytes (hex, gzip, ...) operations using the codecs API. Developers have to use the right module. If the API of these modules is too complex, we should add helpers to these modules, but not to builtin types. Builtin types have to be and stay simple and well defined.

    merwok commented 13 years ago

    > transform() and untransform() methods were also removed, I don't remember why/how exactly

    I don’t remember either; maybe it was too late in the release process, or we lacked enough consensus.

    > So we have rot13 & friends in Python 3.2 and 3.3, but they cannot be used with the regular str.encode('rot13'), you have to write (for example): codecs.getdecoder('rot_13')

    Ah, great, I thought they were not available at all!

    > The major issue with {bytes,str}.(un)transform() is that we have only one registry for all codecs, and the registry was changed in Python 3 [...] To implement str.transform(), we need another registry. Marc-Andre suggested (msg96374) to add tags to codecs

    I’m confused: does the tags idea replace the idea of adding another registry?

    > I'm still opposed to str->str (rot13) and bytes->bytes (hex, gzip, ...) operations using the codecs API. Developers have to use the right module.

    Well, here I disagree with you and agree with MAL: str.encode and bytes.decode are strict, but the codec API in general is not restricted to str→bytes and bytes→str directions. Using the zlib or base64 modules vs. the codecs is a matter of style; sometimes you think it looks hacky, sometimes you think it’s very handy. And rot13 only exists as a codec!

    ncoghlan commented 13 years ago

    They were removed because adding new methods to builtin types violated the language moratorium.

    Now that the language moratorium is over, the transform/untransform convenience APIs should be added again for 3.3. It's an approved change, the original timing was just wrong.

    ncoghlan commented 13 years ago

    Sorry, I meant to state my rationale for the unassignment - I'm assuming this issue is covered by MAL's recent decision to step away from Unicode and codec maintenance issues. If that's incorrect, MAL can reclaim the issue, otherwise unassigning leaves it open for whoever wants to move it forward.

    ncoghlan commented 13 years ago

    Some further comments after getting back up to speed with the actual status of this problem (i.e. that we had issues with the error checking and reporting in the original 3.2 commit).

    1. I agree with the position that the codecs module itself is intended to be a type neutral codec registry. It encodes and decodes things, but shouldn't actually care about the types involved. If that is currently not the case in 3.x, it needs to be fixed.

    This type neutrality was blurred in 2.x by the fact that it only implemented str->str translations, and even further obscured by the coupling to the .encode() and .decode() convenience APIs. The fact that the type neutrality of the registry itself is currently broken in 3.x is a *regression*, not an improvement. (The convenience APIs, on the other hand, are definitely *not* type neutral, and aren't intended to be.)

    2. To assist in producing nice error messages, and to allow restrictions to be enforced on type-specific convenience APIs, the CodecInfo objects should grow additional state as MAL suggests. To avoid redundancy (and inaccurate overspecification), my suggested colour for that particular bikeshed is:

        Character encoding codec:  .decoded_format = 'text'    .encoded_format = 'binary'
        Binary transform codec:    .decoded_format = 'binary'  .encoded_format = 'binary'
        Text transform codec:      .decoded_format = 'text'    .encoded_format = 'text'

    I suggest using the fuzzy format labels mainly due to the existence of the buffer API - most codec operations that consume binary data will accept anything that implements the buffer API, so referring specifically to 'bytes' in error messages would be inaccurate.

    The convenience APIs can then emit errors like:

    'a'.encode('rot_13') ==> CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

    'a'.decode('rot_13') ==> CodecLookupError: text <-> binary codec expected ('rot_13' is text <-> text)

    'a'.transform('bz2') ==> CodecLookupError: text <-> text codec expected ('bz2' is binary <-> binary)

    'a'.transform('ascii') ==> CodecLookupError: text <-> text codec expected ('ascii' is text <-> binary)

    b'a'.transform('ascii') ==> CodecLookupError: binary <-> binary codec expected ('ascii' is text <-> binary)

    For backwards compatibility with 3.2, codecs that do not specify their formats should be treated as character encoding codecs (i.e. decoded format is 'text', encoded format is 'binary')

    ncoghlan commented 13 years ago

    Oops, typo in my second error example. The command should be:

    b'a'.decode('rot_13')

    (Since str objects don't offer a decode() method any more)

    vstinner commented 13 years ago

    > *.encode('rot_13') ==> CodecLookupError

    I like the idea of raising a lookup error on .encode/.decode if the codec is not a classic text codec (like ASCII or UTF-8).

    > *.transform('ascii') ==> CodecLookupError

    Same comment.

    > str.transform('bz2') ==> CodecLookupError

    A lookup error is surprising here. It may be a TypeError instead. The bz2 codec can be used with .transform(), but not on str. So:

    (CodecLookupError doesn't exist; do you propose to define a new exception that inherits from LookupError?)

    ncoghlan commented 13 years ago

    On Thu, Oct 20, 2011 at 8:34 AM, STINNER Victor \report@bugs.python.org\ wrote:

    > > str.transform('bz2') ==> CodecLookupError
    >
    > A lookup error is surprising here. It may be a TypeError instead. The bz2 can be used with .transform, but not on str. So:

    No, it's the same concept as the other cases - we found a codec with the requested name, but it's not the kind of codec we wanted in the current context (i.e. str.transform). It may be that the problem is the user has a str when they expected to have a bytearray or a bytes object, but there's no way for the codec lookup process to know that.

    > - Lookup error if the codec cannot be used with encode/decode or transform/untransform
    > - Type error if the value type is invalid

    There's no way for str.transform to tell the difference between "I asked for the wrong codec" and "I expected to have a bytes object here, not a str object". That's why I think we need to think in terms of format checks rather than type checks.

    > (CodecLookupError doesn't exist, you propose to define a new exception who inherits from LookupError?)

    Yeah, and I'd have it handle the process of creating the nice error messages. I think it may even make sense to build the filtering options into codecs.lookup() itself:

      def lookup(encoding, decoded_format=None, encoded_format=None):
          info = _lookup(encoding)  # The existing codec lookup algorithm
          if ((decoded_format is not None and decoded_format != info.decoded_format) or
                  (encoded_format is not None and encoded_format != info.encoded_format)):
              raise CodecLookupError(info, decoded_format, encoded_format)

    Then the various encode, decode and transform methods can just pass the appropriate arguments to 'codecs.lookup' without all having to reimplement the format checking logic.
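A runnable sketch of this proposal; note that `CodecLookupError` and the per-codec format tags are hypothetical (the stdlib CodecInfo has no such attributes), so they are simulated with a side table here:

```python
import codecs

class CodecLookupError(LookupError):
    """Hypothetical exception from the proposal; not in the stdlib."""

# Hypothetical format tags, keyed by the codec's canonical name.
_FORMATS = {"ascii": ("text", "binary"), "rot-13": ("text", "text")}

def lookup(encoding, decoded_format=None, encoded_format=None):
    info = codecs.lookup(encoding)  # the existing lookup algorithm
    dec, enc = _FORMATS.get(info.name, ("text", "binary"))
    if (decoded_format not in (None, dec)
            or encoded_format not in (None, enc)):
        raise CodecLookupError(
            f"{decoded_format or dec} <-> {encoded_format or enc} codec "
            f"expected ({info.name!r} is {dec} <-> {enc})")
    return info

lookup("ascii", decoded_format="text", encoded_format="binary")  # ok
try:
    lookup("rot_13", decoded_format="text", encoded_format="binary")
except CodecLookupError as exc:
    print(exc)
```

This keeps the format check in one place, so the convenience methods would just pass the expected formats instead of reimplementing the logic.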

    vstinner commented 13 years ago

    > I think it may even make sense to build the filtering options into codecs.lookup() itself:
    >
    >     def lookup(encoding, decoded_format=None, encoded_format=None):
    >         info = _lookup(encoding)  # The existing codec lookup algorithm
    >         if ((decoded_format is not None and decoded_format != info.decoded_format) or
    >                 (encoded_format is not None and encoded_format != info.encoded_format)):
    >             raise CodecLookupError(info, decoded_format, encoded_format)

    lookup('rot13') should fail with a lookup error to keep backward compatibility. You can just change the default values to:

        def lookup(encoding, decoded_format='text', encoded_format='binary'):
            ...

    If you patch lookup, what about the following functions?

    ncoghlan commented 13 years ago

    I'm fine with people needing to drop down to the lower level lookup() API if they want the filtering functionality in Python code. For most purposes, constraining the expected codec input and output formats really isn't a major issue - we just need it in the core in order to emit sane error messages when people misuse the convenience APIs based on things that used to work in 2.x (like 'a'.encode('base64')).

    At the C level, I'd adjust _PyCodec_Lookup to accept the two extra arguments and add _PyCodec_EncodeText, _PyCodec_DecodeBinary, _PyCodec_TransformText and _PyCodec_TransformBinary to support the convenience APIs (rather than needing the individual objects to know about the details of the codec tagging mechanism).

    Making new codecs available isn't a backwards compatibility problem - anyone relying on a particular key being absent from an extensible registry is clearly doing the wrong thing.

    Regarding the particular formats, I'd suggest that hex, base64, quopri, uu, bz2 and zlib all be flagged as binary transforms, but rot13 be implemented as a text transform (Florent's patch has rot13 as another binary transform, but it makes more sense in the text domain - this should just be a matter of adjusting some of the data types in the implementation from bytes to str)