python / cpython

The Python programming language
https://www.python.org

codecs missing: base64 bz2 hex zlib hex_codec ... #51724

Closed eda57068-96ad-4b33-8431-9c528f59a6a6 closed 10 years ago

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago
BPO 7475
Nosy @malemburg, @loewis, @warsaw, @birkenfeld, @gpshead, @jcea, @cben, @ncoghlan, @abalkin, @vstinner, @benjaminp, @jwilk, @ezio-melotti, @merwok, @bitdancer, @ssbarnea, @florentx, @akheron, @serhiy-storchaka, @phmc
Dependencies
  • bpo-17828: More informative error handling when encoding and decoding
  • bpo-17839: base64 module should use memoryview
  • bpo-17844: Add link to alternatives for bytes-to-bytes codecs
  • Files
  • issue7475_warning.diff: Patch for documentation and warnings in 2.7
  • issue7475_missing_codecs_py3k.diff: Patch, apply to trunk
  • issue7475_restore_codec_aliases_in_py34.diff: Patch to restore the transform aliases.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = 'https://github.com/ncoghlan' closed_at = created_at = labels = ['type-feature', 'library', 'expert-unicode'] title = 'codecs missing: base64 bz2 hex zlib hex_codec ...' updated_at = user = 'https://github.com/florentx' ``` bugs.python.org fields: ```python activity = actor = 'python-dev' assignee = 'ncoghlan' closed = True closed_date = closer = 'python-dev' components = ['Library (Lib)', 'Unicode'] creation = creator = 'flox' dependencies = ['17828', '17839', '17844'] files = ['15523', '15526', '32663'] hgrepos = [] issue_num = 7475 keywords = ['patch'] message_count = 95.0 messages = ['96218', '96223', '96226', '96227', '96228', '96232', '96236', '96237', '96240', '96242', '96243', '96251', '96253', '96265', '96277', '96295', '96296', '96301', '96374', '96632', '106669', '106670', '106674', '107057', '107794', '109872', '109876', '109879', '109894', '109904', '109905', '123090', '123154', '123206', '123435', '123436', '123462', '123693', '125073', '145246', '145656', '145693', '145897', '145900', '145979', '145980', '145982', '145986', '145991', '145998', '149439', '153304', '153317', '164224', '164226', '164237', '165435', '170414', '187630', '187631', '187634', '187636', '187638', '187644', '187649', '187651', '187652', '187653', '187660', '187668', '187670', '187673', '187676', '187695', '187696', '187698', '187701', '187702', '187705', '187707', '187764', '187770', '198845', '198846', '202130', '202264', '202515', '203124', '203378', '203751', '203936', '203942', '203944', '207283', '213502'] nosy_count = 22.0 nosy_names = ['lemburg', 'loewis', 'barry', 'georg.brandl', 'gregory.p.smith', 'jcea', 'cben', 'ncoghlan', 'belopolsky', 'vstinner', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'ssbarnea', 'flox', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka', 'pconnell', 'isoschiz'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None 
type = 'enhancement' url = 'https://bugs.python.org/issue7475' versions = ['Python 3.4'] ```

    akheron commented 12 years ago

    bpo-13600 has been marked as a duplicate of this issue.

    FTR, +1 to the idea of adding encoded_format and decoded_format attributes to CodecInfo, and also to adding {str,bytes}.{transform,untransform} back.

    vstinner commented 12 years ago

    What is the status of this issue? Is there still a fan of this issue motivated to write a PEP, a patch or something like that?

    ncoghlan commented 12 years ago

    It's still on my radar to come back and have a look at it. Feedback from the web folks doing Python 3 migrations is that it would have helped them in quite a few cases.

    I want to get a couple of other open PEPs out of the way first, though (mainly 394 and 409)

    ncoghlan commented 12 years ago

    My current opinion is that this should be a PEP for 3.4, to make sure we flush out all the corner cases and other details correctly.

    ncoghlan commented 12 years ago

    For that matter, with the relevant codecs restored in 3.2, a transform() helper could probably be added to six (or a new project on PyPI) to prototype the approach.

    ncoghlan commented 12 years ago

    Setting as a release blocker for 3.4 - this is important.

    ncoghlan commented 12 years ago

    FWIW, I've been thinking further about this recently, and I think implementing this feature as builtin methods is the wrong way to approach it.

    Instead, I propose adding codecs.encode and codecs.decode functions that are type neutral (leaving any type checks entirely up to the codecs themselves), while the str.encode and bytes.decode methods retain their current strict text-model-related type restrictions.

    Also, I now think my previous proposal for nice error messages was massively over-engineered. A much simpler approach is to just replace the status quo:

    >>> "".encode("bz2_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ncoghlan/devel/py3k/Lib/encodings/bz2_codec.py", line 17, in bz2_encode
        return (bz2.compress(input), len(input))
      File "/home/ncoghlan/devel/py3k/Lib/bz2.py", line 443, in compress
        return comp.compress(data) + comp.flush()
    TypeError: 'str' does not support the buffer interface
    with a better error that carries more context, like:

    UnicodeEncodeError: encoding='bz2_codec', errors='strict', codec_error="TypeError: 'str' does not support the buffer interface"

    A similar change would be straightforward on the decoding side.

    This would be a good use case for __cause__, but the codec error should still be included in the string representation.
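    The function-based API being proposed here can be illustrated with a short sketch (as noted later in the thread, codecs.encode and codecs.decode have actually existed, undocumented, since 2004); a roundtrip through the bytes-to-bytes bz2 codec, using the full "bz2_codec" name since the short aliases were missing at the time:

```python
import codecs

# bytes -> bytes "encoding" via the type-neutral function API;
# the codec itself decides which input and output types it supports.
compressed = codecs.encode(b"hello world", "bz2_codec")
restored = codecs.decode(compressed, "bz2_codec")
assert restored == b"hello world"
```

    The same call would raise a TypeError for a str input, but from the codec itself rather than from a type check in the function API.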

    605dc02b-e158-40eb-af20-bd4c133e2b69 commented 12 years ago

    Many have chimed in on this topic but I thought I would lend my stance--for whatever it is worth.

    I also believe most of these do not fit the concept of a character codec, and some sort of transforms would likely be useful; however, most are somewhat specialized (e.g., there should probably be a generalized compression library interface a la hashlib):

    rot13: a (albeit simplistic) text cipher (str to str; though bytes to bytes could be argued, since many crypto functions work that way)

    zlib, bz2, etc. (lzma/xz should also be here): all bytes to bytes compression transforms

    hex(adecimal), uu, base64, etc.: these more or less fit the description of a character codec, as they map between bytes and str; however, I am not sure they are really the same thing, since these are basically doing a radix transformation to character symbols, and the mapping is not strictly from bytes to a single character and back, as a true character codec seems to imply. As evidenced by int(), format(), bytes.fromhex(), float.hex(), float.fromhex(), etc., these are more generalized conversions for serializing strings of bits into a textual representation (possibly for human consumption).

    I personally feel any \<type/class>.hex(), etc. method would be better off as a format() style formatter, if it is to exist in such a space at all (i.e., rather than in some more generalized conversion library, which we have, but which could probably use updating and cleanup since 3.x).

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 11 years ago

    Another rant, because it matters to many of us: http://lucumr.pocoo.org/2012/8/11/codec-confusion/

    IMHO, the solution of restoring str.decode and bytes.encode and raising TypeError for improper use is probably the most obvious for the average user.

    ezio-melotti commented 11 years ago

    -1. I see encoding as the process of going from text to bytes, and decoding as the process of going from bytes to text, so (ab)using these terms for other kinds of conversion is not an option IMHO.

    Anyway I think someone should write a PEP and list the possible options and their pro and cons, and then a decision can be taken on python-dev.

    FTR in Python 2 you can use decode for bytes->text, text->text, bytes->bytes, and even text->bytes:

    >>> u'DEADBEEF'.decode('hex')
    '\xde\xad\xbe\xef'
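    For comparison, a sketch of the closest Python 3 equivalent of that Python 2 trick, going through the codecs functions with bytes input and the full "hex_codec" name (the short aliases were missing at the time of this comment):

```python
import codecs

# Python 2's u'DEADBEEF'.decode('hex') becomes a function call in Python 3;
# hex_codec maps bytes <-> bytes rather than bytes <-> str.
raw = codecs.decode(b"DEADBEEF", "hex_codec")   # b'\xde\xad\xbe\xef'
back = codecs.encode(raw, "hex_codec")          # b'deadbeef' (lowercase)
```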

    bitdancer commented 11 years ago

    transform/untransform has approval-in-principle, adding encode/decode to the type that doesn't have them has been explicitly (and repeatedly :) rejected.

    (I don't know about anybody else, but at this point I have written code that assumes that if an object has an 'encode' method, calling it will get me a bytes, and vice versa with 'decode'...an assumption I know is not "safe", but that I feel is useful duck typing in the contexts in which I used it.)

    Nick wants a PEP, other people have said a PEP isn't necessary. What is certainly necessary is for someone to pick up the ball and run with it.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 11 years ago

    I am not a native English speaker, but it seems that the common usage of encode/decode is wider than the restricted definition applied in Python 3.3:

    Some examples:

    However, I acknowledge that there are valid reasons to choose a different verb too.

    ezio-melotti commented 11 years ago

    While not strictly necessary, a PEP would certainly be useful and would help in reaching a consensus. The PEP should provide a summary of the available options (transform/untransform, reintroducing encode/decode for bytes/str, maybe others), their intended behavior (e.g. is type(x.transform()) == type(x) always true?), and possible issues (e.g. should some transformations be limited to str or bytes? Should rot13 work with both transform and untransform?). Even if we all agreed on a solution, such a document would still be useful IMHO.

    ncoghlan commented 11 years ago

    +1 for someone stepping up to write a PEP on this if they would like to see the situation improved in 3.4.

    transform/untransform has at least one core developer with an explicit -1 on the proposal at the moment (me).

    We *definitely* need a generic object->object convenience API in the codecs module (codecs.decode, codecs.encode). I even accept that those two functions could be worthy of elevation to be new builtin functions.

    I'm *far* from convinced that awkwardly named methods that only handle str->object, bytes->object and bytearray->object are a good idea. Should memoryview gain transform/untransform methods as well?

    transform/untransform as proposed aren't even inverse operations, since they don't swap the valid input and output types (that is, transform is str/bytes/bytearray to arbitrary objects, while untransform is *also* str/bytes/bytearray to arbitrary objects. Inverses can't have a domain/range mismatch like that).

    Those names are also ambiguous about which one corresponds to "encoding" and which to "decoding". encode() and decode(), whether as functions in the codecs module or as builtins, have no such issue.

    Personally, the more I think about it, the more I'm in favour of adding encode and decode as builtin functions for 3.4. If you want arbitrary object->object conversions, use the builtins, if you want strict str->bytes or bytes/bytearray->str use the methods. Python 3 has been around long enough now, and Python 3.2 and 3.3 are sufficiently well known that I think we can add the full power builtins without people getting confused.

    bitdancer commented 11 years ago

    I was visualizing transform/untransform as being restricted to buffertype->bytes and stringtype->string, which at least for binascii-type transforms is all the modules support. After all, you don't get to choose what type of object you get back from encode or decode.

    A more generalized transformation (encode/decode) utility is also interesting, but how many non-string non-bytes transformations do we actually support?

    ncoghlan commented 11 years ago

    If transform is a method, how do you plan to accept arbitrary buffer supporting types as input?

    This is why I mentioned memoryview: it doesn't provide decode(), but there's no good reason you should have to copy the data from the view before decoding it. Similarly, you shouldn't have to make an unaltered copy before creating a compressed (or decompressed) copy.

    With codecs.encode and codecs.decode as functions, supporting memoryview as an input for bytes->str decoding, binary->bytes encoding (e.g. gzip compression) and binary->bytes decoding (e.g. gzip decompression) is trivial. Ditto for array.array and anything else that supports the buffer protocol.

    With transform/untransform as methods? No such luck.

    And once you're using functions rather than methods, it's best to define the API as object -> object, and leave any type constraints up to the individual codecs (with the error handling improved to provide more context and a more meaningful exception type, as I described earlier in the thread)
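    The memoryview point can be demonstrated concretely; a small sketch using the zlib codec (any object supporting the buffer protocol behaves the same way):

```python
import codecs

# A memoryview exposes the buffer protocol but has no decode() method,
# yet the function-based API accepts it without copying the data first.
view = memoryview(b"hello")
compressed = codecs.encode(view, "zlib_codec")                  # binary -> bytes
restored = codecs.decode(memoryview(compressed), "zlib_codec")  # binary -> bytes
assert restored == b"hello"
```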

    bitdancer commented 11 years ago

    I agree with you. transform/untransform are parallel to encode/decode, and I wouldn't expect them to exist on any type that didn't support either encode or decode. They are convenience methods, just as encode/decode are.

    I am also probably not invested enough in it to write the PEP :)

    gvanrossum commented 11 years ago

    str.decode() and bytes.encode() are not coming back.

    Any proposal had better take into account the API design rule that the *type* of a method's return value should not depend on the *value* of one of the arguments. (The Python 2 design failed this test, and that's why we changed it.)

    It is however fine to let the return type depend on one of the argument *types*. So e.g. bytes.transform(enc) -> bytes and str.transform(enc) -> str are fine. And so are e.g. transform(bytes, enc) -> bytes and transform(str, enc) -> str. But a transform() taking bytes that can return either str or bytes depending on the encoding name would be a problem.

    Personally I don't think transformations are so important or ubiquitous so as to deserve being made new bytes/str methods. I'd be happy with a convenience function, for example transform(input, codecname), that would have to be imported from somewhere (maybe the codecs module).

    My guess is that in almost all cases where people are demanding to say e.g.

      x = y.transform('rot13')

    the codec name is a fixed literal, and they are really after minimizing the number of imports. Personally, disregarding the extra import line, I think

      x = rot13.transform(y)

    looks better though. Such custom APIs also give the API designer (of the transformation) more freedom to take additional optional parameters affecting the transformation, offer a set of variants, or a richer API.
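    The type-preserving rule stated above can be sketched as a hypothetical wrapper; the transform name and its check here are illustrative only, not a proposed API:

```python
import codecs

def transform(value, name):
    # Hypothetical convenience function: delegate to the codec, then
    # enforce the rule that the output type matches the input type,
    # so the return type never depends on the codec name's *value*.
    result = codecs.encode(value, name)
    if type(result) is not type(value):
        raise TypeError("codec %r changed type %s -> %s"
                        % (name, type(value).__name__, type(result).__name__))
    return result

transform("hello", "rot_13")   # 'uryyb' (str -> str is allowed)
```

    Under this rule a bytes input to a bytes-to-str codec would fail loudly instead of silently returning a different type.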

    birkenfeld commented 11 years ago

    FWIW, I'm not interested in seeing this added anymore.

    gpshead commented 11 years ago

    consensus here appears to be "bad idea... don't do this."

    ncoghlan commented 11 years ago

    No, transform/untransform as methods are a bad idea, but these *codecs* should definitely come back.

    The minimal change needed for that to be feasible is to give errors raised during encoding and decoding more context information (at least the codec name and error mode, and switching to the right kind of error).

    MAL also stated on python-dev that codecs.encode and codecs.decode already exist, so it should just be a matter of documenting them properly.

    gpshead commented 11 years ago

    okay, but i don't personally find any of these to be good ideas as "codecs" given they don't have anything to do with translating between bytes\<->unicode.

    ncoghlan commented 11 years ago

    The codecs module is generic, text encodings are just the most common use case (hence the associated method API).

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 11 years ago

    I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do

      import codecs
      result = codecs.getencoder("base64").encode(data)

    I don't think people would actually prefer this over

      import base64
      result = base64.encodebytes(data)

    It's (IMO) only the convenience method (.encode) that made people love these codecs.
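    For reference, both routes produce the same output in Python 3; a sketch using the explicit "base64_codec" name (the short aliases were missing at the time of this comment) and noting that getencoder() actually returns the stateless encode callable rather than an object with an .encode method:

```python
import codecs
import base64

data = b"example"
# The codec route: the encode callable returns an (output, consumed) tuple.
via_codec = codecs.getencoder("base64_codec")(data)[0]
# The module route most people reach for instead.
via_module = base64.encodebytes(data)
assert via_codec == via_module == b"ZXhhbXBsZQ==\n"
```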

    ezio-melotti commented 11 years ago

    IMHO it's also a documentation problem. Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead. By reading the codecs docs it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead. FWIW I don't care about having to do an extra import, but indeed something simpler than codecs.getencoder("...").encode/decode would be nice.

    ncoghlan commented 11 years ago

    It turns out MAL added the convenience API I'm looking for back in 2004; it just never got documented, and it is hidden behind the "from _codecs import *" call in the codecs.py source code:

    http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

    So, all the way from 2.4 to 2.7 you can write:

      from codecs import encode
      result = encode(data, "base64")

    It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:

    >>> encode(b"example", "base64_codec")
    b'ZXhhbXBsZQ==\n'
    >>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
    b'example'

    Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):

    >>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
        return (base64.decodebytes(input), len(input))
      File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
        raise TypeError("expected bytes, not %s" % s.__class__.__name__)
    TypeError: expected bytes, not memoryview

    I'm going to create some additional issues, so this one can go back to just being about restoring the missing aliases.

    malemburg commented 11 years ago

    Just copying some details here about codecs.encode() and codec.decode() from python-dev:

    """ Just as reminder: we have the general purpose encode()/decode() functions in the codecs module:

    import codecs
    r13 = codecs.encode('hello world', 'rot-13')

    These interface directly to the codec interfaces, without enforcing type restrictions. The codec defines the supported input and output types. """

    As Nick found, these aren't documented, which is a documentation bug (I probably forgot to add documentation back then). They have been in Python since 2004:

    http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

    These APIs are nice for general purpose codec work, and that's why I added them back in 2004.

    For the codecs in question, it would still be nice to have a more direct way to access them via methods on the types that you typically use them with.

    ezio-melotti commented 11 years ago

    It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:

    FTR this is because of ff1261a14573 (see bpo-10807).

    ncoghlan commented 11 years ago

    bpo-17827 covers adding documentation for codecs.encode and codecs.decode

    bpo-17828 covers adding exception handling improvements for all encoding and decoding operations

    ncoghlan commented 11 years ago

    For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously).

    By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in bpo-10807, the encoding and decoding error messages should first be improved as discussed in bpo-17828.

    ncoghlan commented 11 years ago

    Also adding bpo-17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is that the base64 module doesn't accept arbitrary PEP 3118 compliant objects as input.

    ncoghlan commented 11 years ago

    I also created bpo-17841 to cover the fact that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released.

    ncoghlan commented 11 years ago

    With bpo-17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:

    >>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoder did not return a str object (type=bytes)

    ncoghlan commented 11 years ago

    I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model, it's just that in 2.x that model was effectively basestring\<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact they were undocumented went unnoticed.

    In 3.x, the changed text model meant the method API become limited to the Unicode codecs, making the function based API more important.

    ncoghlan commented 10 years ago

    For anyone interested, I have a patch up on bpo-17828 that produces the following output for various codec usage errors:

    >>> import codecs
    >>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
    
    >>> "hello".encode("bz2_codec")
    TypeError: 'str' does not support the buffer interface
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
    
    >>> "hello".encode("rot_13")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types

    ncoghlan commented 10 years ago

    Providing the 2to3 fixers in bpo-17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit).

    ncoghlan commented 10 years ago

    bpo-17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3.

    bpo-19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7.

    ncoghlan commented 10 years ago

    Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document.

    I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices.

    Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher.

    ncoghlan commented 10 years ago

    Victor is still -1, so to Python 3.5 it goes.

    ncoghlan commented 10 years ago

    The 3.4 portion of bpo-19619 has been addressed, so removing it as a dependency again.

    ncoghlan commented 10 years ago

    With bpo-19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897

    I'll be committing this shortly, after adjusting the patch to account for the bpo-19619 changes to the tests and What's New.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 10 years ago

    New changeset 5e960d2c2156 by Nick Coghlan in branch 'default': Close bpo-7475: Restore binary & text transform codecs http://hg.python.org/cpython/rev/5e960d2c2156

    ncoghlan commented 10 years ago

    Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transform defined terms in the glossary, etc.

    I'll probably aim for beta 2 for that.

    serhiy-storchaka commented 10 years ago

    Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 10 years ago

    New changeset d7950e916f20 by R David Murray in branch '3.3': bpo-7475: Remove references to '.transform' from transform codec docstrings. http://hg.python.org/cpython/rev/d7950e916f20

    New changeset 83d54ab5c696 by R David Murray in branch 'default': Merge bpo-7475: Remove references to '.transform' from transform codec docstrings. http://hg.python.org/cpython/rev/83d54ab5c696