Closed eda57068-96ad-4b33-8431-9c528f59a6a6 closed 10 years ago
bpo-13600 has been marked as a duplicate of this issue.
FTR, +1 to the idea of adding encoded_format and decoded_format attributes to CodecInfo, and also to adding {str,bytes}.{transform,untransform} back.
What is the status of this issue? Is there still a fan of this issue motivated to write a PEP, a patch or something like that?
It's still on my radar to come back and have a look at it. Feedback from the web folks doing Python 3 migrations is that it would have helped them in quite a few cases.
I want to get a couple of other open PEPs out of the way first, though (mainly 394 and 409)
My current opinion is that this should be a PEP for 3.4, to make sure we flush out all the corner cases and other details correctly.
For that matter, with the relevant codecs restored in 3.2, a transform() helper could probably be added to six (or a new project on PyPI) to prototype the approach.
Setting as a release blocker for 3.4 - this is important.
FWIW, I've been thinking further about this recently, and I now think implementing this feature as builtin methods is the wrong way to approach it.
Instead, I propose the addition of codecs.encode and codecs.decode functions that are type neutral (leaving any type checks entirely up to the codecs themselves), while the str.encode and bytes.decode methods retain their current strict text-model-related type restrictions.
Also, I now think my previous proposal for nice error messages was massively over-engineered. A much simpler approach is to just replace the status quo:
>>> "".encode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ncoghlan/devel/py3k/Lib/encodings/bz2_codec.py", line 17, in bz2_encode
return (bz2.compress(input), len(input))
File "/home/ncoghlan/devel/py3k/Lib/bz2.py", line 443, in compress
return comp.compress(data) + comp.flush()
TypeError: 'str' does not support the buffer interface
with a better error with more context like:
UnicodeEncodeError: encoding='bz2_codec', errors='strict', codec_error="TypeError: 'str' does not support the buffer interface"
A similar change would be straightforward on the decoding side.
This would be a good use case for __cause__, but the codec error should still be included in the string representation.
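To make the idea concrete, here is a minimal sketch (not part of any patch on this issue; the wrapper name is hypothetical) of how __cause__ could carry the original codec failure while the visible error names the codec and error mode:

```python
import codecs

# Sketch only: wrap codecs.encode so that any internal codec failure is
# re-raised with the codec name in the message, chaining the original
# exception via __cause__ so it still appears in the traceback.
def encode_with_context(data, encoding, errors="strict"):
    try:
        return codecs.encode(data, encoding, errors)
    except Exception as exc:
        msg = "%r codec failed (%s: %s)" % (encoding, type(exc).__name__, exc)
        raise TypeError(msg) from exc
```

Calling `encode_with_context("", "bz2_codec")` then raises a TypeError naming the codec, with the underlying buffer-interface TypeError attached as its __cause__.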
Many have chimed in on this topic but I thought I would lend my stance--for whatever it is worth.
I also believe most of these do not fit the concept of a character codec, and some sort of transform API would likely be useful; however, most are somewhat specialized (e.g., there should probably be a generalized compression library interface à la hashlib):
rot13: a (albeit simplistic) text cipher (str to str; though bytes to bytes could be argued, since many crypto functions do that)
zlib, bz2, etc. (lzma/xz should also be here): all bytes to bytes compression transforms
hex(adecimal), uu, base64, etc.: these more or less fit the description of a character codec, as they map between bytes and str. However, I am not sure they are really the same thing, as these are basically doing a radix transformation to character symbols, and the mapping is not strictly from bytes to a single character and back, as a true character codec seems to imply. As evidenced by int(), format(), bytes.fromhex(), float.hex(), float.fromhex(), etc., these are more generalized conversions for serializing strings of bits into a textual representation (possibly for human consumption).
I personally feel any <type/class>.hex(), etc. method would be better off as a format()-style formatter, if they are to exist in such a space at all (i.e., not in some more generalized conversion library--which we have, but which could probably stand to be updated and cleaned up for 3.x).
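For illustration, the radix-style conversions mentioned above are already covered by existing builtins and methods, with no codec lookup involved:

```python
# Existing builtin/format-machinery equivalents of the radix-style
# "codecs" discussed above -- no codec machinery required.
assert format(255, 'x') == 'ff'                          # int -> hex text
assert int('ff', 16) == 255                              # hex text -> int
assert bytes.fromhex('deadbeef') == b'\xde\xad\xbe\xef'  # hex text -> bytes
assert b'\xde\xad\xbe\xef'.hex() == 'deadbeef'           # bytes -> hex text
assert float.fromhex('0x1.8p+1') == 3.0                  # hex text -> float
```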
Another rant, because it matters to many of us: http://lucumr.pocoo.org/2012/8/11/codec-confusion/
IMHO, the solution of restoring str.decode and bytes.encode and raising TypeError on improper use is probably the most obvious for the average user.
-1 I see encoding as the process to go from text to bytes, and decoding the process to go from bytes to text, so (ab)using these terms for other kind of conversions is not an option IMHO.
Anyway I think someone should write a PEP and list the possible options and their pro and cons, and then a decision can be taken on python-dev.
FTR in Python 2 you can use decode for bytes->text, text->text, bytes->bytes, and even text->bytes:
>>> u'DEADBEEF'.decode('hex')
'\xde\xad\xbe\xef'
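(For comparison, the closest Python 3 equivalents go through the module-level codecs functions rather than the methods; before the aliases were restored, the "_codec" suffix was required:)

```python
import codecs

# Python 3 counterparts of the Python 2 str.decode('hex') trick, using
# the generic codecs functions instead of the str/bytes methods.
assert codecs.decode(b'DEADBEEF', 'hex_codec') == b'\xde\xad\xbe\xef'
assert codecs.encode(b'\xde\xad\xbe\xef', 'hex_codec') == b'deadbeef'
```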
transform/untransform has approval-in-principle, adding encode/decode to the type that doesn't have them has been explicitly (and repeatedly :) rejected.
(I don't know about anybody else, but at this point I have written code that assumes that if an object has an 'encode' method, calling it will get me a bytes, and vice versa with 'decode'...an assumption I know is not "safe", but that I feel is useful duck typing in the contexts in which I used it.)
Nick wants a PEP, other people have said a PEP isn't necessary. What is certainly necessary is for someone to pick up the ball and run with it.
I am not a native English speaker, but it seems that the common usage of encode/decode is wider than the restricted definition applied for Python 3.3:
Some examples:
RFC 4648 specifies "Base16, Base32, and Base64 Data Encodings" http://tools.ietf.org/html/rfc4648
About rot13: "the same code can be used for encoding and decoding" http://www.catb.org/~esr/jargon/html/R/rot13.html
The Huffman coding is "an entropy encoding algorithm" (used for DEFLATE) http://en.wikipedia.org/wiki/Huffman_coding
RFC 2616 lists (zlib's) deflate or gzip as "encoding transformations" http://tools.ietf.org/html/rfc2616#section-3.5
However, I acknowledge that there are valid reasons to choose a different verb too.
While not strictly necessary, a PEP would certainly be useful and would help reach a consensus. The PEP should provide a summary of the available options (transform/untransform, reintroducing encode/decode for bytes/str, maybe others), their intended behavior (e.g. is type(x.transform()) == type(x) always true?), and possible issues (e.g. should some transformations be limited to str or bytes? Should rot13 work with both transform and untransform?). Even if we all agreed on a solution, such a document would still be useful IMHO.
+1 for someone stepping up to write a PEP on this if they would like to see the situation improved in 3.4.
transform/untransform has at least one core developer with an explicit -1 on the proposal at the moment (me).
We *definitely* need a generic object->object convenience API in the codecs module (codecs.decode, codecs.encode). I even accept that those two functions could be worthy of elevation to be new builtin functions.
I'm *far* from convinced that awkwardly named methods that only handle str->object, bytes->object and bytearray->object are a good idea. Should memoryview gain transform/untransform methods as well?
transform/untransform as proposed aren't even inverse operations, since they don't swap the valid input and output types (that is, transform is str/bytes/bytearray to arbitrary objects, while untransform is *also* str/bytes/bytearray to arbitrary objects. Inverses can't have a domain/range mismatch like that).
Those names are also ambiguous about which one corresponds to "encoding" and which to "decoding". encode() and decode(), whether as functions in the codecs module or as builtins, have no such issue.
Personally, the more I think about it, the more I'm in favour of adding encode and decode as builtin functions for 3.4. If you want arbitrary object->object conversions, use the builtins, if you want strict str->bytes or bytes/bytearray->str use the methods. Python 3 has been around long enough now, and Python 3.2 and 3.3 are sufficiently well known that I think we can add the full power builtins without people getting confused.
I was visualizing transform/untransform as being restricted to buffertype->bytes and stringtype->string, which at least for binascii-type transforms is all the modules support. After all, you don't get to choose what type of object you get back from encode or decode.
A more generalized transformation (encode/decode) utility is also interesting, but how many non-string non-bytes transformations do we actually support?
If transform is a method, how do you plan to accept arbitrary buffer supporting types as input?
This is why I mentioned memoryview: it doesn't provide decode(), but there's no good reason you should have to copy the data from the view before decoding it. Similarly, you shouldn't have to make an unaltered copy before creating a compressed (or decompressed) copy.
With codecs.encode and codecs.decode as functions, supporting memoryview as an input for bytes->str decoding, binary->bytes encoding (e.g. gzip compression) and binary->bytes decoding (e.g. gzip decompression) is trivial. Ditto for array.array and anything else that supports the buffer protocol.
With transform/untransform as methods? No such luck.
And once you're using functions rather than methods, it's best to define the API as object -> object, and leave any type constraints up to the individual codecs (with the error handling improved to provide more context and a more meaningful exception type, as I described earlier in the thread)
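The memoryview point is easy to demonstrate with the existing functions (a small sketch using codecs as shipped in 3.4+):

```python
import codecs

# Buffer-protocol inputs work directly with the function-based API,
# so no intermediate copy of the data is required.
view = memoryview(b"example data")
assert codecs.decode(view, "utf-8") == "example data"       # binary -> str
compressed = codecs.encode(view, "zlib_codec")              # binary -> bytes
assert codecs.decode(compressed, "zlib_codec") == b"example data"
```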
I agree with you. transform/untransform are parallel to encode/decode, and I wouldn't expect them to exist on any type that didn't support either encode or decode. They are convenience methods, just as encode/decode are.
I am also probably not invested enough in it to write the PEP :)
str.decode() and bytes.encode() are not coming back.
Any proposal had better take into account the API design rule that the *type of a method's return value should not depend on the *value of one of the arguments. (The Python 2 design failed this test, and that's why we changed it.)
It is however fine to let the return type depend on one of the argument *types*. So e.g. bytes.transform(enc) -> bytes and str.transform(enc) -> str are fine. And so are e.g. transform(bytes, enc) -> bytes and transform(str, enc) -> str. But a transform() taking bytes that can return either str or bytes depending on the encoding name would be a problem.
Personally I don't think transformations are so important or ubiquitous so as to deserve being made new bytes/str methods. I'd be happy with a convenience function, for example transform(input, codecname), that would have to be imported from somewhere (maybe the codecs module).
My guess is that in almost all cases where people are demanding to say e.g.
x = y.transform('rot13')
the codec name is a fixed literal, and they are really after minimizing the number of imports. Personally, disregarding the extra import line, I think
x = rot13.transform(y)
looks better though. Such custom APIs also give the API designer (of the transformation) more freedom to take additional optional parameters affecting the transformation, offer a set of variants, or a richer API.
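The type rule stated above can be captured in a small helper (the transform() name is hypothetical; this sketch just builds on the existing codecs.encode function): the return type is allowed to depend on the input *type*, never on the codec name's *value*.

```python
import codecs

# Hypothetical transform() obeying the rule above: delegate to the
# codec, then verify the output type mirrors the input type.
def transform(data, name):
    result = codecs.encode(data, name)
    if type(result) is not type(data):
        raise TypeError("codec %r changed %s into %s"
                        % (name, type(data).__name__, type(result).__name__))
    return result
```

Under this rule, transform('abc', 'rot_13') and transform(b'abc', 'zlib_codec') are fine, while a type-changing codec such as 'utf-8' is rejected.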
FWIW, I'm not interested in seeing this added anymore.
consensus here appears to be "bad idea... don't do this."
No, transform/untransform as methods are a bad idea, but these *codecs* should definitely come back.
The minimal change needed for that to be feasible is to give errors raised during encoding and decoding more context information (at least the codec name and error mode, and switching to the right kind of error).
MAL also stated on python-dev that codecs.encode and codecs.decode already exist, so it should just be a matter of documenting them properly.
Okay, but I don't personally find any of these to be good ideas as "codecs", given they don't have anything to do with translating between bytes<->unicode.
The codecs module is generic, text encodings are just the most common use case (hence the associated method API).
I don't see any point in merely bringing the codecs back, without any convenience API to use them. If I need to do
import codecs
result = codecs.getencoder("base64").encode(data)
I don't think people would actually prefer this over
import base64
result = base64.encodebytes(data)
It's (IMO) only the convenience method (.encode) that made people love these codecs.
IMHO it's also a documentation problem. Once people figure out that they can't use encode/decode anymore, it's not immediately clear what they should do instead. By reading the codecs docs it's not obvious that it can be done with codecs.getencoder("...").encode/decode, so people waste time finding a solution, get annoyed, and blame Python 3 because it removed a simple way to use these codecs without making clear what should be used instead. FWIW I don't care about having to do an extra import, but indeed something simpler than codecs.getencoder("...").encode/decode would be nice.
It turns out MAL added the convenience API I'm looking for back in 2004, it just didn't get documented, and is hidden behind the "from _codecs import *" call in the codecs.py source code:
http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598
So, all the way from 2.4 to 2.7 you can write:
from codecs import encode
result = encode(data, "base64")
It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:
>>> encode(b"example", "base64_codec")
b'ZXhhbXBsZQ==\n'
>>> decode(b"ZXhhbXBsZQ==\n", "base64_codec")
b'example'
Note that the convenience functions omit the extra checks that are part of the methods (although I admit the specific error here is rather quirky):
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.2/encodings/base64_codec.py", line 20, in base64_decode
return (base64.decodebytes(input), len(input))
File "/usr/lib64/python3.2/base64.py", line 359, in decodebytes
raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not memoryview
I'm going to create some additional issues, so this one can return to just being about restoring the missing aliases.
Just copying some details here about codecs.encode() and codecs.decode() from python-dev:
""" Just as reminder: we have the general purpose encode()/decode() functions in the codecs module:
import codecs
r13 = codecs.encode('hello world', 'rot-13')
These interface directly to the codec interfaces, without enforcing type restrictions. The codec defines the supported input and output types. """
As Nick found, these aren't documented, which is a documentation bug (I probably forgot to add documentation back then). They have been in Python since 2004:
http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598
These API are nice for general purpose codec work and that's why I added them back in 2004.
For the codecs in question, it would still be nice to have a more direct way to access them via methods on the types that you typically use them with.
It works in 3.x as well, you just need to add the "_codec" to the end to account for the missing aliases:
FTR this is because of ff1261a14573 (see bpo-10807).
bpo-17827 covers adding documentation for codecs.encode and codecs.decode
bpo-17828 covers adding exception handling improvements for all encoding and decoding operations
For me, the killer argument *against* a method based API is memoryview (and, equivalently, array.array). It should be possible to use those as inputs for the bytes->bytes codecs, and once you endorse codecs.encode and codecs.decode for that use case, it's hard to justify adding more exclusive methods to the already broad bytes and bytearray APIs (particularly given the problems with conveying direction of conversion unambiguously).
By contrast, I think "the codecs functions are generic while the str, bytes and bytearray methods are specific to text encodings" is something we can explain fairly easily, thus allowing the aliases mentioned in this issue to be restored for use with the codecs module functions. To avoid reintroducing the quirky errors described in bpo-10807, the encoding and decoding error messages should first be improved as discussed in bpo-17828.
Also adding bpo-17839 as a dependency, since part of the reason the base64 errors in particular are so cryptic is that the base64 module doesn't accept arbitrary PEP-3118 compliant objects as input.
I also created bpo-17841 to cover the fact that the 3.3 documentation incorrectly states that these aliases still exist, even though they were removed before 3.2 was released.
With bpo-17839 fixed, the error from invoking the base64 codec through the method API is now substantially more sensible:
>>> b"ZXhhbXBsZQ==\n".decode("base64_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: decoder did not return a str object (type=bytes)
I just wanted to note something I realised in chatting to Armin Ronacher recently: in both Python 2.x and 3.x, the encode/decode method APIs are constrained by the text model; it's just that in 2.x that model was effectively basestring<->basestring, and thus still covered every codec in the standard library. This greatly limited the use cases for the codecs.encode/decode convenience functions, which is why the fact that they were undocumented went unnoticed.
In 3.x, the changed text model meant the method API become limited to the Unicode codecs, making the function based API more important.
For anyone interested, I have a patch up on bpo-17828 that produces the following output for various codec usage errors:
>>> import codecs
>>> codecs.encode(b"hello", "bz2_codec").decode("bz2_codec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bz2_codec' decoder returned 'bytes' instead of 'str'; use codecs.decode to decode to arbitrary types
>>> "hello".encode("bz2_codec")
TypeError: 'str' does not support the buffer interface
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid input type for 'bz2_codec' codec (TypeError: 'str' does not support the buffer interface)
>>> "hello".encode("rot_13")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'rot_13' encoder returned 'str' instead of 'bytes'; use codecs.encode to encode to arbitrary types
Providing the 2to3 fixers in bpo-17823 now depends on this issue rather than the other way around (since not having to translate the names simplifies the fixer a bit).
bpo-17823 is now closed, but not because it has been implemented. It turns out that the data driven nature of the incompatibility means it isn't really amenable to being detected and fixed automatically via 2to3.
bpo-19543 is a replacement proposal for the introduction of some additional codec related Py3k warnings in Python 2.7.7.
Attached patch restores the aliases for the binary and text transforms, adds a test to ensure they exist and restores the "Aliases" column to the relevant tables in the documentation. It also updates the relevant section in the What's New document.
I also tweaked the wording in the docs to use the phrases "binary transform" and "text transform" for the affected tables and version added/changed notices.
Given the discussions on python-dev, the main condition that needs to be met before I commit this is for Victor to change his current -1 to a -0 or higher.
Victor is still -1, so to Python 3.5 it goes.
The 3.4 portion of bpo-19619 has been addressed, so removing it as a dependency again.
With bpo-19619 resolved for Python 3.4 (the issue itself remains open awaiting a backport to 3.3), Victor has softened his stance on this topic and given the go ahead to restore the codec aliases: http://bugs.python.org/issue19619#msg203897
I'll be committing this shortly, after adjusting the patch to account for the bpo-19619 changes to the tests and What's New.
New changeset 5e960d2c2156 by Nick Coghlan in branch 'default': Close bpo-7475: Restore binary & text transform codecs http://hg.python.org/cpython/rev/5e960d2c2156
Note that I still plan to do a documentation-only PEP for 3.4, proposing some adjustments to the way the codecs module is documented, making binary and text transforms defined terms in the glossary, etc.
I'll probably aim for beta 2 for that.
Docstrings for new codecs mention bytes.transform() and bytes.untransform() which are nonexistent.
New changeset d7950e916f20 by R David Murray in branch '3.3': bpo-7475: Remove references to '.transform' from transform codec docstrings. http://hg.python.org/cpython/rev/d7950e916f20
New changeset 83d54ab5c696 by R David Murray in branch 'default': Merge bpo-7475: Remove references to '.transform' from transform codec docstrings. http://hg.python.org/cpython/rev/83d54ab5c696
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/ncoghlan'
closed_at =
created_at =
labels = ['type-feature', 'library', 'expert-unicode']
title = 'codecs missing: base64 bz2 hex zlib hex_codec ...'
updated_at =
user = 'https://github.com/florentx'
```
bugs.python.org fields:
```python
activity =
actor = 'python-dev'
assignee = 'ncoghlan'
closed = True
closed_date =
closer = 'python-dev'
components = ['Library (Lib)', 'Unicode']
creation =
creator = 'flox'
dependencies = ['17828', '17839', '17844']
files = ['15523', '15526', '32663']
hgrepos = []
issue_num = 7475
keywords = ['patch']
message_count = 95.0
messages = ['96218', '96223', '96226', '96227', '96228', '96232', '96236', '96237', '96240', '96242', '96243', '96251', '96253', '96265', '96277', '96295', '96296', '96301', '96374', '96632', '106669', '106670', '106674', '107057', '107794', '109872', '109876', '109879', '109894', '109904', '109905', '123090', '123154', '123206', '123435', '123436', '123462', '123693', '125073', '145246', '145656', '145693', '145897', '145900', '145979', '145980', '145982', '145986', '145991', '145998', '149439', '153304', '153317', '164224', '164226', '164237', '165435', '170414', '187630', '187631', '187634', '187636', '187638', '187644', '187649', '187651', '187652', '187653', '187660', '187668', '187670', '187673', '187676', '187695', '187696', '187698', '187701', '187702', '187705', '187707', '187764', '187770', '198845', '198846', '202130', '202264', '202515', '203124', '203378', '203751', '203936', '203942', '203944', '207283', '213502']
nosy_count = 22.0
nosy_names = ['lemburg', 'loewis', 'barry', 'georg.brandl', 'gregory.p.smith', 'jcea', 'cben', 'ncoghlan', 'belopolsky', 'vstinner', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'ssbarnea', 'flox', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka', 'pconnell', 'isoschiz']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue7475'
versions = ['Python 3.4']
```