Closed benjaminp closed 10 years ago
Antoine Pitrou added the comment: The fact that "there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes" is *precisely* a good reason for this to be discussed more visibly.
I don't think anyone is opposing discussing it. I don't personally think such a discussion would be useful, lots of points of view are represented on this ticket, but please feel free to raise it in whatever forum that you feel would be helpful. (Even if I did object to that I don't see how I could stop you :)).
I'm not sure what the "general case" is.
The "general case" that I'm referring to is the case of an application writing some protocol logic in terms of constructing some bytes objects and passing them to Twisted. In other words, Twisted relied upon Python to provide a convenient way to assemble your bytes into protocol messages, and that was removed in 3.x. We never provided one ourselves and I don't think it would be a particularly good idea to build that kind of basic string-manipulation functionality into Twisted rather than Python.
What I know from Twisted is there are many specific cases where, indeed, binary protocol strings are formed by string formatting, e.g. in the FTP implementation (and for good reason since those protocols are either ASCII or an ASCII superset).
These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor are they an "ASCII superset"); they are ASCII commands interspersed with binary data. It makes sense to treat them as bytes, not text. In many cases - such as when expressing a length, or a checksum - you _must_ treat them as bytes, or you will emit incorrect data on the wire. By the time you're dealing with text - if you ever are - you're already somewhere in the body of the protocol, decorated with appropriate metadata.
But my point about the "general case" is that when implementing a *new* protocol with ASCII commands, or maintaining an existing one, bytes-object formatting is a convenient, expressive and performant way to express the interpolation of values in the protocol stream.
As a workaround, it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.
Protocols like SMTP (c.f. "8-bit MIME") and HTTP put binary data in-line; do you suggest that gzipped content be encoded as latin1 so it can squeeze into python 3's str type? I thought the whole point of the porting pain here was to get a clean separation between bytes and text. This is exactly why I do not particularly want bytes.format() to allow the presence of strs as formatted values, although that *would* make porting certain things easier. It makes sense to do your encoding first, then interpolate.
Code running on both 2.x and 3.x will *by construction* have some performance pessimizations inside it. It is inherent to that strategy. Not saying this is necessarily a problem, but you should be aware of it.
This is certainly true *now*, but it doesn't necessarily have to be. Enhancements like this one could make this performance division go away. In any case, the reason that ported code suffers from a performance penalty is because python 3 has no efficient way of doing this type of bytes construction; even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.
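For concreteness, the three existing constructions named above can be put side by side. This is only a sketch for Python 3: the `SIZE` command and the variable names are made up for illustration, and `latin-1` is used here in place of `charmap` as the byte-preserving codec.

```python
# Hypothetical protocol line: b"SIZE <length> <name>\r\n"
length = 42
name = b"file\xff.bin"  # arbitrary 8-bit data, not valid UTF-8

# 1. b''.join(): every non-bytes value has to be converted by hand first.
via_join = b"".join([b"SIZE ", str(length).encode("ascii"), b" ", name, b"\r\n"])

# 2. b'' + b'': the same conversions, plus a temporary object per '+'.
via_concat = b"SIZE " + str(length).encode("ascii") + b" " + name + b"\r\n"

# 3. Format as text, then encode: the bytes take a detour through str.
via_text = "SIZE {} {}\r\n".format(length, name.decode("latin-1")).encode("latin-1")
```

All three produce the same message; the difference is in how much conversion code surrounds the template.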
On Jan 22, 2013, at 3:34 PM, Terry J. Reedy <report@bugs.python.org> wrote:
I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:
I'm glad that this operation has been optimized, but treating blocks of protocol data as text is a hackish workaround that still doesn't perform as well (even on 3.3+) as bytes formatting in 2.7.
[If speed is really an issue, we could make binary file/socket write methods unicode implementation aware. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]
Yes, speed is really an issue - this kind of message construction is on the critical path of many of the more popular protocols implemented with Twisted. But trying to work around the performance issue by pretending that strings are bytes will just give new life to old bugs. We've been loudly rejecting unicode from sockets I think for as long as Python has had unicode, and that's the way it should remain.
On Tuesday, 22 January 2013 at 23:34 +0000, Terry J. Reedy wrote:
Terry J. Reedy added the comment:
>it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.
I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-.
So what? We're discussing a feature that, at best, will be present in 3.4 and not before.
> What I know from Twisted is there are many specific cases where, indeed, binary protocol strings are formed by string formatting, e.g. in the FTP implementation (and for good reason since those protocols are either ASCII or an ASCII superset).
These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor are they an "ASCII superset"); they are ASCII commands interspersed with binary data.
The "ASCII superset commands" part is clearly separated from the "binary data" part. Your own LineReceiver is able to switch between "raw mode" and "line mode"; one is text and the other is binary.
In many cases - such as when expressing a length, or a checksum - you _must_ treat them as bytes, or you will emit incorrect data on the wire.
This is a non-sequitur. You can fully well take the len() of some *binary* data, format it using "%d" in a *string* Content-Length header, then encode the headers using utf-8 (or whatever encoding scheme the protocol mandates). Then at the end you concatenate the encoded headers and the body. I'm sure you're already doing the moral equivalent of this, except that the encoding step is absent.
So, yes, it is reasonably possible, and it even makes sense.
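The approach described here (format the headers as text, encode them, then append the binary body) might look roughly like the following. The function name `build_response` and the fixed status line are assumptions made for the example, not code from Twisted.

```python
def build_response(body):
    """Return encoded text headers followed by an untouched bytes body."""
    if not isinstance(body, bytes):
        raise TypeError("body must be bytes")
    # The length is computed from the binary data but formatted as text.
    headers = "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body)
    # Only the headers are encoded; the body is concatenated as-is.
    return headers.encode("ascii") + body


message = build_response(b"\x1f\x8b\x08\x00 not text")
```

Note that this only works when the header values themselves are known to be text; the objection elsewhere in this thread is that header values can also carry 8-bit data.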
This is exactly why I do not particularly want bytes.format() to allow the presence of strs as formatted values, although that *would* make porting certain things easier.
At this point, I would remind you that I'm not against bytes.format(), but I'd like it to be discussed in the open rather than on the bug tracker.
And, yes, starting that discussion is, IMO, the proponents' job :-)
even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.
How can existing constructions be slower than non-existing constructions that don't have performance numbers at all?
Besides, if b''.join() is too slow, it deserves to be improved. Or perhaps you should try bytearray instead, or even io.BytesIO.
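The two alternatives suggested here can be sketched as follows. The helper names are hypothetical; both accumulate chunks in a mutable buffer instead of creating a new bytes object per concatenation.

```python
import io


def assemble_with_bytearray(chunks):
    """Append each chunk to a growable bytearray, then freeze it."""
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
    return bytes(buf)


def assemble_with_bytesio(chunks):
    """Write each chunk to an in-memory binary stream."""
    stream = io.BytesIO()
    for chunk in chunks:
        stream.write(chunk)
    return stream.getvalue()


parts = [b"RETR ", b"file\xff.bin", b"\r\n"]
```

Neither helper formats numbers; ints and floats still have to be converted to bytes before they are appended.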
After re-reading everything, I have somewhat changed my mind on this proposal. Perhaps 3.0 threw out too much, making it overly difficult to do some things that were easy in 2.x and to write cross-version code.
String formatting converts all arguments to strings, using str as the default converter, but gives particular attention to formatting ints and floats. It then interpolates the resulting strings into the template string. Until msg180430, posted just half a day ago, I did not see a coherent idea of what bytes.format should be. The main problem is that there is no general bytes converter equivalent to str. I believe this is the core reason bytes.format was eliminated in 3.0.
Much of the discussion here and elsewhere has been about str.format + additions, where the additions would accommodate various possible conversions. But I now see that this was trying to do too much. Guido's subset proposal cuts this all out by proposing to only convert ints and floats as done in 2.x. So bytes.format would only convert ints and floats and otherwise would interpolate bytes into a bytes template. This should cover a large fraction of use cases. The user would be responsible for converting anything else, or converting ints and floats otherwise, with explicit calls to bytes, str.encode, struct.pack, or custom functions*.
I believe only two changes are needed to the specification of str.format, other than the obvious things like prefixing strings with 'b' and changing 'fill character' to 'fill byte'. Since general conversion would not be done, the '! conversion' field would be eliminated. In the format specifier, the default 's' would mean that the corresponding argument must be a bytes object, rather than any object converted by str.
# possible portability function for 'other' classes:
if py2: strb = str
else:
    def strb(ob): return str(ob).encode()
I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)
I don't believe it either. I find join consistently faster than format:
python2.7 -m timeit -s 'x = [b"x"*1000]*10' 'b"".join(x)'
1000000 loops, best of 3: 0.686 usec per loop
python2.7 -m timeit -s 'x = b"x"*1000' '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.37 usec per loop
Try longer strings, same results (though less pronounced):
python2.7 -m timeit -s 'x = [b"x"*10000]*10' 'b"".join(x)'
100000 loops, best of 3: 3.54 usec per loop
python2.7 -m timeit -s 'x = b"x"*10000' '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 7.35 usec per loop
I'm guessing the advantage of format() is that it allows the occasional formatting of a float or int.
And % is not significantly faster:
python2.7 -m timeit -s 'x = b"x"*1000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.31 usec per loop
python2.7 -m timeit -s 'x = b"x"*10000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 6.81 usec per loop
python2.7 -m timeit -s 'x = b"x"*100000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
1000 loops, best of 3: 565 usec per loop
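For anyone who wants to repeat this kind of measurement from a script rather than the command line, a rough equivalent using the `timeit` module is sketched below. It assumes Python 3.5 or later (so that `%` works on bytes); the loop counts are arbitrary and no timings are claimed here.

```python
import timeit

x = b"x" * 1000
chunks = [x] * 10
template = b"%s" * 10  # ten %s fields, as in the commands above


def with_join():
    return b"".join(chunks)


def with_percent():
    return template % tuple(chunks)


if __name__ == "__main__":
    loops = 10000
    for func in (with_join, with_percent):
        # Take the best of three runs, as the timeit CLI does.
        best = min(timeit.repeat(func, number=loops, repeat=3))
        print("%-14s %.3f usec per loop" % (func.__name__, best / loops * 1e6))
```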
I think ''.join() will always be faster than ''.format(), for a number of reasons (some already stated):
Whether b''.format() would have to lookup and call __format__ remains to be seen. From what I've read, maybe baking in knowledge of bytes, float, and int would be good enough. I suspect there might be some need for datetimes, but I could be wrong.
The above said, code using b''.format() would definitely be easier to write and understand than a lot of individual field formatting followed by a .join().
Whether b''.format() would have to lookup and call __format__ remains to be seen. From what I've read, maybe baking in knowledge of bytes, float, and int would be good enough. I suspect there might be some need for datetimes, but I could be wrong.
The __bytes__ method (and/or tp_buffer) may be a better discriminator than __format__. It would also allow combining arbitrary buffer objects without making tons of copies. What it also means is that "format()" may not be the best method name for this. It is less about formatting than about combining.
Also, it's not obvious what "formatting" a number as bytes should do. Should it mimic the bytes constructor:
>>> bytes(5)
b'\x00\x00\x00\x00\x00'
Should it mimic the int to_bytes() method:
>>> (5).to_bytes(4, 'little')
b'\x05\x00\x00\x00'
Numbers currently don't have a __bytes__ method:
>>> (5).__bytes__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '__bytes__'
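To make the ambiguity concrete, here are the candidate meanings of "an int as bytes" next to each other, including the ASCII decimal form that 2.x-style formatting produces. The variable names are only for illustration.

```python
n = 5

# bytes(int): a zero-filled buffer of that length.
zero_filled = bytes(n)

# int.to_bytes(): a fixed-width binary encoding of the value.
fixed_width = n.to_bytes(4, "little")

# What "%d" produced on 2.x byte strings: the decimal digits in ASCII.
ascii_decimal = str(n).encode("ascii")
```

Only the last of these is what a protocol header such as a Content-Length line needs.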
I retract the datetime comment. Given what we're trying to accomplish, I think we only need to support types that are supported by 2.7's %-formatting.
Remember, the only reason to add this would be to enable writing code that works in both 2.7 and 3.4. So it has to be called .format() and it has to format numbers as decimal strings by default.
On Jan 22, 2013, at 11:27 PM, Antoine Pitrou <report@bugs.python.org> wrote:
Antoine Pitrou added the comment:
The "ASCII superset commands" part is clearly separated from the "binary data" part. Your own LineReceiver is able to switch between "raw mode" and "line mode"; one is text and the other is binary.
This is incorrect. "Lines" are just CRLF (0x0D0A) separated chunks of data. For example, SMTP is always in line-mode, but messages ("data lines") may contain arbitrary 8-bit data.
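A small illustration of the point, using a made-up byte stream: splitting on CRLF yields "lines", but nothing guarantees that a given line is decodable text.

```python
# Hypothetical wire data: line-oriented, but one line carries raw 8-bit bytes.
stream = b"MAIL FROM:<a@example.org>\r\nX-Raw: \xff\xfe\x80\r\n.\r\n"

# "Lines" are just the chunks between CRLF separators.
lines = stream.split(b"\r\n")

# Treating every line as ASCII text fails on the second one.
try:
    lines[1].decode("ascii")
except UnicodeDecodeError:
    second_line_is_text = False
else:
    second_line_is_text = True
```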
This is a non-sequitur. You can fully well (...) So, yes, it is reasonably possible, and it even makes sense.
I concede it is possible to implement what you're talking about, but it still requires encoding things which are potentially 8-bit data. Yes, there are many corners of protocols where said data looks like text, but it is an optical illusion.
> even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.
How can existing constructions be slower than non-existing constructions that don't have performance numbers at all?
Sorry, "in 2.x" :).
Besides, if b''.join() is too slow, it deserves to be improved. Or perhaps you should try bytearray instead, or even io.BytesIO.
As others have noted, b''.join is *not* slower than b''.format for simply assembling strings; b''.join is indeed faster at that and I didn't mean to say it wasn't. The performance improvement shows up when you are assembling complex messages that contain a smattering of ints, floats, and other chunks of bytes; mostly in that you can avoid a bunch of python code execution and python function calls when formatting those values. The trouble with cooking up an example of this is that it starts to involve a bunch of additional code complexity and it requires careful framing to make sure the other complexity isn't what's getting in the way. I will try to come up with one, maybe doing so will prove even this contention wrong.
But, the main issue here is expressiveness, not performance.
On Jan 22, 2013, at 11:31 PM, Martin v. Löwis <report@bugs.python.org> wrote:
I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)
You're correct about this; see my previous comment.
On Jan 23, 2013, at 1:58 AM, Antoine Pitrou <report@bugs.python.org> wrote:
> Numbers currently don't have a __bytes__ method:
>
> >>> (5).__bytes__()
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AttributeError: 'int' object has no attribute '__bytes__'
They do have some rather odd behavior when passed to the builtin though:
>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
It would be much more convenient for me if bytes(int) returned the ASCIIfication of that int; but honestly, even an error would be better than this behavior. (If I wanted this behavior - which I never have - I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)
I would agree with you, but it's probably too late to change...
On Jan 23, 2013, at 11:02 AM, Antoine Pitrou <report@bugs.python.org> wrote:
I would agree with you, but it's probably too late to change...
Understandable, and, in any case, out of scope for this ticket.
So it sounds like the use case is (as Glyph said in msg180432):
In that case the only options I see are to implement __mod__ or .format for bytes in 3.4. I'd of course prefer to use .format, although __mod__ would probably make the transition easier (no need to move to .format first). It would probably also make the implementation easier, since there's so much less code in str.__mod__. But let's assume we're using .format [1].
Given the restricted use case, and assuming we're using .format, the implementation would not need to support:
But it would support all of the specifiers for formatting strs (except now for bytes), floats, and ints.
I haven't looked through the str.format or {str,int,float}.__format__ code since the PEP-393 work, so I'm not really sure if we could stringlib-ify the code again, or if it would just be easier to reimplement it as separate bytes-only code.
[1] It's open for debate whether .format or .__mod__ is preferable.
[2] Since %-formatting supports %r and %s, this point is arguable.
I'd like to put a nudge towards supporting the __mod__ interface on bytes - for Mercurial this is the single biggest impediment to even getting our testrunner working, much less starting the porting process.
Given a spec hasn't been written (bytes.__mod__ can't support the same things as str.__mod__), and nobody seems to step up to write it, I'd say this is unlikely to appear in 3.4.
Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.
Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
If you could write up a concrete proposal, including which format specifiers would be supported, that would be helpful.
Would it be extensible with something like __bformat__?
There's really quite a bit of work to be done to specify how this would work.
Also, with the PEP-393 changes, the implementation will be much more difficult. Sharing code with str (unicode) will likely be impossible, or require much refactoring of the existing code.
Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.
You mean "%s" and "%d"?
Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
Or write a pure Python implementation.
On Tue, Oct 8, 2013 at 11:08 AM, Antoine Pitrou <report@bugs.python.org> wrote:
> Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.
You mean "%s" and "%d"?
Basically, yes.
> Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
Or write a pure Python implementation.
Hah. Probably too slow for anything beyond a proof of concept, no?
On Oct 8, 2013, at 8:10 AM, Augie Fackler <report@bugs.python.org> wrote:
Hah. Probably too slow for anything beyond a proof of concept, no?
It should perform acceptably on PyPy ;-).
> > Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
>
> Or write a pure Python implementation.
Hah. Probably too slow for anything beyond a proof of concept, no?
If it's only for the Mercurial test suite, that shouldn't be a problem?
On Tue, Oct 8, 2013 at 5:11 PM, Antoine Pitrou <report@bugs.python.org> wrote:
Antoine Pitrou added the comment:
> > > Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
> >
> > Or write a pure Python implementation.
>
> Hah. Probably too slow for anything beyond a proof of concept, no?
If it's only for the Mercurial test suite, that shouldn't be a problem?
It's not just the testsuite though: we do this _all over_ hg itself. For example, status needs to do something like this:
sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})
except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes. It's even more fun for 'log', because then it's got localized strings in it as well.
2013/10/8 Augie Fackler <report@bugs.python.org>:
sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})
except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes.
You are doing it wrong. In Python 3, you "should" store filenames as Unicode (str type). If Python fails to decode a filename, undecodable bytes are stored as surrogate characters (see the PEP-383).
The Unicode type became natural in Python 3, as byte string (old "str" type) was natural in Python 2.
sys.stdout.write() expects a Unicode string, not a byte string.
Does it mean that Mercurial is moving to Python 3? Cool :-)
I've lost track of what we were talking about. I thought we were trying to support b'<something>'.format() in 3.4, for a restricted set of arguments.
I don't see how a third-party package is going to help, if the goal is to allow 3.4 to be source compatible with 2.7. And the recent example uses %-formatting, which is not the subject of this ticket.
What proposal is actually on the table here?
On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:
What proposal is actually on the table here?
Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.
On Oct 8, 2013, at 5:24 PM, STINNER Victor <report@bugs.python.org> wrote:
STINNER Victor added the comment:
2013/10/8 Augie Fackler <report@bugs.python.org>:
> sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})
>
> except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes.
You are doing it wrong. In Python 3, you "should" store filenames as Unicode (str type). If Python fails to decode a filename, undecodable bytes are stored as surrogate characters (see the PEP-383).
No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way. We're not able to change either our on-disk data format OR our stdout format, even to support a newer version of Python. I don't know the encoding of the filename's bytes, but I _must_ faithfully reproduce them exactly as they are or I'll break tools like make(1) and patch(1). Similarly, if a file goes from ISO-8859-1 to UTF-8, I have to emit a diff that has some ISO bytes and some UTF bytes - it's not in *any* valid encoding. Changing that is a showstopper regression.
The Unicode type became natural in Python 3, as byte string (old "str" type) was natural in Python 2.
sys.stdout.write() expects a Unicode string, not a byte string.
Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?
Does it mean that Mercurial is moving to Python 3? Cool :-)
Not likely, honestly. I tackle this when I've got some spare cycles and my ability to handle pain is high. As it stands, I have the test-runner barely working, but it's making wrong assumptions to get there. The best estimate is that it's a year of work to upgrade to Python 3.
On Oct 8, 2013, at 6:19 PM, Glyph Lefkowitz <report@bugs.python.org> wrote:
Glyph Lefkowitz added the comment:
On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:
> What proposal is actually on the table here?
Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.
Yeah, my bad too. I suppose I should add a new bug for %-formatting on bytes objects?
Note that for hg, we can't drop Python 2.6 or so (we'll only drop *2.4* if we can do 2.6 and some 3.x from a single source tree) for a while, due to supporting the system interpreter on a variety of LTS platforms.
Augie, to understand what Victor meant, I suggest reading PEP 383 (http://www.python.org/dev/peps/pep-0383/). One point of the PEP is to round-trip filenames without loss on all systems, which is just what you say you need.
On Oct 8, 2013, at 6:28 PM, "Terry J. Reedy" <report@bugs.python.org> wrote:
http://www.python.org/dev/peps/pep-0383/ One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.
At a quick skim, likely not good enough, because http://en.wikipedia.org/wiki/Shift_JIS isn't completely ASCII-compatible, and we've got a fair number of users on weird Shift-JIS using platforms.
On Oct 8, 2013, at 3:19 PM, Augie Fackler wrote:
No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way.
The PEP-383 technique for handling file names is completely capable of round-tripping exact bytes, given one encoding for both input and output. You can still handle file names this way internally in Mercurial and not risk disturbing any observable output. You do not need to change that in order to do what Victor suggests.
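The round-trip property can be checked directly with the `surrogateescape` error handler that PEP 383 introduced. The file name below is invented, and UTF-8 is assumed as the single encoding used in both directions.

```python
# A hypothetical file name containing bytes that are not valid UTF-8.
raw_name = b"caf\xe9/report\xff.txt"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) in the str.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the exact bytes.
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

`os.fsdecode()` and `os.fsencode()` apply the same handler with the filesystem encoding on POSIX systems. The round trip only holds if the same encoding is used on the way in and on the way out.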
We should get together in some other forum and discuss file-name handling though, since you can't actually round-trip "opaque bytes" through a *filesystem* and not disturb your output.
Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?
You can use sys.stdout.buffer.write.
Here is a proof of concept Python function, with a minimal test. It is similar to how str.format could be coded in Python, with re.split and ''.join, except that it does not allow anything before : in the format specification. By default (no format spec given), it copies bytes objects without change. If a format specification *is* given, it does not restrict the object, as this code simply uses builtin format sandwiched between decode and encode.
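A minimal sketch along the lines described, written for this thread rather than copied from the attached file: `bformat` is a hypothetical name, only positional `{}` and `{:spec}` fields are recognised, and `{{`/`}}` escapes are not handled.

```python
import re

# A field is "{}" or "{:spec}"; nothing is allowed before the colon.
_FIELD = re.compile(rb"\{(:[^{}]*)?\}")


def bformat(template, *args):
    """Interpolate args into a bytes template (proof of concept only)."""
    pieces = _FIELD.split(template)
    # re.split with one group alternates: literal, spec, literal, spec, ...
    literals, specs = pieces[0::2], pieces[1::2]
    if len(specs) != len(args):
        raise ValueError("expected %d arguments, got %d" % (len(specs), len(args)))
    out = [literals[0]]
    for spec, arg, literal in zip(specs, args, literals[1:]):
        if spec is None:
            # No format spec: the argument must already be bytes.
            if not isinstance(arg, (bytes, bytearray)):
                raise TypeError("a field without a format spec needs bytes")
            out.append(bytes(arg))
        else:
            # With a spec: builtin format() sandwiched between decode and encode.
            out.append(format(arg, spec[1:].decode("ascii")).encode("ascii"))
        out.append(literal)
    return b"".join(out)


example = bformat(b"SIZE {:d} {}\r\n", 42, b"\xff\x00")
```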
You can use sys.stdout.buffer.write.
Note that there's no guarantee that sys.stdout.buffer exists, e.g. if sys.stdout has been replaced with a StringIO.
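One way to write raw bytes while coping with a replaced `sys.stdout` is sketched below. The helper name `write_bytes` and the choice to raise `TypeError` when no binary layer exists are assumptions for the example.

```python
import sys


def write_bytes(data, stream=None):
    """Write bytes to the binary layer underneath a text stream."""
    if stream is None:
        stream = sys.stdout
    # Text streams wrapping a binary file expose it as .buffer;
    # replacements such as io.StringIO do not have one.
    buffer = getattr(stream, "buffer", None)
    if buffer is None:
        raise TypeError("stream has no underlying binary buffer")
    buffer.write(data)
    buffer.flush()
```

If text is also written through the same stream, it should be flushed first so that the two kinds of output appear in order.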
Tempting as it is to reply to the comment about 'buffer' not existing, we're way off topic here. Let's please keep further comments on this bug to issues about a 'format' method on the 'bytes' object.
First off, +1 for this feature. It's not just for twisted, but anyone doing anything with binary data (storage, compression, encryption and networking for me) with python since 2.6 will very likely have been using .format for building messages. I know I have and obviously others have been doing so as well.
The advantages of .format to me are:
Specific comments on the patch supplied by terry.reedy:
Really this isn't a good way to solve the problem.
Has a PEP been created for this? If not how can I help make that happen?
Including this in 3.5 would be so helpful for us low level systems programmers out there who have lots of code using .format for binary interfaces in python 2.6/2.7 already.
Also, not to add to derailment, but if we're adding a .format for python3 bytes it would be great if .format could pad with the null byte ('\0') which it currently converts to spaces internally (which is strange). Since this unexpected conversion is bad (so padding with null doesn't happen in python2) it's more like a bug fix... actually - maybe that's a separate bug to file on the current .format for text...
sorry, terry's patch does handle padding - just with the caveats i listed later. i should have removed that bullet.
http://legacy.python.org/dev/peps/pep-0461/ adds % formatting for bytes and bytearray.
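As a rough illustration of what PEP 461 provides, assuming Python 3.5 or later: bytes values and numbers can be interpolated into a bytes template with `%`, while `str` arguments to `%s` are rejected. The status line mirrors the Mercurial example earlier in the thread; the path is made up.

```python
state = b"M"
path = b"some/\xff\xfe/path"  # opaque bytes in no particular encoding

# %s accepts bytes-like objects and interpolates them unchanged.
status_line = b"%s %s\n" % (state, path)

# %d formats an int as ASCII decimal digits, as on 2.x.
header = b"Content-Length: %d\r\n" % 1024

# Text is not silently encoded: passing str to %s raises TypeError.
try:
    b"%s" % "text"
except TypeError:
    str_rejected = True
else:
    str_rejected = False
```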
Nick, I have the impression that there was a decision to not add bytes.format. Correct? If so, this issue should be closed. If not, what, if anything, has been decided?
Right, bytes.format was considered as part of the PEP-461 discussions, and rejected as an operation that only made sense in the text domain: http://www.python.org/dev/peps/pep-0461/#proposed-variations
With PEP-461 accepted, and PEP-460 withdrawn, that means we won't be adding bytes.format and bytearray.format.
bpo-20284 covers the implementation of PEP-461.
This came up in the language summit today when discussing twisted. .format() is still not supported on bytes though % is in 3.5.
realistically it sounded like twisted needs to support python 3.4 for many years so they can't rely on bytes having a .format() method that also works on 2.7 anyways... but assuming .format() is only useful for text may still have been an oversight. (i'll have to go re-read PEP-460 and 461 and discussion before commenting further)
Gregory - I'm glad that you're willing to consider this again. It still is a constant issue for me, and .format with variable width fields in binary protocols is so the right tool for the job. If there is anything I can do to help get this added to 3.6 let me know. The forward/backward compatibility issue is secondary to me to the flexibility gained from having .format available for bytes.
Also padding with null bytes that don't get converted would be awesome.
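Without a bytes `.format`, null padding of a fixed-width field can still be done with methods that already exist on bytes. A short sketch with an arbitrary 8-byte field width:

```python
import struct

field = b"ab"

# bytes.ljust() pads on the right with the given fill byte.
padded = field.ljust(8, b"\x00")

# struct's "8s" code also pads short values with null bytes.
packed = struct.pack("8s", field)
```

This covers fixed widths only; a width computed at run time has to be passed to `ljust()` or built into the struct format string by hand.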
The core problem with the idea of adding bytes.format to Python 3 is that the real power of str.format actually lies in the extensible __format__ protocol and the associated format() builtin, as those rely heavily on text-specific assumptions.
I interpreted Amber's comments at the language summit as referring more to our changing tune regarding mod formatting from:
Folks that followed our original "stop using mod formatting" guidance thus needed to change course when it became our recommended technique for formatting binary data.
Since we now know format() and __format__ aren't suitable for binary data (PEP-361 originally included it, and it got dropped as we kept finding awkward corner cases), that means any new binary formatting proposal needs to explain:
I could have sworn that bytes.format had been implemented. When I needed it once, I came to the realization that this method never existed in Python 3.0, but it did in Python 2.7.
I also remember that bytes.format triggered an error if the input was of data type str.
Who else has this false memory? Is this the Mandela effect?
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['interpreter-core', 'type-feature']
title = 'support .format for bytes'
updated_at =
user = 'https://github.com/benjaminp'
```
bugs.python.org fields:
```python
activity =
actor = 'ncoghlan'
assignee = 'none'
closed = True
closed_date =
closer = 'ncoghlan'
components = ['Interpreter Core']
creation =
creator = 'benjamin.peterson'
dependencies = []
files = ['32009']
hgrepos = []
issue_num = 3982
keywords = []
message_count = 95.0
messages = ['73931', '73935', '73936', '73937', '73938', '73939', '74019', '74021', '74022', '74050', '84121', '84123', '90421', '90423', '90425', '90428', '127210', '130215', '130253', '130284', '163369', '163379', '171791', '171795', '171796', '171799', '171800', '171801', '171803', '171804', '171806', '171815', '171816', '171821', '171824', '180414', '180415', '180416', '180419', '180420', '180423', '180426', '180427', '180430', '180431', '180432', '180433', '180436', '180437', '180439', '180441', '180442', '180445', '180446', '180447', '180448', '180449', '180452', '180453', '180454', '180466', '180489', '180490', '180491', '180492', '180493', '180500', '198112', '199181', '199199', '199203', '199204', '199206', '199207', '199251', '199253', '199254', '199258', '199260', '199264', '199265', '199266', '199267', '199268', '199270', '199271', '199432', '199438', '223976', '223979', '224022', '224023', '266568', '268157', '268160']
nosy_count = 26.0
nosy_names = ['loewis', 'barry', 'brett.cannon', 'terry.reedy', 'gregory.p.smith', 'exarkun', 'ncoghlan', 'pitrou', 'vstinner', 'eric.smith', 'christian.heimes', 'benjamin.peterson', 'glyph', 'ezio.melotti', 'durin42', 'Arfrever', 'arjennienhuis', 'flox', 'ecir.hana', 'uau', 'tshepang', 'underrun', 'martin.panter', 'serhiy.storchaka', 'nlevitt@gmail.com', 'stendec']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue3982'
versions = ['Python 3.5']
```