Closed benjaminp closed 10 years ago
I'm just working on porting some networking code from 2.x to 3.x and it heavily uses string formatting. Since bytes don't support any kind of formatting, it's becoming tedious and inelegant to do it with "+". Can .format be supported in bytes?
[I understand format is implemented with stringlib so shouldn't it be fairly easy to implement?]
Yes, it would be easy to add. Maybe bring this up on python-dev (or python-3000) to get consensus?
Are we in feature freeze for 3.0?
On Sat, Sep 27, 2008 at 12:33 PM, Eric Smith <report@bugs.python.org> wrote:
> Eric Smith <eric@trueblade.com> added the comment:
> Yes, it would be easy to add. Maybe bring this up on python-dev (or python-3000) to get consensus?
Yes, that will have to be done.
> Are we in feature freeze for 3.0?
Unfortunately, yes.
I'm skeptical. What networking code specifically are you using, and what specifically does it use string formatting for?
On Sat, Sep 27, 2008 at 12:35 PM, Martin v. Löwis <report@bugs.python.org> wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> I'm skeptical. What networking code specifically are you using, and what specifically does it use string formatting for?
I'm working on the tests for ftplib. [1] The dummy server uses string formatting to build responses.
[1] http://svn.python.org/view/python/trunk/Lib/test/test_ftplib.py?view=markup
> I'm working on the tests for ftplib. [1] The dummy server uses string formatting to build responses.
I see. I propose to add a method push_string, defined as
    def push_string(self, s):
        self.push(s.encode("ascii"))
In FTP, the responses are, by definition, ASCII-encoded strings. The proper way to generate them is to make a string, then encode it.
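A minimal sketch of the proposed helper, assuming a handler whose `push` method accepts bytes. The class name `DummyHandler` and its `sent` list are hypothetical stand-ins for the test server's real handler, not code from test_ftplib.

```python
class DummyHandler:
    """Hypothetical stand-in for the dummy FTP server's handler."""

    def __init__(self):
        self.sent = []  # records the bytes that would go to the socket

    def push(self, data):
        # A real handler would queue the bytes for sending; here we record them.
        self.sent.append(data)

    def push_string(self, s):
        # Build the response as text, then encode once at the boundary.
        self.push(s.encode("ascii"))


handler = DummyHandler()
handler.push_string("{code} {message}\r\n".format(code=220, message="welcome"))
```

The formatting stays on str, and the single `encode` call is the only place where text becomes bytes.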
I don't think that b'...'.format() is a good idea. Programmers will continue to mix characters and bytes, since .format() targets characters.
> I don't think that b'...'.format() is a good idea. Programmers will continue to mix characters and bytes, since .format() targets characters.
b''.format() would return bytes, not a string. This is also how it works in 2.6.
I'm also not sold on implementing it, although it would be easy and I can see a few uses for it. I think Martin's suggestion of encoding back to ascii might be the best thing to do (that is, don't implement b''.format()).
> I think Martin's suggestion of encoding back to ascii might be the best thing to do
As I understand, you would like to use bytes as characters, like b'{code} {message}'.format(code=100, message='OK'). So why not use explicit conversion to ASCII? ftp='{code} {message}'.format(code=100, message='OK').encode('ASCII').
If you need to work on bytes, it means that you will use the full range 0..255, whereas ASCII rejects bytes in 128..255.
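The two cases above can be illustrated with a short sketch. The variable names are made up for the example; the point is only that format-then-encode works for pure ASCII text and fails as soon as non-ASCII characters or bytes >= 128 are involved.

```python
# Pure ASCII text: format as str, then encode explicitly.
response = "{code} {message}".format(code=100, message="OK").encode("ascii")

# Text outside the ASCII range cannot be encoded this way.
try:
    "caf\u00e9".encode("ascii")
    text_is_ascii = True
except UnicodeEncodeError:
    text_is_ascii = False

# A binary payload with bytes >= 128 cannot be treated as ASCII text either.
payload = bytes([0x00, 0x7F, 0x80, 0xFF])
try:
    payload.decode("ascii")
    payload_is_ascii = True
except UnicodeDecodeError:
    payload_is_ascii = False
```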
> > I think Martin's suggestion of encoding back to ascii might be the best thing to do
> As I understand, you would like to use bytes as characters, like b'{code} {message}'.format(code=100, message='OK'). So why not use explicit conversion to ASCII? ftp='{code} {message}'.format(code=100, message='OK').encode('ASCII').
That's indeed exactly what I had proposed - only that you shouldn't repeat the .encode('ascii') all over the place, but instead wrap that into a function (which I proposed to call push_string, along with the existing .push function).
loewis> That's indeed exactly what I had proposed
loewis> - only that you shouldn't repeat the .encode('ascii')
loewis> all over the place, (...)
If you can only use bytes 0..127, it cannot be used for binary protocols and so I don't think that it's really useful. If your protocol is ASCII text, use explicit conversion to ASCII.
I'm also not a fan of functions having different result types (format->bytes or str, it depends...).
> I'm also not a fan of functions having different result types (format->bytes or str, it depends...).
In 3.x, str.format() and bytes.format() would be two different methods on two different objects. I don't think there's any expectation that they have the same return type. There's no such expectation for str.strip() and bytes.strip() either.
Similarly, in 2.6, str.format() has a different return type than unicode.format().
Now the builtin format() function is another issue. In 2.6 the return type does depend on the types of the arguments. In 3.x, I'd suggest leaving it as unicode and you won't be allowed to pass in bytes.
There are many binary formats that use ASCII numbers.
'HTTP chunking' uses ASCII mixed with binary (octets).
With 2.6 you could write:
    def chunk(block):
        return b'{0:x}\r\n{1}\r\n'.format(len(block), block)
With 3.0 you'd have to write this:
    def chunk(block):
        return (format(len(block), 'x').encode('ascii') + b'\r\n' + block +
                b'\r\n')
You cannot convert to ascii at the end of the pipeline as there are bytes > 127 in the data blocks.
> def chunk(block): return format(len(block), 'x').encode('ascii') + b'\r\n' + block + b'\r\n'
> You cannot convert to ascii at the end of the pipeline as there are bytes > 127 in the data blocks.
I wouldn't write it in such a complicated way. Instead, use
    def chunk(block):
        return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'
This doesn't need any format call, and describes adequately how the protocol works: send an ASCII-encoded hex length, send CRLF, send the block, then send another CRLF. Of course, I would probably write that into the socket right away, rather than copying it into a different bytes object first.
> def chunk(block): return hex(len(block)).encode('ascii') + b'\r\n' + block + b'\r\n'
hex(10) returns '0xa' instead of 'a'.
> This doesn't need any format call, and describes adequately how the protocol works: send an ASCII-encoded hex length, send CRLF, send the block, then send another CRLF. Of course, I would probably write that into the socket right away, rather than copying it into a different bytes object first.
The point is that you need to convert to ascii for each int that you send. You cannot just wrap the socket with an encoding. This makes porting difficult.
> hex(10) returns '0xa' instead of 'a'.
Ah, right. So I would still use
'{0:x}'.format(100).encode("ascii")
rather than the format builtin format function. Actually, I would probably use
('%x' % len(block)).encode("ascii")
> The point is that you need to convert to ascii for each int that you send. You cannot just wrap the socket with an encoding. This makes porting difficult.
This I don't understand. What porting becomes more difficult? From 2.x to 3.x? Why do you have any .format calls in your code that you want to port - .format was only added in 2.6, so if you want to support 2.x, you surely are not using .format, are you?
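The chunk framing discussed in the messages above can be sketched as follows. This is only an illustration of the format-then-encode approach for the length field. It shows why `hex()` is the wrong tool here (it adds a `0x` prefix) and that the block itself may contain bytes above 127.

```python
def chunk(block):
    # ASCII hex length without the "0x" prefix, CRLF, the raw block, CRLF.
    return ("%x" % len(block)).encode("ascii") + b"\r\n" + block + b"\r\n"


# The block is binary data and is never passed through an ASCII codec.
framed = chunk(b"\xff" * 10)

# hex() keeps the "0x" prefix; the "x" format code does not.
with_prefix = hex(10)
without_prefix = format(10, "x")
```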
This kind of formatting is needed quite often when working on network protocols or file formats, and I think the replies here fail to address important issues. In general you can't encode after formatting, as that doesn't work with binary data, and often it's not appropriate for the low-level routines doing the formatting to know what charset the data is in even if it is text (so it should be fed in already encoded as bytes). The replies from Martin v. Löwis seem to argue that you could use methods other than formatting; that would work almost as well as an argument to remove formatting support from text strings, and IMO cases where formatting is the best option are common.
Here's an example (based on real use but simplified):
    template = b"""
    stuff here
    header1: {}
    header2: {}
    more stuff
    """
    def lowlevel_send(s, b1, b2):  # s socket, b1 and b2 bytes
        s.send(template.format(b1, b2))
To clarify the requirements a bit, the issue is not so much about having a .format method on byte string objects (that's just the most natural-looking way of solving it); the core requirement is to have a formatting operator that can take byte strings as *arguments* and produce byte string *output* where the arguments can be placed unchanged.
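To make that requirement concrete, here is a rough sketch of such an operator written as a plain function. The name `bformat` is hypothetical and this is not a proposed API: it only handles bare `{}` fields, has no escaping for a literal `{}`, and does no type conversion.

```python
def bformat(template, *args):
    """Hypothetical helper: fill each b'{}' in template with bytes, unchanged."""
    pieces = template.split(b"{}")
    if len(pieces) != len(args) + 1:
        raise ValueError("number of b'{}' fields does not match the arguments")
    out = bytearray(pieces[0])
    for arg, literal in zip(args, pieces[1:]):
        out += arg      # argument bytes are copied as-is, with no encoding step
        out += literal  # followed by the literal text up to the next field
    return bytes(out)


template = b"header1: {}\r\nheader2: {}\r\n"
message = bformat(template, b"\xc3\xa9", b"\xff\x00")
```

Because the arguments are appended to a `bytearray`, passing a str raises `TypeError` rather than being encoded implicitly.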
For future reference, struct.pack, not mentioned here, is a binary bytes formatting function. It can mix ascii bytes with binary octets. It works the same in Python 2 and 3.
str.format does two things: convert objects to strings according to the contents of field specifiers; interpolate the resulting strings into a template string according to the locations of the field specifiers. If desired bytes represent encoded text, then encoding computed text is the obvious Py3 solution.
For some mixed ascii-binary uses, struct.pack is not as elegant as a bytes.format might be. But I think such a method should use struct format codes within field specifiers to convert objects into binary bytes rather than text.
struct.pack does not work with variable length data. Something like:
b'{0:x}\r\n{1}\r\n'.format(len(block), block)
or
b'%x\r\n%s\r\n' % (len(block), block)
is not possible with struct.pack
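For readers on newer interpreters: %-formatting for bytes was added later, in Python 3.5 (PEP 461), so the second spelling above runs as written there, while bytes still has no .format method. A small check, assuming Python 3.5 or later:

```python
block = b"\x80binary\xff"

# %x formats the int as ASCII hex digits; %s inserts the bytes unchanged.
framed = b"%x\r\n%s\r\n" % (len(block), block)

# bytes objects have no .format method.
has_format = hasattr(bytes, "format")
```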
You are right, I misinterpreted the meaning of 's' without a count (and opened bpo-11436 to clarify). However, for the fairly common case where a variable-length binary block is preceded by a 4 byte *binary* count, one can do something which is not too bad:
>>> block = b'lsfjdlksaj'
>>> n=len(block)
>>> struct.pack('I%ds'%n, n, block)
b'\n\x00\x00\x00lsfjdlksaj'
If leading blanks are acceptable for your example with count as ascii hex digits, one can do something that I admit is worse:
>>> struct.pack('10s%ds2s'%n, ('%8x\r\n'%n).encode(), block, b'\r\n')
b'       a\r\nlsfjdlksaj\r\n'
Of course, for either of these in isolation, I would probably only use .pack for the binary conversion and otherwise use '+' or b''.join(...).
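The count-prefixed case can be sketched as a pack/unpack pair. The function names are made up for the example, and the explicit `<` (little-endian, no padding) is an assumption added here so the result does not depend on the platform; the session above uses native byte order.

```python
import struct


def pack_block(block):
    # "<I" is a little-endian unsigned 32-bit count, followed by the raw bytes.
    return struct.pack("<I%ds" % len(block), len(block), block)


def unpack_block(data):
    # Read the 4-byte count first, then slice out that many bytes.
    (n,) = struct.unpack_from("<I", data, 0)
    return data[4:4 + n]


packed = pack_block(b"lsfjdlksaj")
```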
I've hit this limitation a couple more times, and none of the proposed workarounds are adequate. Working with protocols and file formats that use human-readable markup is significantly clumsier than it was with Python 2 (using either the % operator, which also lost its support for byte strings in Python 3, or .format()).
This bug report was closed by its original creator, after early posts where IMO nobody made as good a case for the feature as they could have. Is it possible to reopen this bug or is it necessary to file a new one?
Is there any clear argument AGAINST having .format() for bytes, other than work needed to implement it? Some posts mention "mixing characters and bytes", but I see no reason why this would be much of a real practical concern if it's a method on bytes objects producing bytes output.
If you want to discuss this issue further, I think you should post to the python-ideas list with concrete examples.
Since Benjamin originally requested this feature, and then decided that he could accomplish his desired goal (ftplib porting, as far as I can tell) without it, I think that the "rejected" status is actually incorrect. I think that Benjamin just wanted to indicate that he no longer needed the feature. This doesn't mean that no one else will need the feature, and as it turns out the comments seem to reveal that other people do need the feature (also, I need the feature).
So, adjusting the ticket metadata to reflect that this is a valid feature request just waiting for someone to implement it, not a rejected idea that is not welcome in Python.
The proposal sounds like a good idea to me.
Benjamin, what needs to be done to implement the feature?
Formatting is a very complicated part of Python (especially after Victor's optimizations). I think no one wants to maintain this code for a long time. The price of maintaining exceeds the potential very limited benefits from the use.
I was just logging in to make this point, but Serhiy beat me to it. When I wrote several years ago that this was "easy", it was before the (awesome) PEP-393 work. I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.
So I think Martin's advice to just encode to ascii is the best course of action.
> The price of maintaining exceeds the potential very limited benefits from the use.
The "very limited benefits" of being able to write I/O code without roughly 3 times code bloat? Perhaps for people who don't write code that does non-trivial I/O, but for the rest of us the benefits are pretty significant.
> I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.
The implementation may be difficult, therefore no one should attempt it?
> The implementation may be difficult, therefore no one should attempt it?
The development cost and maintenance cost is surely part of the evaluation when deciding whether to implement a feature, no?
> The development cost and maintenance cost is surely part of the evaluation when deciding whether to implement a feature, no?
Sure, but in an open source project where almost all contributions are done by volunteers (ie, donated), what is the development cost?
> I suspect, but have not verified, that having a bytes version of this code would now require an implementation that shared very little with the str version.
This is not all. The usage model will be completely different too.
As a result, this should be a completely separate formatting mini-language that shares nothing with string formatting. It is not worth introducing bytes.format(); it would just be confusing. Perhaps you should add features to the struct module or add a new module. PyPI looks like a good place for such experiments. If people use it, it could be included in the stdlib.
As Serhiy suggests, it would be best to collect the use cases for a format-like method for bytes and design something which can meet them. It's definitely a PEP.
In 3.3+, somestring.encode('ascii') is a small constant-time operation. So for pure ascii *text* bytes, that seems the appropriate 3.x approach.
I agree that something else should be used for binary formatting. Perhaps struct.pack could be extended to work with variable-length data the way I thought it already did. Otherwise, it already *is* the binary formatting method.
It's not constant time.
Sorry, I was thinking of something else. Encoding ascii-only text is merely much faster (3x?) than in 3.2- because it directly copies without using the codec.
> Sorry, I was thinking of something else. Encoding ascii-only text is merely much faster (3x?) than in 3.2- because it directly copies without using the codec.
In 3.3, encoding to ascii or latin1 is as fast as memcpy. 12-15x on my computer.
Twisted still would like to see this.
Implementing this certainly hasn't gotten any easier as 3.x str.format has evolved. The kind of format codes and modifiers wanted for formatting byte strings might be different than those for text strings. I think it probably needs a PEP.
Would it be easier if the only format codes/types supported were bytes, int and float?
IMHO a useful API has to provide a more low level functionality like "format number as 32 bit unsigned integer in network endian". A bytes.format() function should support all format chars from http://docs.python.org/3/library/struct.html#format-characters plus all endian and alignment modifiers.
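For reference, the low-level operation mentioned here ("format number as 32 bit unsigned integer in network endian") is already expressible with the struct module or `int.to_bytes`. The value below is arbitrary and chosen only for illustration.

```python
import struct

value = 3982

# "!" selects network (big-endian) byte order, "I" an unsigned 32-bit integer.
packed = struct.pack("!I", value)

# int.to_bytes produces the same four bytes.
same = value.to_bytes(4, byteorder="big")

(roundtrip,) = struct.unpack("!I", packed)
```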
The problem is not so much the types allowed as the code for dealing with the format string. The parsing code for format specifiers is pretty unicode-specific now. If that were to be made generic again, it's worth considering exactly what features belong in a bytes format method.
Honestly, what Twisted is mostly after is a way to write code that works both with Python 2 and Python 3. They need the types I mentioned only (bytes, int, float) and not too many advanced features of .format() -- but if it's not called .format() or if the syntax is not a subset of the syntax of Python 2 format syntax, it's not very useful for them. (They would have to rewrite every protocol implementation in their tree to use something different, apparently, since .format() has proven to be the most efficient way to construct larger byte strings out of smaller pieces, in Python 2.)
Given the issues which have been brought here, I agree that it's PEP material.
Serhiy did a nice summary in msg171804, and I think this is PEP material too. What he wrote could be used as a starting point; the next step would be collecting use cases (the Twisted guys seem to have some). Once we have defined what we want we can figure out how to implement it (e.g. how much code can be shared with str.format, if it should be bytes.format or something in the struct module).
Well, msg171804 makes it a much bigger project than the feature that Twisted actually needs. Quoting:
> The default formatting should not use str(), but buffer protocol.
Fine.
> There is no place for floating point.
Actually they do need it -- and it's trivial to define, since fp only returns ASCII characters.
> There is no place for locale.
Agreed.
> There is no place for 'r' conversion (possible only for 'a').
Agreed.
> It should include the features of struct.pack(), int.to_bytes() and ctypes.
Not needed.
> Padding should be not only by space, but also by zeros (and possibly by other values).
Not needed.
> Alignment (padding to position divisible by some number).
Not needed.
> In addition to padding and truncating should be the ability to raise an exception in case of discrepancy between the needed and actual lengths.
Not needed.
> It unlikely needed attribute access and indexing.
I don't know, but these features certainly would be well-defined.
> Builtin format() should not work with this.
Fine.
Probably bytes.format() should not try to call v.__format__(); if an extension mechanism is needed it would be called something else, but given the limited set of types needed I think this can be skipped.
The most important requirement from Twisted is actually that it is called .format(), and that the overall format strings look like they did for 8-bit string formatting in Python 2. In particular b'a{}b{}c'.format(x, y), where x and y are bytes, should be equivalent to b'a' + x + b'b' + y + b'c'.
Right, but we're not writing builtin type methods specifically for Twisted. I agree with the idea that the feature set should be very limited, actually perhaps more limited than what you just said. For example, I think any kind of implicit str->bytes conversion is a no-no (including the "r" and "a" format codes).
Still, IMO even a simple feature set warrants a PEP, because we want to devise something that's generally useful, not just something which makes porting easier for Twisted.
I also kind of expect Twisted to have worked around the issue before 3.4 is out, anyway.
On Jan 22, 2013, at 11:39 AM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou added the comment:
> I agree with the idea that the feature set should be very limited, actually perhaps more limited than what you just said. For example, I think any kind of implicit str->bytes conversion is a no-no (including the "r" and "a" format codes).
Twisted doesn't particularly need str->bytes conversion in this step, implicit or otherwise, so I have no problem with leaving that out.
> Still, IMO even a simple feature set warrants a PEP, because we want to devise something that's generally useful, not just something which makes porting easier for Twisted.
Would it really be so bad to add features that would make porting Twisted easier? Even if you want porting Twisted to be as hard as possible, there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes. Twisted itself is a good proxy for this class of application; I really don't think that this is overly specific.
> I also kind of expect Twisted to have worked around the issue before 3.4 is out, anyway.
The problem is impossible to work around in the general case. While we can come up with clever workarounds for things internal to buffering implementations or our own protocols, Twisted exposes an API that allows third parties to write protocol implementations, which quite a few people do. Every one of those implementations (and every one of Twisted's internal implementations, none of which are ported yet, just the core) faces a series of frustrating implementation choices where the "old" style of b'x' % y or b'x'.format(y) resulted in readable, efficient value interpolation into protocol messages, but the "new" style of b''.join([b'x1', y_to_bytes(y), b'x2']) requires custom functions, inefficient copying, redundant bytes<->text transcoding, and harder-to-read protocol framing literals. This interacts even more poorly with oddities like bytes(int) returning zeroes now, so there's not even a reasonable 2<->3 compatible way of, say, setting an HTTP content-length header; b'Content-length: {}\r\n'.format(length) is now b''.join([b'Content-length: ', (bytes if bytes is str else str)(length).encode('ascii'), b'\r\n']).
This has negative readability, performance, and convenience implications for the code running on both 2.x and 3.x and it would be really nice to see fixed. Honestly, it would still be a porting burden to have to use .format(); if you were going to do something *specifically* to help Twisted, the thing to do would be to make both .format and .__mod__ work; most of our protocol code currently uses % to do its formatting. However, upgrading to a "modern" API is not an insurmountable burden for Twisted, and I can understand the desire to trade off that work for the simplicity of having less code to maintain in Python core (and less to write for this feature), as long as the "modern" API is actually functional enough to make very common operations close to equivalently convenient.
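The Content-Length example from this message can be sketched for Python 3 only as follows. The function name is hypothetical, and the 2.x compatibility branch from the message is deliberately left out. The last line shows the `bytes(int)` behaviour the message refers to.

```python
def content_length_header(length):
    # str(int) gives the decimal digits; encode them explicitly to ASCII bytes.
    return b"".join([b"Content-length: ", str(length).encode("ascii"), b"\r\n"])


header = content_length_header(42)

# In Python 3, bytes(n) creates n NUL bytes rather than the digits of n.
zeros = bytes(3)
```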
> there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes.
The fact that "there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes" is *precisely* a good reason for this to be discussed more visibly. Even if it isn't a PEP, it will still benefit from being a python-dev or python-ideas discussion. We are talking about a method on a prominent built-in type, not some additional function or method in an obscure module.
> > I also kind of expect Twisted to have worked around the issue before 3.4 is out, anyway.
> The problem is impossible to work around in the general case.
I'm not sure what the "general case" is. What I know from Twisted is there are many specific cases where, indeed, binary protocol strings are formed by string formatting, e.g. in the FTP implementation (and for good reason since those protocols are either ASCII or an ASCII superset). As a workaround, it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.
> This has negative readability, performance, and convenience implications for the code running on both 2.x and 3.x and it would be really nice to see fixed.
Code running on both 2.x and 3.x will *by construction* have some performance pessimizations inside it. It is inherent to that strategy. Not saying this is necessarily a problem, but you should be aware of it.
> Honestly, it would still be a porting burden to have to use .format(); if you were going to do something *specifically* to help Twisted, the thing to do would be to make both .format and .__mod__ work; most of our protocol code currently uses % to do its formatting.
I know that :-)
2013/1/22 Guido van Rossum <report@bugs.python.org>:
> Twisted still would like to see this.
Sorry, but this argument doesn't convince me. A better argument is that bytes+bytes+...+bytes is inefficient: it creates a lot of temporary objects instead of computing the final size directly, or using realloc.
str%args and str.format() use realloc() and overallocate their internal buffer to avoid too many calls to realloc().
On Jan 22, 2013, at 1:46 PM, STINNER Victor <report@bugs.python.org> wrote:
> 2013/1/22 Guido van Rossum <report@bugs.python.org>:
> > Twisted still would like to see this.
> Sorry, but this argument doesn't convince me. A better argument is that bytes+bytes+...+bytes is inefficient: it creates a lot of temporary objects instead of computing the final size directly, or using realloc.
Uh, yes. That's one of the reasons (given above) that Twisted would still like to see this. It seemed to me that Guido was stating a fact there, not making an argument. The Twisted project *would* like to see this, I can assure you, regardless of whether you're convinced or not :).
> str%args and str.format() use realloc() and overallocate their internal buffer to avoid too many calls to realloc().
More importantly, it's fairly easy to add many optimizations of this type to an API in the style of .format(), even if it's not present in the first round; optimizing bytes + bytes + bytes requires slightly scary interactions with refcounting and potentially GC, like the += optimization. The API just has more information to go on, and that's a good thing.
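The allocation concern can be illustrated with three ways of assembling the same message. This is a sketch of the shapes of code being compared, not a benchmark. The comments describe CPython's usual behaviour as I understand it, and the function names are made up for the example.

```python
def build_with_plus(parts):
    out = b""
    for part in parts:
        out = out + part  # each step creates a new, larger bytes object
    return out


def build_with_join(parts):
    # join can size the result up front and copy each part once.
    return b"".join(parts)


def build_with_bytearray(parts):
    buf = bytearray()
    for part in parts:
        buf += part  # in-place append to a growable buffer
    return bytes(buf)


parts = [b"HTTP/1.1 ", b"200", b" ", b"OK", b"\r\n"]
```

All three produce the same bytes; a single formatting call has the same advantage as `join`, in that the whole result is known to one operation.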
> it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.
I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:
    from timeit import timeit
    for size in (0, 100, 1000, 10000, 100000):
        a = 'a' * size  # ascii-only text of the given length
        print(timeit("a.encode()", "from __main__ import a"))
0.19305401378265558
0.22193721412302575
0.2783227054755883
0.677596406192696
7.124387897799184
Given n = 1000000, these should be microseconds per encoding. Of note: the copying of bytes does not double the total time until there are a few thousand chars. Would protocols be using .format for much more than this?
[If speed is really an issue, we could make binary file/socket write methods unicode implementation aware. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['interpreter-core', 'type-feature']
title = 'support .format for bytes'
updated_at =
user = 'https://github.com/benjaminp'
```
bugs.python.org fields:
```python
activity =
actor = 'ncoghlan'
assignee = 'none'
closed = True
closed_date =
closer = 'ncoghlan'
components = ['Interpreter Core']
creation =
creator = 'benjamin.peterson'
dependencies = []
files = ['32009']
hgrepos = []
issue_num = 3982
keywords = []
message_count = 95.0
messages = ['73931', '73935', '73936', '73937', '73938', '73939', '74019', '74021', '74022', '74050', '84121', '84123', '90421', '90423', '90425', '90428', '127210', '130215', '130253', '130284', '163369', '163379', '171791', '171795', '171796', '171799', '171800', '171801', '171803', '171804', '171806', '171815', '171816', '171821', '171824', '180414', '180415', '180416', '180419', '180420', '180423', '180426', '180427', '180430', '180431', '180432', '180433', '180436', '180437', '180439', '180441', '180442', '180445', '180446', '180447', '180448', '180449', '180452', '180453', '180454', '180466', '180489', '180490', '180491', '180492', '180493', '180500', '198112', '199181', '199199', '199203', '199204', '199206', '199207', '199251', '199253', '199254', '199258', '199260', '199264', '199265', '199266', '199267', '199268', '199270', '199271', '199432', '199438', '223976', '223979', '224022', '224023', '266568', '268157', '268160']
nosy_count = 26.0
nosy_names = ['loewis', 'barry', 'brett.cannon', 'terry.reedy', 'gregory.p.smith', 'exarkun', 'ncoghlan', 'pitrou', 'vstinner', 'eric.smith', 'christian.heimes', 'benjamin.peterson', 'glyph', 'ezio.melotti', 'durin42', 'Arfrever', 'arjennienhuis', 'flox', 'ecir.hana', 'uau', 'tshepang', 'underrun', 'martin.panter', 'serhiy.storchaka', 'nlevitt@gmail.com', 'stendec']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue3982'
versions = ['Python 3.5']
```