support .format for bytes

benjaminp commented 16 years ago

BPO	3982
Nosy	@loewis, @warsaw, @brettcannon, @terryjreedy, @gpshead, @ncoghlan, @pitrou, @vstinner, @ericvsmith, @tiran, @benjaminp, @glyph, @ezio-melotti, @florentx, @vadmium, @serhiy-storchaka
Files	byte_format.py: Imitate str.format with bytes function

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', 'type-feature'] title = 'support .format for bytes' updated_at = user = 'https://github.com/benjaminp' ``` bugs.python.org fields: ```python activity = actor = 'ncoghlan' assignee = 'none' closed = True closed_date = closer = 'ncoghlan' components = ['Interpreter Core'] creation = creator = 'benjamin.peterson' dependencies = [] files = ['32009'] hgrepos = [] issue_num = 3982 keywords = [] message_count = 95.0 messages = ['73931', '73935', '73936', '73937', '73938', '73939', '74019', '74021', '74022', '74050', '84121', '84123', '90421', '90423', '90425', '90428', '127210', '130215', '130253', '130284', '163369', '163379', '171791', '171795', '171796', '171799', '171800', '171801', '171803', '171804', '171806', '171815', '171816', '171821', '171824', '180414', '180415', '180416', '180419', '180420', '180423', '180426', '180427', '180430', '180431', '180432', '180433', '180436', '180437', '180439', '180441', '180442', '180445', '180446', '180447', '180448', '180449', '180452', '180453', '180454', '180466', '180489', '180490', '180491', '180492', '180493', '180500', '198112', '199181', '199199', '199203', '199204', '199206', '199207', '199251', '199253', '199254', '199258', '199260', '199264', '199265', '199266', '199267', '199268', '199270', '199271', '199432', '199438', '223976', '223979', '224022', '224023', '266568', '268157', '268160'] nosy_count = 26.0 nosy_names = ['loewis', 'barry', 'brett.cannon', 'terry.reedy', 'gregory.p.smith', 'exarkun', 'ncoghlan', 'pitrou', 'vstinner', 'eric.smith', 'christian.heimes', 'benjamin.peterson', 'glyph', 'ezio.melotti', 'durin42', 'Arfrever', 'arjennienhuis', 'flox', 'ecir.hana', 'uau', 'tshepang', 'underrun', 'martin.panter', 'serhiy.storchaka', 'nlevitt@gmail.com', 'stendec'] pr_nums = [] priority = 'normal' resolution = 'wont fix' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 
'https://bugs.python.org/issue3982' versions = ['Python 3.5'] ```

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

Antoine Pitrou added the comment: The fact that "there are plenty of other Python applications that don't use Twisted which nevertheless need to emit formatted sequences of bytes" is *precisely* a good reason for this to be discussed more visibly.

I don't think anyone is opposing discussing it. I don't personally think such a discussion would be useful, lots of points of view are represented on this ticket, but please feel free to raise it in whatever forum that you feel would be helpful. (Even if I did object to that I don't see how I could stop you :)).

I'm not sure what the "general case" is.

The "general case" that I'm referring to is the case of an application writing some protocol logic in terms of constructing some bytes objects and passing them to Twisted. In other words, Twisted relied upon Python to provide a convenient way to assemble your bytes into protocol messages, and that was removed in 3.x. We never provided one ourselves and I don't think it would be a particularly good idea to build that kind of basic string-manipulation functionality into Twisted rather than Python.

What I know from Twisted is there are many specific cases where, indeed, binary protocol strings are formed by string formatting, e.g. in the FTP implementation (and for good reason since those protocols are either ASCII or an ASCII superset).

These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor are they an "ASCII superset"); they are ASCII commands interspersed with binary data. It makes sense to treat them as bytes, not text. In many cases - such as when expressing a length, or a checksum - you _must_ treat them as bytes, or you will emit incorrect data on the wire. By the time you're dealing with text - if you ever are - you're already somewhere in the body of the protocol, decorated with appropriate metadata.
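A small illustration of the length and checksum point, as a sketch only: the sample text and variable names are made up for this example. The number of characters in a piece of text and the number of bytes it occupies on the wire differ as soon as the text is not pure ASCII, and a checksum is only defined over the bytes.

```python
import zlib

body_text = "na\u00efve caf\u00e9"      # hypothetical payload: 10 characters
body = body_text.encode("utf-8")        # the bytes that actually go on the wire

char_count = len(body_text)   # counts code points
byte_count = len(body)        # counts bytes; this is what a length header must carry
checksum = zlib.crc32(body)   # crc32 accepts bytes only, not str

print(char_count, byte_count, checksum)
```

Using `char_count` in a length header here would under-report the payload by two bytes.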

But my point about the "general case" is that when implementing a *new* protocol with ASCII commands, or maintaining an existing one, bytes-object formatting is a convenient, expressive and performant way to express the interpolation of values in the protocol stream.

As a workaround, it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.

Protocols like SMTP (c.f. "8-bit MIME") and HTTP put binary data in-line; do you suggest that gzipped content be encoded as latin1 so it can squeeze into python 3's str type? I thought the whole point of the porting pain here was to get a clean separation between bytes and text. This is exactly why I do not particularly want bytes.format() to allow the presence of strs as formatted values, although that *would* make porting certain things easier. It makes sense to do your encoding first, then interpolate.

Code running on both 2.x and 3.x will *by construction* have some performance pessimizations inside it. It is inherent to that strategy. Not saying this is necessarily a problem, but you should be aware of it.

This is certainly true *now*, but it doesn't necessarily have to be. Enhancements like this one could make this performance division go away. In any case, the reason that ported code suffers from a performance penalty is because python 3 has no efficient way of doing this type of bytes construction; even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Jan 22, 2013, at 3:34 PM, Terry J. Reedy <report@bugs.python.org> wrote:

I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-. Some actual numbers:

I'm glad that this operation has been optimized, but treating blocks of protocol data as text is a hackish workaround that still doesn't perform as well (even on 3.3+) as bytes formatting in 2.7.

[If speed is really an issue, we could make binary file/socket write methods unicode implementation aware. They could directly access the ascii (or latin-1) bytes in a unicode object, just as they do with a bytes object, and the extra copy could be skipped.]

Yes, speed is really an issue - this kind of message construction is on the critical path of many of the more popular protocols implemented with Twisted. But trying to work around the performance issue by pretending that strings are bytes will just give new life to old bugs. We've been loudly rejecting unicode from sockets I think for as long as Python has had unicode, and that's the way it should remain.

pitrou commented 11 years ago

Le mardi 22 janvier 2013 à 23:34 +0000, Terry J. Reedy a écrit :

Terry J. Reedy added the comment:

>it would probably be reasonable to make these protocols use str objects at the heart, and only convert to bytes after the formatting is done.

I presume this would mean adding 'if py3: out = out.encode()' after the formatting. As I said before, this works much better in 3.3+ than in 3.2-.

So what? We're discussing a feature that, at best, will be present in 3.4 and not before.

pitrou commented 11 years ago

> What I know from Twisted is there are many specific cases where, indeed, binary protocol strings are formed by string formatting, e.g. in the FTP implementation (and for good reason since those protocols are either ASCII or an ASCII superset).

These protocols (SMTP, SIP, HTTP, IMAP, POP, FTP), are not ASCII (nor are they an "ASCII superset"); they are ASCII commands interspersed with binary data.

The "ASCII superset commands" part is clearly separated from the "binary data" part. Your own LineReceiver is able to switch between "raw mode" and "line mode"; one is text and the other is binary.

In many cases - such as when expressing a length, or a checksum - you _must_ treat them as bytes, or you will emit incorrect data on the wire.

This is a non-sequitur. You can fully well take the len() of some *binary* data, format it using "%d" in a *string* Content-Length header, then encode the headers using utf-8 (or whatever encoding scheme the protocol mandates). Then at the end you concatenate the encoded headers and the body. I'm sure you're already doing the moral equivalent of this, except that the encoding step is absent.

So, yes, it is reasonably possible, and it even makes sense.
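The approach described above can be sketched as follows. This is a minimal illustration, not code from Twisted: `build_response` and the header set are hypothetical, and ASCII is assumed as the header encoding.

```python
def build_response(body, content_type="application/octet-stream"):
    """Format the headers as text, encode them once, then append the binary body."""
    headers = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: %s\r\n"
        "Content-Length: %d\r\n"
        "\r\n"
    ) % (content_type, len(body))   # len() of the bytes object, not of any text
    # Only the headers are encoded; the body is never decoded or re-encoded.
    return headers.encode("ascii") + body


payload = bytes([0x1F, 0x8B, 0x08, 0x00, 0xFF])   # arbitrary non-text bytes
message = build_response(payload)
```

The body stays opaque throughout; the only text-to-bytes step is the single `encode` call on the header block.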

This is exactly why I do not particularly want bytes.format() to allow the presence of strs as formatted values, although that *would* make porting certain things easier.

At this point, I would remind you that I'm not against bytes.format(), but I'd like it to be discussed in the open rather than on the bug tracker.

And, yes, starting that discussion is, IMO, the proponents' job :-)

even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.

How can existing constructions be slower than non-existing constructions that don't have performance numbers at all?

Besides, if b''.join() is too slow, it deserves to be improved. Or perhaps you should try bytearray instead, or even io.BytesIO.

terryjreedy commented 11 years ago

After re-reading everything, I have somewhat changed my mind on this proposal. Perhaps 3.0 threw out too much, making it overly difficult to do some things that were easy in 2.x and to write cross-version code.

String formatting converts all arguments to strings, using str as the default converter, but gives particular attention to formatting ints and floats. It then interpolates the resulting strings into the template string. Until msg180430, posted just half a day ago, I did not see a coherent idea of what bytes.format should be. The main problem is that there is no general bytes converter equivalent to str. I believe this is the core reason bytes.format was eliminated in 3.0.

Much of the discussion here and elsewhere has been about str.format + additions, where the additions would accommodate various possible conversions. But I now see that this was trying to do too much. Guido's subset proposal cuts this all out by proposing to only convert ints and floats as done in 2.x. So bytes.format would only convert ints and floats and otherwise would interpolate bytes into a bytes template. This should cover a large fraction of use cases. The user would be responsible for converting anything else, or converting ints and floats otherwise, with explicit calls to bytes, str.encode, struct.pack, or custom functions*.

I believe only two changes are needed to the specification of str.format, other than the obvious things like prefixing strings with 'b' and changing 'fill character' to 'fill byte'. Since general conversion would not be done, the '! conversion' field would be eliminated. In the format specifier, the default 's' would mean that the corresponding argument must be a bytes object, rather than any object converted by str.

# possible portability function for 'other' classes:

import sys
py2 = sys.version_info[0] == 2  # assumed definition; the original left py2 undefined
if py2: strb = str
else:
  def strb(ob): return str(ob).encode()
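To make the "user converts everything explicitly" idea concrete, here is a rough sketch of assembling a message on Python 3 using the kinds of conversions listed above. The message layout is invented for illustration, and `strb` is repeated (Python 3 branch only) so the example is self-contained.

```python
import struct

def strb(ob):
    # Python 3 branch of the portability helper: text form of the object, encoded.
    return str(ob).encode()

fields = [
    b"LEN ",
    strb(1024),                      # int rendered as ASCII decimal digits
    b" NAME ",
    "caf\u00e9".encode("utf-8"),     # text encoded explicitly by the caller
    b" RAW ",
    struct.pack("<H", 1024),         # same int as two little-endian binary bytes
    b"\r\n",
]
message = b"".join(fields)
```

Each value reaches the join already as bytes, so nothing is converted implicitly.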

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 11 years ago

I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)

gvanrossum commented 11 years ago

I don't believe it either. I find join consistently faster than format:

python2.7 -m timeit -s 'x = [b"x"*1000]*10' 'b"".join(x)'
1000000 loops, best of 3: 0.686 usec per loop

python2.7 -m timeit -s 'x = b"x"*1000' '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.37 usec per loop

Try longer strings, same results (though less pronounced):

python2.7 -m timeit -s 'x = [b"x"*10000]*10' 'b"".join(x)'
100000 loops, best of 3: 3.54 usec per loop

python2.7 -m timeit -s 'x = b"x"*10000' '(b"{}{}{}{}{}{}{}{}{}{}").format(x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 7.35 usec per loop

I'm guessing the advantage of format() is that it allows the occasional formatting of a float or int.

And % is not significantly faster:

python2.7 -m timeit -s 'x = b"x"*1000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 2.31 usec per loop

python2.7 -m timeit -s 'x = b"x"*10000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
100000 loops, best of 3: 6.81 usec per loop

python2.7 -m timeit -s 'x = b"x"*100000' '(b"%s%s%s%s%s%s%s%s%s%s") % (x, x, x, x, x, x, x, x, x, x)'
1000 loops, best of 3: 565 usec per loop

ericvsmith commented 11 years ago

I think ''.join() will always be faster than ''.format(), for a number of reasons (some already stated):

it doesn't have to parse the format string
it doesn't have to do the __format__ lookup and call the resulting function (although I believe there's an optimization for str)
it doesn't have to consider the conversion and formatting steps

Whether b''.format() would have to lookup and call __format__ remains to be seen. From what I've read, maybe baking in knowledge of bytes, float, and int would be good enough. I suspect there might be some need for datetimes, but I could be wrong.

The above said, code using b''.format() would definitely be easier to write and understand than a lot of individual field formatting followed by a .join().

pitrou commented 11 years ago

Whether b''.format() would have to lookup and call __format__ remains to be seen. From what I've read, maybe baking in knowledge of bytes, float, and int would be good enough. I suspect there might be some need for datetimes, but I could be wrong.

The __bytes__ method (and/or tp_buffer) may be a better discriminator than __format__. It would also allow combining arbitrary buffer objects without making tons of copies. What it also means is that "format()" may not be the best method name for this. It is less about formatting than about combining.

Also, it's not obvious what "formatting" a number as bytes should do. Should it mimic the bytes constructor:

>>> bytes(5)
b'\x00\x00\x00\x00\x00'

Should it mimic the int to_bytes() method:

>>> (5).to_bytes(4, 'little')
b'\x05\x00\x00\x00'

Numbers currently don't have a __bytes__ method:

>>> (5).__bytes__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '__bytes__'
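For comparison, the candidate meanings of "an int as bytes" can be put side by side. This is only an illustration of the ambiguity on Python 3; the variable names are made up.

```python
n = 5

zero_filled = bytes(n)                  # bytes constructor: n zero bytes
fixed_width = n.to_bytes(4, "little")   # binary integer, 4 bytes, little-endian
decimal_text = str(n).encode("ascii")   # ASCII decimal digits, like '%d' % n on 2.x

print(zero_filled, fixed_width, decimal_text)
```

Only the last form matches what 2.x string formatting produced for a number.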

ericvsmith commented 11 years ago

I retract the datetime comment. Given what we're trying to accomplish, I think we only need to support types that are supported by 2.7's %-formatting.

gvanrossum commented 11 years ago

Remember, the only reason to add this would be to enable writing code that works in both 2.7 and 3.4. So it has to be called .format() and it has to format numbers as decimal strings by default.

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Jan 22, 2013, at 11:27 PM, Antoine Pitrou <report@bugs.python.org> wrote:

Antoine Pitrou added the comment:

The "ASCII superset commands" part is clearly separated from the "binary data" part. Your own LineReceiver is able to switch between "raw mode" and "line mode"; one is text and the other is binary.

This is incorrect. "Lines" are just CRLF (0x0D0A) separated chunks of data. For example, SMTP is always in line-mode, but messages ("data lines") may contain arbitrary 8-bit data.

This is a non-sequitur. You can fully well (...) So, yes, it is reasonably possible, and it even makes sense.

I concede it is possible to implement what you're talking about, but it still requires encoding things which are potentially 8-bit data. Yes, there are many corners of protocols where said data looks like text, but it is an optical illusion.

> even disregarding compatibility with a 2.x codebase, b''.join() and b'' + b'' and (''.format()).encode('charmap') are all slower _and_ more awkward than simply b''.format() or b''%.

How can existing constructions be slower than non-existing constructions that don't have performance numbers at all?

Sorry, "in 2.x" :).

Besides, if b''.join() is too slow, it deserves to be improved. Or perhaps you should try bytearray instead, or even io.BytesIO.

As others have noted, b''.join is *not* slower than b''.format for simply assembling strings; b''.join is indeed faster at that and I didn't mean to say it wasn't. The performance improvement shows up when you are assembling complex messages that contain a smattering of ints, floats, and other chunks of bytes; mostly in that you can avoid a bunch of python code execution and python function calls when formatting those values. The trouble with cooking up an example of this is that it starts to involve a bunch of additional code complexity and it requires careful framing to make sure the other complexity isn't what's getting in the way. I will try to come up with one, maybe doing so will prove even this contention wrong.

But, the main issue here is expressiveness, not performance.
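One way such a comparison could be framed on Python 3 is sketched below. The message layout and both helper names are hypothetical, latin-1 stands in for the 'charmap' round trip mentioned earlier, and no timing results are claimed; the numbers depend entirely on the interpreter and machine.

```python
import timeit

def with_join(seq, ratio, payload):
    # Each number is converted by an explicit Python-level call before joining.
    return b"".join([
        b"SEQ ", str(seq).encode("ascii"),
        b" RATIO ", ("%.3f" % ratio).encode("ascii"),
        b" DATA ", payload, b"\r\n",
    ])

def with_str_format(seq, ratio, payload):
    # Format as text, then encode; latin-1 maps code points 0-255 to the same byte values.
    text = "SEQ {} RATIO {:.3f} DATA {}\r\n".format(seq, ratio, payload.decode("latin-1"))
    return text.encode("latin-1")

args = (42, 0.125, b"\x00\xffabc")
assert with_join(*args) == with_str_format(*args)   # both build identical bytes

for fn in (with_join, with_str_format):
    elapsed = timeit.timeit(lambda: fn(*args), number=100000)
    print(fn.__name__, "%.3f s per 100k calls" % elapsed)
```

Both variants show the expressiveness cost as well: the template is either scattered across a list or hidden behind a decode/encode pair.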

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Jan 22, 2013, at 11:31 PM, Martin v. Löwis <report@bugs.python.org> wrote:

I admit that it is puzzling that string interpolation is apparently the fastest way to assemble byte strings. It involves parsing the format string, so it ought to be slower than anything that merely concatenates (such as cStringIO). (I do understand why + is inefficient, as it creates temporary objects)

You're correct about this; see my previous comment.

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Jan 23, 2013, at 1:58 AM, Antoine Pitrou <report@bugs.python.org> wrote:

> Numbers currently don't have a __bytes__ method:
> 
>>>> (5).__bytes__()
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> AttributeError: 'int' object has no attribute '__bytes__'

They do have some rather odd behavior when passed to the builtin though:

>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

It would be much more convenient for me if bytes(int) returned the ASCIIfication of that int; but honestly, even an error would be better than this behavior. (If I wanted this behavior - which I never have - I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)

pitrou commented 11 years ago

They do have some rather odd behavior when passed to the builtin though:

>>> bytes(10)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

It would be much more convenient for me if bytes(int) returned the ASCIIfication of that int; but honestly, even an error would be better than this behavior. (If I wanted this behavior - which I never have - I'd rather it be a classmethod, invoked like "bytes.zeroes(n)".)

I would agree with you, but it's probably too late to change...

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Jan 23, 2013, at 11:02 AM, Antoine Pitrou <report@bugs.python.org> wrote:

I would agree with you, but it's probably too late to change...

Understandable, and, in any case, out of scope for this ticket.

ericvsmith commented 11 years ago

So it sounds like the use case is (as Glyph said in msg180432):

Provide a transition for users of 2.7's str %-formatting into a style that's compatible with both str in 2.7 and bytes in 3.4.

In that case the only options I see are to implement __mod__ or .format for bytes in 3.4. I'd of course prefer to use .format, although __mod__ would probably make the transition easier (no need to move to .format first). It would probably also make the implementation easier, since there's so much less code in str.__mod__. But let's assume we're using .format [1].

Given the restricted use case, and assuming we're using .format, the implementation would not need to support:

Types other than bytes, int, float.
Subclasses of these types with custom formatting.
!s, !r, or !a (none of the ! conversions). [2]

But it would support all of the specifiers for formatting strs (except now for bytes), floats, and ints.

I haven't looked through the str.format or {str,int,float}.__format__ code since the PEP-393 work, so I'm not really sure if we could stringlib-ify the code again, or if it would just be easier to reimplement it as separate bytes-only code.

[1] It's open for debate whether .format or .__mod__ is preferable.

[2] Since %-formatting supports %r and %s, this point is arguable.

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

I'd like to put a nudge towards supporting the __mod__ interface on bytes - for Mercurial this is the single biggest impediment to even getting our testrunner working, much less starting the porting process.

pitrou commented 11 years ago

I'd like to put a nudge towards supporting the __mod__ interface on bytes - for Mercurial this is the single biggest impediment to even getting our testrunner working, much less starting the porting process.

Given a spec hasn't been written (bytes.__mod__ can't support the same things as str.__mod__), and nobody seems to step up to write it, I'd say this is unlikely to appear in 3.4.

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.

Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.

ericvsmith commented 11 years ago

If you could write up a concrete proposal, including which format specifiers would be supported, that would be helpful.

Would it be extensible with something like __bformat__?

There's really quite a bit of work to be done to specify how this would work.

ericvsmith commented 11 years ago

Also, with the PEP-393 changes, the implementation will be much more difficult. Sharing code with str (unicode) will likely be impossible, or require much refactoring of the existing code.

pitrou commented 11 years ago

Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.

You mean "%s" and "%d"?

Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.

Or write a pure Python implementation.

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

On Tue, Oct 8, 2013 at 11:08 AM, Antoine Pitrou <report@bugs.python.org> wrote:

> Is there any chance we could just have it work for bytes, ints, and floats? That'd solve the immediate need, and it'd be obviously correct how to have those behave.

You mean "%s" and "%d"?

Basically, yes.

> Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.

Or write a pure Python implementation.

Hah. Probably too slow for anything beyond a proof of concept, no?

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Oct 8, 2013, at 8:10 AM, Augie Fackler <report@bugs.python.org> wrote:

Hah. Probably too slow for anything beyond a proof of concept, no?

It should perform acceptably on PyPy ;-).

pitrou commented 11 years ago

> > Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
>
> Or write a pure Python implementation.

Hah. Probably too slow for anything beyond a proof of concept, no?

If it's only for the Mercurial test suite, that shouldn't be a problem?

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

On Tue, Oct 8, 2013 at 5:11 PM, Antoine Pitrou <report@bugs.python.org> wrote:

Antoine Pitrou added the comment:

> > > Punting this to 3.5 basically means we'll have to either wait for 3.5, or do something awful like use cffi to grab sprintf to port Mercurial.
> >
> > Or write a pure Python implementation.
>
> Hah. Probably too slow for anything beyond a proof of concept, no?

If it's only for the Mercurial test suite, that shouldn't be a problem?

It's not just the testsuite though: we do this _all over_ hg itself. For example, status needs to do something like this:

sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes. It's even more fun for 'log', because then it's got localized strings in it as well.
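Without any formatting support on bytes, the status line above has to be assembled by concatenation on Python 3. A minimal sketch, with a hypothetical helper `status_line` that is not taken from Mercurial:

```python
def status_line(state, path):
    # Everything stays bytes; the path is never decoded, whatever its encoding.
    return b" ".join([state, path]) + b"\n"


# 0xE4 is not valid UTF-8 on its own, which is fine because nothing decodes it.
line = status_line(b"M", b"some/filesystem/p\xe4th")
```

This keeps the output byte-exact but loses the template that the %-formatted version makes visible.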

vstinner commented 11 years ago

2013/10/8 Augie Fackler <report@bugs.python.org>:

sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes.

You are doing it wrong. In Python 3, you "should" store filenames as Unicode (str type). If Python fails to decode a filename, undecodable bytes are stored as surrogate characters (see the PEP-383).

The Unicode type became natural in Python 3, as byte string (old "str" type) was natural in Python 2.

sys.stdout.write() expects a Unicode string, not a byte string.

Does it mean that Mercurial is moving to Python 3? Cool :-)

ericvsmith commented 11 years ago

I've lost track of what we were talking about. I thought we were trying to support b'<something>'.format() in 3.4, for a restricted set of arguments.

I don't see how a third-party package is going to help, if the goal is to allow 3.4 to be source compatible with 2.7. And the recent example uses %-formatting, which is not the subject of this ticket.

What proposal is actually on the table here?

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:

What proposal is actually on the table here?

Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

On Oct 8, 2013, at 5:24 PM, STINNER Victor <report@bugs.python.org> wrote:

STINNER Victor added the comment:

2013/10/8 Augie Fackler <report@bugs.python.org>:

> sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})
>
> except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes.

You are doing it wrong. In Python 3, you "should" store filenames as Unicode (str type). If Python fails to decode a filename, undecodable bytes are stored as surrogate characters (see the PEP-383).

No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way. We're not able to change either our on-disk data format OR our stdout format, even to support a newer version of Python. I don't know the encoding of the filename's bytes, but I _must_ faithfully reproduce them exactly as they are or I'll break tools like make(1) and patch(1). Similarly, if a file goes from ISO-8859-1 to UTF-8, I have to emit a diff that has some ISO bytes and some UTF bytes - it's not in *any* valid encoding. Changing that is a showstopper regression.

The Unicode type became natural in Python 3, as byte string (old "str" type) was natural in Python 2.

sys.stdout.write() expects a Unicode string, not a byte string.

Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

Does it mean that Mercurial is moving to Python 3? Cool :-)

Not likely, honestly. I tackle this when I've got some spare cycles and my ability to handle pain is high. As it stands, I have the test-runner barely working, but it's making wrong assumptions to get there. The best estimate is that it's a year of work to upgrade to Python 3.


08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

On Oct 8, 2013, at 6:19 PM, Glyph Lefkowitz <report@bugs.python.org> wrote:

Glyph Lefkowitz added the comment:

On Oct 8, 2013, at 2:35 PM, Eric V. Smith wrote:

> What proposal is actually on the table here?

Sorry Eric, you're right, there is too much discussion here. This issue ought to be about .format, like the title says. There should be a separate ticket for %-formatting, since it seems to be an almost wholly unrelated task. While I'm sympathetic to Mercurial's issues, they're somewhat different from Twisted's, in that we're willing to adopt the "one new way" to do things in order to achieve compatibility whereas that would be too hard for Mercurial.

Yeah, my bad too. I suppose I should add a new bug for %-formatting on bytes objects?

Note that for hg, we can't drop Python 2.6 or so (we'll only drop *2.4* if we can do 2.6 and some 3.x from a single source tree) for a while, due to supporting the system interpreter on a variety of LTS platforms.

terryjreedy commented 11 years ago

Augie, to understand what Victor meant, I suggest reading http://www.python.org/dev/peps/pep-0383/ One point of the PEP is to round-trip filenames without loss on all systems, which is just what you say you need.

08dd92c2-edaa-4e04-9b07-77c02e9133bf commented 11 years ago

On Oct 8, 2013, at 6:28 PM, "Terry J. Reedy" <report@bugs.python.org> wrote:

http://www.python.org/dev/peps/pep-0383/ One point of the pep is round-trip filenames without loss on all systems, which is just what you say you need.

At a quick skim, likely not good enough, because http://en.wikipedia.org/wiki/Shift_JIS isn't completely ASCII-compatible, and we've got a fair number of users on weird Shift-JIS using platforms.

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

On Oct 8, 2013, at 3:19 PM, Augie Fackler wrote:

No, I'm not. In Mercurial, all end-user data is OPAQUE BYTES, and must remain that way.

The PEP-383 technique for handling file names is completely capable of round-tripping exact bytes, given one encoding for both input and output. You can still handle file names this way internally in Mercurial and not risk disturbing any observable output. You do not need to change that in order to do what Victor suggests.
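The round trip described above can be shown in a few lines. This is a sketch assuming UTF-8 as the single encoding used in both directions; the file name is invented.

```python
# A file name as raw bytes that is not valid UTF-8 (0xE9 is 'e-acute' in ISO-8859-1).
raw_name = b"caf\xe9.txt"

# Decoding with surrogateescape maps each undecodable byte to a lone surrogate.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")

print(repr(as_text), round_tripped == raw_name)
```

The guarantee only holds when the same encoding and error handler are used for both steps.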

We should get together in some other forum and discuss file-name handling though, since you can't actually round-trip "opaque bytes" through a *filesystem* and not disturb your output.

Ouch. Is there any way to write things to stderr and stdout without decoding and hopelessly breaking user data?

You can use sys.stdout.buffer.write.

terryjreedy commented 11 years ago

Here is a proof of concept Python function, with a minimal test. It is similar to how str.format could be coded in Python, with re.split and ''.join, except that it does not allow anything before : in the format specification. By default (no format spec given), it copies bytes objects without change. If a format specification *is* given, it does not restrict the object, as this code simply uses builtin format sandwiched between decode and encode.
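For readers without the attachment, one possible reading of that description is sketched below. It is not the attached byte_format.py: the name `bytes_format`, the regular expression, and the choice to decode bytes arguments as latin-1 before applying a format spec are all assumptions made for this illustration. Doubled braces and field names are not handled.

```python
import re

# Matches "{}" or "{:spec}"; nothing is allowed before the colon.
_FIELD = re.compile(rb"\{(?::([^{}]*))?\}")

def bytes_format(template, args):
    """Interpolate args into template, one per replacement field, in order."""
    pieces = _FIELD.split(template)
    # With one capture group, split() yields: literal, spec, literal, spec, ..., literal
    out = [pieces[0]]
    values = iter(args)
    for spec, literal in zip(pieces[1::2], pieces[2::2]):
        value = next(values)
        if spec is None:
            # No format spec: the argument must already be bytes and is copied unchanged.
            if not isinstance(value, bytes):
                raise TypeError("a field without a format spec needs a bytes argument")
            out.append(value)
        else:
            # Format spec given: builtin format() sandwiched between decode and encode.
            if isinstance(value, bytes):
                value = value.decode("latin-1")
            out.append(format(value, spec.decode("ascii")).encode("latin-1"))
        out.append(literal)
    return b"".join(out)


line = bytes_format(b"{} {:>5} {:.2f}\r\n", [b"\xffLEN", 42, 1.5])
```

Decoding as latin-1 keeps every byte value intact through the text step, at the cost of an extra copy per formatted field.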

ezio-melotti commented 11 years ago

You can use sys.stdout.buffer.write.

Note that there's no guarantee that sys.stdout.buffer exists, e.g. if sys.stdout has been replaced with a StringIO.
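A defensive version of that call might look like the sketch below. `write_bytes` is a hypothetical helper, and falling back to a `surrogateescape` decode when the stream has no `buffer` attribute is an assumption about what the caller wants, not an established recipe.

```python
import io


def write_bytes(stream, data):
    """Write raw bytes to a text stream, using .buffer when it exists."""
    buffer = getattr(stream, "buffer", None)
    if buffer is not None:
        # Flush pending text first so output order is preserved.
        stream.flush()
        buffer.write(data)
        buffer.flush()
    else:
        # No binary layer (for example io.StringIO): fall back to text.
        stream.write(data.decode("utf-8", errors="surrogateescape"))


# A StringIO replacement for sys.stdout has no .buffer attribute.
fake_stdout = io.StringIO()
write_bytes(fake_stdout, b"a\xffb")

# A TextIOWrapper, like the real sys.stdout, exposes the binary layer.
raw = io.BytesIO()
wrapped = io.TextIOWrapper(raw, encoding="ascii")
write_bytes(wrapped, b"a\xffb")
print(raw.getvalue())  # b'a\xffb'
```

In real use the call would be `write_bytes(sys.stdout, data)`.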

b62e5afe-fdc6-42de-985a-faeb74e5c5a6 commented 11 years ago

Tempting as it is to reply to the comment about 'buffer' not existing, we're way off topic here. Let's please keep further comments on this bug to issues about a 'format' method on the 'bytes' object.

e3bb7a8a-6d23-40c5-b27f-44cffc49d48a commented 10 years ago

First off, +1 for this feature. It's not just for twisted, but anyone doing anything with binary data (storage, compression, encryption and networking for me) with python since 2.6 will very likely have been using .format for building messages. I know I have and obviously others have been doing so as well.

The advantages of .format to me are:

compatible with 2.6 (porting and single code base support easier)
ease of composition (the format language makes it easy to build complex data structures out of bytes)
readability (named fields make complex formats obvious)
consistency (manipulating a block of bytes or characters can be done in a similar way)

Specific comments on the patch supplied by terry.reedy:

it doesn't support named fields
it doesn't handle padding
it doesn't handle nested formats (like '{0:{1}>{2}}'.format(data,pad_char,pad_width))
formatting byte strings with a width embeds the repr of the byte string ( bf(b'{:>10}', [b'test']) == b"   b'test'" )
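For reference, this is what the missing features look like with `str.format` on text, plus a reconstruction of where the embedded repr comes from. The values are made up for illustration, and the last line only mimics the decode/format/encode path with `str()`; it does not call the patch itself.

```python
# Nested fields: the fill character and width come from other arguments.
data, pad_char, pad_width = "ab", "*", 6
padded = "{0:{1}>{2}}".format(data, pad_char, pad_width)
print(padded)  # ****ab

# Named fields make the layout of a message readable.
named = "{tag}:{size:04d}".format(tag="hdr", size=12)
print(named)  # hdr:0012

# str() of a bytes object is its repr, so padding that text to width 10
# yields three spaces followed by the seven characters b'test'.
embedded = format(str(b"test"), ">10")
print(embedded == "   b'test'")  # True
```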

Really this isn't a good way to solve the problem.

Has a PEP been created for this? If not how can I help make that happen?

Including this in 3.5 would be so helpful for us low level systems programmers out there who have lots of code using .format for binary interfaces in python 2.6/2.7 already.

Also, not to add to derailment, but if we're adding a .format for python3 bytes it would be great if .format could pad with the null byte ('\0') which it currently converts to spaces internally (which is strange). Since this unexpected conversion is bad (so padding with null doesn't happen in python2) it's more like a bug fix... actually - maybe that's a separate bug to file on the current .format for text...
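For comparison, null-byte padding is already available on bytes without `.format`, as the sketch below shows with `bytes.ljust`, `bytes.rjust` and the `struct` module. The field width of 6 is an arbitrary value chosen for illustration.

```python
import struct

name = b"ab"

# ljust/rjust on bytes accept a one-byte fill argument, including NUL.
left = name.ljust(6, b"\0")
right = name.rjust(6, b"\0")

# The struct "s" format pads short values with NUL bytes on the right.
packed = struct.pack("6s", name)

print(left)    # b'ab\x00\x00\x00\x00'
print(right)   # b'\x00\x00\x00\x00ab'
print(packed)  # b'ab\x00\x00\x00\x00'
```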

e3bb7a8a-6d23-40c5-b27f-44cffc49d48a commented 10 years ago

sorry, terry's patch does handle padding - just with the caveats i listed later. i should have removed that bullet.

terryjreedy commented 10 years ago

http://legacy.python.org/dev/peps/pep-0461/ adds % formatting for bytes and bytearray.
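A short sketch of what PEP 461 formatting looks like in practice; it needs Python 3.5 or later, and the header name and numbers are made up for illustration.

```python
# %s (an alias for %b) inserts bytes; %d and %x render numbers as ASCII.
header = b"%s: %d\r\n" % (b"Content-Length", 42)
hexed = b"%04x" % 255

print(header)  # b'Content-Length: 42\r\n'
print(hexed)   # b'00ff'

# Text is not encoded implicitly: passing a str to %s raises TypeError.
try:
    b"%s" % "text"
    str_rejected = False
except TypeError:
    str_rejected = True
print(str_rejected)  # True
```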

Nick, I have the impression that there was a decision to not add bytes.format. Correct? If so, this issue should be closed. If not, what, if anything, has been decided?

ncoghlan commented 10 years ago

Right, bytes.format was considered as part of the PEP-461 discussions, and rejected as an operation that only made sense in the text domain: http://www.python.org/dev/peps/pep-0461/#proposed-variations

With PEP-461 accepted, and PEP-460 withdrawn, that means we won't be adding bytes.format and bytearray.format.

bpo-20284 covers the implementation of PEP-461.

gpshead commented 8 years ago

This came up in the language summit today when discussing twisted. .format() is still not supported on bytes though % is in 3.5.

realistically it sounded like twisted needs to support python 3.4 for many years so they can't rely on bytes having a .format() method that also works on 2.7 anyways... but assuming .format() is only useful for text may still have been an oversight. (i'll have to go re-read PEP-460 and 461 and discussion before commenting further)

e3bb7a8a-6d23-40c5-b27f-44cffc49d48a commented 8 years ago

Gregory - I'm glad that you're willing to consider this again. It still is a constant issue for me, and .format with variable width fields in binary protocols is so the right tool for the job. If there is anything I can do to help get this added to 3.6 let me know. The forward/backward compatibility issue is secondary to me to the flexibility gained from having .format available for bytes.

Also padding with null bytes that don't get converted would be awesome.

ncoghlan commented 8 years ago

The core problem with the idea of adding bytes.format to Python 3 is that the real power of str.format actually lies in the extensible __format__ protocol and the associated format() builtin, as those rely heavily on text-specific assumptions.
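To illustrate the protocol in question: `str.format` hands each field's spec to the object's `__format__` method, which returns text. The `Celsius` class below is hypothetical; `date.__format__` is the standard library's own use of the same hook.

```python
from datetime import date


class Celsius:
    """Hypothetical type that defines its own format mini-language."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Delegate the numeric part to float formatting, then add a unit.
        return format(self.degrees, spec) + " C"


# str.format passes the text after the colon to __format__ unchanged.
reading = "{:.1f}".format(Celsius(21.456))
print(reading)  # 21.5 C

# date interprets its spec as an strftime pattern.
stamp = "{:%Y-%m}".format(date(2016, 6, 10))
print(stamp)  # 2016-06
```

In both cases the result is a `str`, which is where the text-specific assumptions come in.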

I interpreted Amber's comments at the language summit as referring more to our changing tune regarding mod formatting from:

mod formatting is deprecated, use brace formatting instead; to
they're both fully supported, neither is deprecated; to
use brace formatting for text data, mod formatting for binary data

Folks that followed our original "stop using mod formatting" guidance thus needed to change course when it became our recommended technique for formatting binary data.

Since we now know format() and __format__ aren't suitable for binary data (PEP-461 originally included it, and it got dropped as we kept finding awkward corner cases), that means any new binary formatting proposal needs to explain:

how it compares to existing serialisation techniques (mod-formatting, the struct module, text-formatting+encoding, etc)
why it needs to be a builtin method or function rather than a new serialisation module
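As a rough point of comparison, the sketch below builds a small record with each of the existing techniques named above. The record layout (a two-byte tag and a count) is invented for illustration, and the `struct` variant produces a binary rather than ASCII encoding of the count.

```python
import struct

tag, count = b"ab", 7

# mod-formatting on bytes (Python 3.5+, PEP 461).
via_mod = b"%s=%04d" % (tag, count)

# Text formatting followed by encoding.
via_text = "{}={:04d}".format(tag.decode("ascii"), count).encode("ascii")

# struct: fixed-width binary layout, big-endian unsigned 16-bit count.
via_struct = struct.pack(">2sH", tag, count)

print(via_mod)     # b'ab=0007'
print(via_text)    # b'ab=0007'
print(via_struct)  # b'ab\x00\x07'
```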

sosi-deadeye commented 9 months ago

I could have sworn that bytes.format had been implemented. When I needed it once, I came to the realization that this method never existed in Python 3.0, but it did in Python 2.7.

I also remember that bytes.format triggered an error if the input was of data type str.

Who else has this false memory? Is this the Mandela effect?

python / cpython

support .format for bytes #48232