base64 "legacy" functions violate RFC 3548

5d5674c0-b2f5-475c-90db-243ac14d9ece commented 17 years ago

BPO	1753718
Nosy	@loewis, @warsaw, @ncoghlan, @bitdancer, @vadmium, @serhiy-storchaka
Files	issue-1753718.patch issue-01753718.patch issue-01753718.patch issue-01753718.patch issue-01753718.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['docs'] title = 'base64 "legacy" functions violate RFC 3548' updated_at = user = 'https://bugs.python.org/slinkp' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'docs@python' closed = True closed_date = closer = 'r.david.murray' components = ['Documentation'] creation = creator = 'slinkp' dependencies = [] files = ['41250', '41296', '41306', '41325', '41343'] hgrepos = [] issue_num = 1753718 keywords = ['patch'] message_count = 25.0 messages = ['32496', '32497', '32498', '32499', '255959', '256020', '256353', '256357', '256384', '256396', '256412', '256414', '256503', '256634', '256635', '256636', '256637', '256781', '256783', '256949', '256950', '256953', '256963', '257390', '257941'] nosy_count = 10.0 nosy_names = ['loewis', 'barry', 'ncoghlan', 'slinkp', 'r.david.murray', 'docs@python', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'Isobel Hooper'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue1753718' versions = ['Python 3.4', 'Python 3.5', 'Python 3.6'] ```

5d5674c0-b2f5-475c-90db-243ac14d9ece commented 17 years ago

(Python 2.5.1 and earlier) Apologies for how long this is, but cleaning up history is hard.

There seems to be a lot of historical confusion around Base 64 encoding, and unfortunately the python library docs reflect that confusion.

The base64 module docs (at http://docs.python.org/lib/module-base64.html ) claim to implement RFC 3548, as seen at http://www.faqs.org/rfcs/rfc3548.html ... heck it's even in the page title. (I'll quickly note here that RFC 3548 has recently been obsoleted by RFC 4648, but for purposes of this bug report the two RFCs are the same so I'll just refer to 3548.)

But the "legacy" functions, encode() and encodestring() , add line feeds every 76 characters. That is a violation of RFC 3548, which specifically says "Implementations MUST NOT add line feeds to base-encoded data". RFC 4648 says the same thing.

Obviously we can't change behavior of legacy functions, but I strongly feel the docs should warn you about this violation.

What encode() and encodestring() actually implement is MIME base 64 encoding, as per RFC 2045 (see http://tools.ietf.org/html/rfc2045#section-6.8 ... obsoletes 1521, 1522, 1590) So base64.encodestring() is AFAICT functionally identical to email.base64mime.encodestring (tangent: someday we should consolidate those two functions into one implementation).

What RFC 3548 describes IS implemented by the "modern" interface such as base64.b64encode(), which does not split into lines.

There's also the lower-level binascii.b2a_base64() function which afaict correctly implements RFC 3548 (although it adds a newline at the end, which base64.b64encode() does not; it's not clear to me which is correct per RFC 3548.) At one time, b2a_base64 DID split into lines, but that was fixed: http://sourceforge.net/tracker/index.php?func=detail&aid=473009&group_id=5470&atid=105470 . But unfortunately, its docs still mistakenly say "The length of data should be at most 57 to adhere to the base64 standard." which presumably refers to the old PEM spec that predates even MIME.

So I propose several doc changes:

1) Add a reference to RFC 3548 and/or 4648 to the binascii docs, and remove the cruft sentence about "at most 57".

2) Add a sentence to the base64 docstrings for encode() and encodestring() like this:

"Newlines will be inserted in the output every 76 characters, as per RFC 2045 (MIME)."

3) Change the introductory text in the base64 docs to something like this:

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets. The modern interface, per RFC 3548, does NOT break the output into multiple lines.

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME). Thus, the legacy interface is not compliant with the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions but you should probably be using the email package instead. """

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

I would not say that the older functions violate RFC 4648. They just don't implement it, as they implement some other standard instead.

5d5674c0-b2f5-475c-90db-243ac14d9ece commented 17 years ago

ok then, how about this (last sentence of middle paragraph slightly modified):

"""There are two interfaces provided by this module. The modern interface supports encoding and decoding string objects using all three alphabets. The modern interface, per RFC 3548, does NOT break the output into multiple lines.

The legacy interface provides for encoding and decoding to and from file-like objects as well as strings, but only using the Base64 standard alphabet, and adds newlines every 76 characters as per RFC 2045 (MIME). Thus, the legacy interface does not implement the newer RFC 3548.

If you are encoding data for email attachments and wondering which to use, you can use the legacy functions but you should probably be using the email package instead. """

61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

Barry, as you implemented these new functions: can you fix this appropriately? If not, please unassign.

8b9ab460-64bb-4d37-9a30-429a0f897b44 commented 8 years ago

Attached patch fixes library/base64.rst as requested, and adds a mention of RFC 3548 into the b2a_base64() docs in library/binascii.rst.

I'm not sure I've made the changes against the right version of the docs - I think this might be against the 3.3 docs.

bitdancer commented 8 years ago

See also the discussion in bpo-25495. I will try to review both of these issues soon.

bitdancer commented 8 years ago

I started tweaking this patch, and wound up going through the whole doc and fixing the references to 'byte string' and 'string' throughout, as well as making all the entries consistent in how they reference the function arguments and output (previously some did not reference the output at all, nor was it clear that the output is always bytes). I believe I also clarified some confusing wordings along the way.

Since there are so many changes I need some eyes checking my work before I commit.

Note that the primary motivation for this change (the incorrect claim that both interfaces supported the RFC) is not made by the 2.7 docs, and since those docs are very different now, I don't plan to touch them.

vadmium commented 8 years ago

Left some review comments. I left a comment about the original patch as well, because I didn’t notice the new patch in time :)

Also, maybe we should say the input to the “legacy” MIME decode() function should be multiple lines, since it calls readline() with no line limit.

bitdancer commented 8 years ago

How about we just make the docs more correct and say that input is read until readline() returns an empty bytes object? That should make it clear that a line-oriented file is expected.

bitdancer commented 8 years ago

Updated patch.

vadmium commented 8 years ago

The change to readline() works well.

Any thoughts regarding my other comments? In particular, altchars and ignorechars cannot be arbitrary bytes-like objects.

bitdancer commented 8 years ago

I missed the other comments somehow. Will take a look soon.

bitdancer commented 8 years ago

Updated patch that addresses most of the comments.

bitdancer commented 8 years ago

The intent of the term "bytes-like object" it to make it possible to use it in documentation in the way I have used it here. That the buffer has a len is clearly discussed in the Buffer Protocol documentation, but of course that's only talking about the C level API. Perhaps what is needed is an addition to the bytes-like object description that clarifies that a bytes-like object is a Sequence that supports the buffer protocol? (So: "A Sequence object that supports the Buffer Protocol and...") Do we also need to clarify that the item size must be one byte? That would seem to me to be implicit in the name.

I don't know if what ctypes produces is a bytes-like object in this sense, since I don't understand ctypes very well, but it sounds like it isn't. Trying to wrap it in a memoryview gives an error ('unsupported format \<c'), so I suspect it is not.

I'm adding Nick to nosy, since the commit log says he added the bytes-like object support to this module.

bitdancer commented 8 years ago

The altchars 2 char limit is an assertion. That's a bug that should be dealt with separately. Either it should be turned into an error, or it should be dropped to match the docs. Probably the latter, since it is documented as OK and it might break code that is currently working in -O mode or on python2.

bitdancer commented 8 years ago

Fixed the spurious 'u'.

bitdancer commented 8 years ago

"or on python2" should be "or ported from python2".

Also note that Nick's commit message specifically mentions a test for multi-dimensional input, so the module does indeed conform to the current bytes-like object definition in that regard.

vadmium commented 8 years ago

I understood “bytes-like” to mean what is now defined in the glossary: what is accepted by the C-level y* format for the PyArg parsing functions. I tend to agree with \https://bugs.python.org/issue23756#msg248265\ that it may not have been the best term to choose when you really mean “C-contiguous buffer”.

I am understanding that you take “bytes-like” to be a more specific thing. Perhaps we could instead have two distinct terms, say “C-contiguous buffer”, which is what FileIO.write() and PyArg supports, and “byte sequence”, perhaps implementing an API common to bytes() and memoryview(), which is easier to work with using native Python.

In general, I think ctypes and array.array produce my stricter kind of C-contiguous buffers. But since bpo-15944 native Python code can access these buffers by casting to a second memoryview:

>>> c = ctypes.c_char(b"A")
>>> with memoryview(c) as array_view, array_view.cast("B") as byte_view:
...     print(repr(byte_view[0]))
... 
65

Nick’s commit d90f25e1a705 mentions multi-dimensional input for the “modern” interface. That is not the problem. In \https://bugs.python.org/issue17839#msg198843\ he decided to be less permissive for the “legacy” interface, which seems unnecessary to me.

Anyway, this is all rather off-topic. Apart from the bytes-like errors, the rest of the current patch is good. Even if you committed with those four errors, I can live with that. I think there are similar problems elsewhere in the documentation, HTTPConnection.request() over TLS for instance.

bitdancer commented 8 years ago

The term "bytes-like object" is specifically designed for those situations where python used to say "X does not support the buffer interface" if passed something else. Which is the case here...now it says "a bytes-like object is required". I'm not sure if we fixed that everywhere, but I think we did. It is certainly true in the cases you cite, except that it turns out that ignorechars accepts an ASCII string.

So, if there is any sort of remaining problem, it is a separate issue: my edits match the current error message behavior. I'll fix ignorechars and commit.

1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

New changeset 105bf5dd93b8 by R David Murray in branch '3.5': bpo-1753718: clarify RFC compliance and bytes/string argument types. https://hg.python.org/cpython/rev/105bf5dd93b8

New changeset 92760d2edc9e by R David Murray in branch 'default': Merge: bpo-1753718: clarify RFC compliance and bytes/string argument types. https://hg.python.org/cpython/rev/92760d2edc9e

bitdancer commented 8 years ago

Thanks everyone.

serhiy-storchaka commented 8 years ago

While we are here, may be update docstrings too?

bitdancer commented 8 years ago

That would be a good idea, yes. I thought Martin was doing that as part of bpo-22088, but now that I look at the patch I see he didn't. Martin, do you want to add it to that patch, or should I reopen this?

vadmium commented 8 years ago

I was waiting for you to finish here to avoid any new merge conflicts. Now that you have committed your patch, I will try and work on mine in the next few days, and I am happy to update the doc strings at the same time.

vadmium commented 8 years ago

Uploaded a Python 3 patch to bpo-22088 which includes the doc string changes.

python / cpython

base64 "legacy" functions violate RFC 3548 #45194