python / cpython

The Python programming language
https://www.python.org
Other
62.47k stars 29.99k forks source link

binascii documentation incorrect #69681

Closed 5089ffbc-b451-485c-8b0b-c120b6e778bf closed 8 years ago

5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago
BPO 25495
Nosy @birkenfeld, @bitdancer, @vadmium, @matrixise, @mouse07410
Files
  • issue25495.patch
  • issue25495.base64.2.7.patch: For 2.7
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['docs'] title = 'binascii documentation incorrect' updated_at = user = 'https://github.com/mouse07410' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'docs@python' closed = True closed_date = closer = 'r.david.murray' components = ['Documentation'] creation = creator = 'mouse07410' dependencies = [] files = ['40878', '40953'] hgrepos = [] issue_num = 25495 keywords = ['patch'] message_count = 27.0 messages = ['253572', '253577', '253607', '253625', '253627', '253710', '253744', '253766', '253918', '253920', '253921', '253922', '253924', '253935', '253955', '254011', '254100', '254118', '254122', '254141', '254148', '254195', '254820', '256021', '256348', '256349', '256354'] nosy_count = 7.0 nosy_names = ['georg.brandl', 'r.david.murray', 'docs@python', 'python-dev', 'martin.panter', 'matrixise', 'mouse07410'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue25495' versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6'] ```

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    binascii b2a_base64() documentation says:

    The length of data should be at most 57 to adhere to the base64 standard.

    This is incorrect, because there is no base64 standard that restricts the length of input data, especially to such a small value.

    What RFC4648 (that superseded RFC3548 that your documentation still keeps referring to) actually says is that MIME enforces the limit ofthe OUTPUT LINE length at 76, but NOT of the entire output, and certainly not of the entire input.

    Please correct the documentation, making it conformant with what the ACTUAL base64 standard says.

    See https://en.wikipedia.org/wiki/Base64 and https://tools.ietf.org/html/rfc4648

    Thanks!

    bitdancer commented 8 years ago

    I agree that the documentation is not optimal. To give you some background, binascii was primarily implemented to support the email module, and the standard it is referring to is in fact the MIME standard that references base64 (I believe at the time the independent base64 standard did not exist). The number 57 is derived from the fact that 57 * 4/3 = 76; that is, input data of length 57 will result in an encoded line that is equal to the maximum recommended line length. Thus in encoding an email the email package (originally, it no longer does this) passed the data in in 57 byte chunks and appended the resulting lines to the output buffer.

    So, this documentation is historically correct, but no longer particularly useful. Suggested improvements are welcome.

    This state of affairs exists because the binascii module doesn't really have a current maintainer. Someday I'd love to have enough time to pick it up, since I maintain email and it is still used by email (and could be used better, with some module improvements).

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    Yes I know where this came from. :-)

    Here is my proposed change.

    Replace the statement

    The length of data should be at most 57 to adhere to the base64 standard.

    with:

    To be MIME-compliant, the Base64 output (as defined in RFC4648) should be broken into lines of at most 76 characters long. This post-processing of the output is the responsibility of the caller. Note that the original PEM context-transfer encoding limited line length to 64 characters.

    Would this change be agreeable to you?

    matrixise commented 8 years ago

    Here is a patch with the submitted description.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    Thank you for the quick turn-around, and for taking care of this issue!

    vadmium commented 8 years ago

    Thanks for bringing this up, it has often bugged me.

    My understanding, based on the original wording and places where this is used, is that the data is often pre-processed into 57-byte chunks, rather than post-processing it. Maybe it is worthwhile restoring that info. It should help understanding the presence of the newline parameter (or why a newline is always added in 3.5).

    Also, the link between RFC 4648 and this function could be made even more explicit. Maybe move “as defined” into the first sentence, or change “the Base64 output” to “the function’s output”.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    As far as I remember, the data was not "originally processed in 57-byte chunks". I've been around the first PEM and MIME standards and discussions (and code, though not in Python, which wasn't around then) to be in position to know. :)

    Whether the user prefers to process data in chunks or not, is up to the user. Not to mention that PEM is long gone, and MIME also changed somewhat.

    The link between this function and RFC4648 can and should be more explicit, but I think just referring to it is enough.

    Do you have a recommendation for additional info to explain newline issues?

    Yes, changing "Base64 output" to "function output" makes perfect sense.

    Thanks!

    vadmium commented 8 years ago

    I was only referring to the original Python documentation and library. See the base64.encode() implementation for an example which does do this 57-byte pre-chunking. Simplified:

    MAXLINESIZE = 76 # Excluding the CRLF
    MAXBINSIZE = (MAXLINESIZE//4)*3  # 57
    ...
    while True:
        s = input.read(MAXBINSIZE)
        if not s:
            break
        line = binascii.b2a_base64(s)
        output.write(line)

    Here’s my attempt to rewrite the doc (3.6 version):

    ''' Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true.

    To be MIME-compliant, base 64 output should be broken into lines at most 76 characters long. One way to do this is to call this function with 57-byte chunks and newline=True. Also, the original PEM context-transfer encoding limited the line length to 64 characters. '''

    But if PEM is long gone as you say, perhaps we don’t need that last sentence?

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago
    1. I concede knowing nothing about the early Python library implementation, functionality, or even purpose.

    2. I don't think it makes sense now to either refer to PEM. We'd be two decades too late for that (well, 27 years, to be precise :). See https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail

    3. I don't think we are in position to tell programmers how to split a string of characters into 76-long chunks. Not to mention that the example you gave is likely to suffer in performance (just count those function calls), compared to other methods, and won't reflect well on the authors.

    Here's one possible doc version:

    ''' Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true.

    If the output is used as Base64 transfer encoding for MIME (:rfc: 2045), base 64 output should be broken into lines at most 76 characters long to be compliant. Base64 encoding standard does not limit the maximum encoded line length. '''

    bitdancer commented 8 years ago

    Add a parenthetical "(57 bytes of the input per line)" and I'll be happy with that.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    Let's not insinuate anything about the input. This is about what constraints on the OUTPUT MAY be there, not a tutorial from the 80-ties on how one might accomplish it.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    And even those constraints depend on the use. E.g. X.509 does not have those.

    bitdancer commented 8 years ago

    Please take a look at the Examples section of this:

    http://perldoc.perl.org/MIME/Base64.html

    Looks kind of like Martin's suggestion :)

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago
    1. I am OK with the following text, modeling referred Perldoc:

    b2a_base64( $bytes, $eol );

    Encode data by calling the encode_base64() function. The first argument is the byte string to encode.

    The second argument is optional, and provides the line-ending sequence to use. When it is given, the returned encoded string is broken into lines of no more than 76 characters each and it will end with $eol unless it is empty. Pass an empty string, or no second argument at all if you do not want the encoded string to be broken into lines.

    1. I already had people telling me that "Python-3 doc prohibits input longer than 57 bytes, even though it doesn't currently enforce it". Please help putting end to spreading of this confusion.
    vadmium commented 8 years ago

    FWIW that Perl function looks like it does the line splitting for you. It mentions “chunks that are a multiple of 57 bytes”. The Python function does not do any line splitting. You have to use base64.encodebytes(), codecs.encode(encoding="base64") or perhaps something in the email package (or user code) for that.

    I think we all agree that there is no hard limit of 57. I have avoided this function in the past due to the documentation. The question is whether the documentation should mention that number in a more accurate context, or not at all.

    Personally I don’t see much harm in mentioning the 57-byte input chunking, as long as it is obvious it is not the only option. I don’t have a strong view; I am just trying to be conservative.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    The harm in mentioning the 57-byte chunking is that so far it successfully confused people.

    b2a_base64() function is not coupled to MIME. It has no constraints on either its input, or its output. *IF* it is used by (in) a MIME application, then the caller may want to make its output RFC 2045-compliant, by whatever way he chooses. Giving (an unwelcome) advice to a writer of one specific application is in my opinion completely out of scope here. Justification that it used to matter 25 years ago and therefore should be kept here doesn't make sense to me.

    I strongly insist that this "chunking" thing does not belong, and must be removed.

    vadmium commented 8 years ago

    Perhaps we can focus on the Python 2 version where there is always a newline appended. Here is a possible patch.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    Unfortunately, NO. The problem (and this bug report) is for Python-3 documentation, so trying to address it in Python-2 rather than in Python-3 does not make sense.

    We seem to both understand and agree that there is no length limitation on b2a_base64() input, either recommended or enforced - contrary to what the current Python-3 documentation implies.

    We understand that *if the *output of this function is intended for use in MIME (rather than X.509 or whatever else Base64 is good for), then the caller should do other things besides calling b2a_base64(), and in all likelihood the caller is already aware of that - after all, if he figured that he needs Base64 in his stuff, he probably knows something about what MIME standards say and require?.

    I repeat my original complaint: Python-3 documentation is buggy because it implies a restriction on the input that is not there. This reference should be removed from there because it confuses people. I've talked to those confused personally, so this is first-hand.

    I refer you to the original msg253572 of this bug report.

    If you want to write a MIME-in-Python tutorial, it is up to you - but b2a_base64() does not seem to be the right place for it.
    (And I'd rather see an X.509 tutorial if you're dead set on writing something besides strict plain b2a_base64() doc. :-)

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    To add: I do not understand your attachment to that 57 "...(exactly 57 bytes of input data per line)", and request that this parenthesized sentence is removed from your Python-2.7 doc patch.

    Please give the reader the benefit of the doubt, and allow that *if he wants to repeatedly call b2a_base64() instead of splitting its output - the ability to compute (76 3 / 4) is within his skill level.

    birkenfeld commented 8 years ago

    bpo-25495.base64.2.7.patch looks good to me. A similar patch can be adapted for 3.x.

    vadmium commented 8 years ago

    Mouse, I know you originally opened this against 3.5. Apart from the module description at the bottom, my patch should be valid for 3.5 also. The relevant wording is identical to 2.7.

    I have resisted removing the magic number 57 for a couple of reasons. Reading existing code that uses this number may be harder. David said he would be happier with it kept. I believed we could solve your original complaint and explain why the number was really there at the same time. It helps explain how the function was originally to be used, and why the newline is appended.

    Anyway, I think it is best if I let this go, and someone else pick it up.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    my patch should be valid for 3.5 also. The relevant wording is identical to 2.7.

    OK.

    I have resisted removing the magic number 57 for a couple of reasons. Reading existing code that uses this number may be harder.

    You expect to see "existing code that uses this number" in Python-3.5+? Interesting... (Care to point me at a couple of samples of such "existing" Python-3 code?) And you expect that the main info source for understanding the reason behind that "57" (assuming this function is invoked that way, as opposed to splitting the output :) would be the doc for this function, rather than the main program, or RFC 2045, or...? Fine.

    It helps explain how the function was originally to be used, and why the newline is appended.

    Pardon me, but why do you think anybody would care...? There are tons of functions, old and new, with more new ones popping up fast enough. I'd really envy a person who has time to enjoy history of one minuscule function of an old (albeit still useful :) library.

    OK. You think a history of this function should be documented - fine. I don't need it (and don't think anybody else wants to read it either), but it's not my doc or my decision.

    Just get the darn bug fixed.

    5089ffbc-b451-485c-8b0b-c120b6e778bf commented 8 years ago

    Status...?

    bitdancer commented 8 years ago

    See also bpo-1753718. I will try to review both of these issues soon.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 7b137466e879 by R David Murray in branch '2.7': bpo-25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/7b137466e879

    New changeset 3d5bf9bd15a3 by R David Murray in branch '3.4': bpo-25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/3d5bf9bd15a3

    New changeset ea9951598bab by R David Murray in branch '3.5': Merge: bpo-25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/ea9951598bab

    New changeset 35650db28afe by R David Murray in branch 'default': Merge: bpo-25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/35650db28afe

    bitdancer commented 8 years ago

    I kept the 57 as part of an historical note explaining why the newline is added. I dropped that sentence in the 3.6 docs, where a keyword to control the apending of the newline has been added.

    vadmium commented 8 years ago

    Thanks for fixing this up David