[BUG] b64decode does not handle whitespaces

martinvuyk commented 2 months ago

Bug description

Detected by @lemire in PR #3443

Currently, b64decode does not appear to handle whitespace characters. I would have expected the following to print 'Bonjour', but it does not:

from base64 import b64decode
def main():
    var data = b64decode("Qm9 uam91cg==")
    print(data)

output:

Bo������

System information

- OS: not provided
- Mojo version (output of `mojo -v`): `mojo 2024.9.105`
- Modular CLI version (output of `modular -v`): not provided

lemire commented 2 months ago

Note that I did not report it as a bug because the documentation does not seem to imply that white space is handled.

A relevant specification is WHATWG Forgiving Base64 decoding:

https://infra.spec.whatwg.org/#forgiving-base64-decode

C#/.NET follows it, as does JavaScript's atob function. Other systems may follow it as well.
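
For illustration, here is a rough Python sketch of the forgiving-base64 steps as I read them in the WHATWG spec: strip ASCII whitespace, drop one or two trailing `=` when the length is a multiple of 4, reject a length with remainder 1, reject anything outside the alphabet, then decode. The function name `forgiving_b64decode` and the use of `ValueError` are my own choices for this sketch, not part of any spec or of Mojo's stdlib.

```python
import base64

# ASCII whitespace as defined by the WHATWG Infra spec: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\f\r "
BASE64_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
)


def forgiving_b64decode(data: str) -> bytes:
    """Hypothetical helper sketching WHATWG forgiving-base64 decode."""
    # 1. Remove all ASCII whitespace.
    data = "".join(ch for ch in data if ch not in ASCII_WHITESPACE)

    # 2. If the length is a multiple of 4, drop one or two trailing '='.
    if len(data) % 4 == 0:
        if data.endswith("=="):
            data = data[:-2]
        elif data.endswith("="):
            data = data[:-1]

    # 3. A remainder of 1 can never encode whole bytes.
    if len(data) % 4 == 1:
        raise ValueError("invalid base64 length")

    # 4. Anything outside the base64 alphabet (including a stray '=') fails.
    if any(ch not in BASE64_ALPHABET for ch in data):
        raise ValueError("non-alphabet character in base64 input")

    # 5. Re-pad so the standard-library decoder accepts the input, then decode.
    padded = data + "=" * (-len(data) % 4)
    return base64.b64decode(padded)


print(forgiving_b64decode("Qm9 uam91cg=="))
```

Unlike Python's default mode, this rejects non-whitespace junk instead of silently discarding it, and it accepts input with the padding omitted.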

martinvuyk commented 2 months ago

I forgot to add the relevant spec and what Python does, which is what we try to follow.

Python follows RFC 4648, Section 3.3:

Implementations MUST reject the encoded data if it contains
   characters outside the base alphabet when interpreting base-encoded
   data, unless the specification referring to this document explicitly
   states otherwise.  Such specifications may instead state, as MIME
   does, that characters outside the base encoding alphabet should
   simply be ignored when interpreting data

Python:

from base64 import b64decode
print(b64decode("Qm9 uam91cg=="))

output:

b'Bonjour'

From the Python docs:

If validate is False (the default), characters that are neither in the normal base-64 alphabet nor 
the alternative alphabet are discarded prior to the padding check. If validate is True, these 
non-alphabet characters in the input result in a [binascii.Error](https://docs.python.org/3/library/binascii.html#binascii.Error).
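
To make the two modes described above concrete, here is a small sketch that decodes the same input with and without `validate=True`. The variable names are just for illustration.

```python
import base64
import binascii

encoded = "Qm9 uam91cg=="  # same input as in the bug report, with a stray space

# Default mode (validate=False): the space is discarded before decoding.
lenient = base64.b64decode(encoded)

# Strict mode (validate=True): the space is a non-alphabet character,
# so decoding raises binascii.Error.
try:
    base64.b64decode(encoded, validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True

print(lenient)
print(strict_rejected)
```

So the Python default corresponds to the "ignore non-alphabet characters" allowance in the RFC text quoted above, while `validate=True` corresponds to the MUST-reject behaviour.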

lemire commented 2 months ago

@martinvuyk Right. So the base64 algorithm in simdutf can solve this at high speed. It is already used in production. (It is part of WebKit/Safari and Node.js, Bun, etc.)
