Closed. david-perez closed this issue 1 year ago.
Trailing bits are a different thing -- see section 3.5. The last chunk of base64 symbols may have unused bits depending on input size, and that config setting controls whether non-zero values in those bits are reported as an error. (Other buggy impls may set those bits.)
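To make the distinction concrete, here is a minimal Python sketch of what "trailing bits" are. It is not this crate's code: the helper name `has_nonzero_trailing_bits` is hypothetical, and it assumes the standard alphabet and that every non-padding character is a valid symbol.

```python
import string

# Standard base64 alphabet (RFC 4648, section 4).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def has_nonzero_trailing_bits(encoded: str) -> bool:
    """Return True if the last symbol carries non-zero unused bits.

    Hypothetical helper for illustration only; assumes every
    non-padding character is in the standard alphabet.
    """
    symbols = encoded.rstrip("=")
    remainder = len(symbols) % 4
    if remainder == 0:
        return False  # Every bit of every symbol is used.
    if remainder == 1:
        raise ValueError("a single leftover symbol cannot encode a whole byte")
    # 2 leftover symbols carry 12 bits for 1 byte (4 bits unused);
    # 3 leftover symbols carry 18 bits for 2 bytes (2 bits unused).
    unused_bits = 4 if remainder == 2 else 2
    last_value = ALPHABET.index(symbols[-1])
    return (last_value & ((1 << unused_bits) - 1)) != 0


# "blob" encodes as "YmxvYg=="; 'g' is 0b100000, so the 4 unused bits are zero.
print(has_nonzero_trailing_bits("YmxvYg=="))  # False
# 'h' is 0b100001: it decodes to the same byte, but a trailing bit is set.
print(has_nonzero_trailing_bits("YmxvYh=="))  # True
```

Note that the check is independent of padding: stripping the `=` characters does not change the answer.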
Padding has no utility. It should never have been part of the spec, IMO, but either way it is not needed to decode. The spec leaves open room for implementations to not pad, which is widely done in practice, so this crate doesn't check for length % 4 == 0. If you want to check that, you'll need to do it yourself, but there really isn't much point in doing so, as leaving off padding does no harm whatsoever in practice.
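A "do it yourself" check could look roughly like the following Python sketch. The function name `is_canonically_padded` is made up for illustration; it inspects only the shape of the padding (length and placement of `=`), not whether the remaining characters belong to the alphabet.

```python
def is_canonically_padded(encoded: str) -> bool:
    """Check the padding shape only; alphabet validation is left to the decoder.

    Hypothetical pre-check, to be run before handing the input to a decoder.
    """
    # Padded base64 always comes in groups of four characters.
    if len(encoded) % 4 != 0:
        return False
    stripped = encoded.rstrip("=")
    pad_count = len(encoded) - len(stripped)
    # At most two pad characters, and only at the very end.
    return pad_count <= 2 and "=" not in stripped


print(is_canonically_padded("YmxvYg=="))  # True
print(is_canonically_padded("YmxvYg="))   # False: length 7
print(is_canonically_padded("YmxvYg"))    # False: length 6
```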
The spec leaves open room for implementations to not pad
The part of the spec you link to is for base64url, which "should not be regarded as the same as the 'base64' encoding and should not be referred to as only 'base64'". On the other hand, the main spec for base64 says "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise."
which is widely done in practice
Do you have data to back up this claim?
as leaving off padding does no harm whatsoever in practice
My use case is in trying to decide whether a specification for a service interface modelling language should reject unpadded base64-encoded data. The spec is implemented in several programming languages, and Rust seems to be alone in its handling here, so it's a source of possible behavior deviations among the different language implementations.
I acknowledge that padding is generally useless nowadays in most use cases, but my interpretation of the spec is that by default, unpadded base64-encoded data should be rejected unless you have special knowledge of the protocol/transport. I think this should translate, in the context of a generic base64 library like this one, into rejecting things like YmxvYg=, unless the user explicitly opts into more lenient behavior.
That interpretation would be a breaking change for this crate. What do you think of including an option in Config to opt into "strict decoding" requiring padding?
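As a rough illustration of the opt-in idea (in Python, and not this crate's Config API), a decoder with a strictness switch might look like the sketch below. The function `decode` and its `require_padding` parameter are hypothetical names; the lenient default mirrors the concern about a breaking change.

```python
import base64


def decode(encoded: str, require_padding: bool = False) -> bytes:
    """Decode standard base64, optionally insisting on canonical padding.

    Hypothetical wrapper for illustration; `require_padding` stands in
    for the proposed "strict decoding" option.
    """
    if require_padding:
        # Strict: padded base64 always has a length that is a multiple of 4.
        if len(encoded) % 4 != 0:
            raise ValueError("input length is not a multiple of 4: padding is missing")
        return base64.b64decode(encoded, validate=True)
    # Lenient: drop whatever padding is present and re-add the canonical amount.
    symbols = encoded.rstrip("=")
    padded = symbols + "=" * (-len(symbols) % 4)
    return base64.b64decode(padded, validate=True)


print(decode("YmxvYg="))  # b'blob'
try:
    decode("YmxvYg=", require_padding=True)
except ValueError as err:
    print("rejected:", err)
```

Keeping the strict path behind a flag means existing callers see no change, while callers that need canonical input can ask for it explicitly.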
and Rust seems to be alone in its handling here
Here is some data to back up this claim. This is how some major programming languages' standard libraries (or canonical/foundational base64 library implementations) handle unpadded base64-encoded data like YmxvYg= by default. As you can see, every implementation rejects it except Node's Buffer.from and C++.
Link to playground: https://replit.com/@dazedviper/Base64-Python#main.py
Traceback (most recent call last):
  File "main.py", line 3, in <module>
    print(base64.b64decode("YmxvYg="))
  File "/nix/store/p21fdyxqb3yqflpim7g8s1mymgpnqiv7-python3-3.8.12/lib/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
This is in the Firefox browser:
>> atob("YmxvYg=");
Uncaught DOMException: String contains an invalid character
This is in the Chromium browser:
>> atob("YmxvYg=");
Uncaught DOMException: Failed to execute 'atob' on 'Window': The string to be decoded is not correctly encoded.
    at <anonymous>:1:1
Node's Buffer.from seems to accept anything you throw at it though: https://replit.com/@dazedviper/Base64-Node#index.js
The base64 library (Haskell). Link to playground: https://replit.com/@dazedviper/Base64-Haskell#Main.hs
Running Cabal-example...
Left "Base64-encoded bytestring requires padding"
In fact, this library offers a decodeBase64Lenient function to decode unpadded data, noting that it is not RFC 4648-compliant.
Link to playground: https://replit.com/@dazedviper/Base64-Java#Main.java
Exception in thread "main" java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
    at java.base/java.util.Base64$Decoder.decode0(Base64.java:837)
    at java.base/java.util.Base64$Decoder.decode(Base64.java:566)
    at java.base/java.util.Base64$Decoder.decode(Base64.java:589)
    at Main.main(Main.java:6)
C++. Link to playground: https://godbolt.org/z/o4r9Kv439
Oblivious to padding; it accepts anything.
OK, I'm convinced it's worth detecting for you weirdos who wish base64 was always canonical. :) As it happens, https://github.com/marshallpierce/rust-base64/issues/182 just popped up today, which would be addressed by the same thing.
And no, I don't have data re: absence of padding in practice -- it's just something I've noticed in my travels because, as you can imagine, I have a professional interest in base64. ;)
Oh wow, I didn't know about that paper nor had I considered using this behavior for attacks. I am now entirely convinced that rejecting unpadded encoded data should be the default behavior.
What are the odds that we both report this within days... 🤯
See if #198 addresses your use case.
Released in 0.20.0.
YmxvYg= is technically invalid base64 according to the spec (its Unicode code point length is not divisible by 4; it's missing a padding character), but this crate is able to decode it without issues; I'm guessing because it can unambiguously decode it (decode_allow_trailing_bits has no effect):
Yields:
Can I configure the crate so that it fails at decoding inputs like this one?