Extant base64-encoded data is accepted; can it be rejected?

david-perez commented 2 years ago

YmxvYg= is technically invalid base64 according to the spec (its Unicode code point length is not divisible by 4, it's missing a padding character), but this crate is able to decode it without issues; I'm guessing because it can unambiguously decode it (decode_allow_trailing_bits has no effect):

fn main() {
    let config =
        base64::Config::new(base64::CharacterSet::Standard, true).decode_allow_trailing_bits(true);
    let decoded = base64::decode_config("YmxvYg=", config);
    dbg!(&decoded);

    let config2 =
        base64::Config::new(base64::CharacterSet::Standard, true).decode_allow_trailing_bits(false);
    let decoded2 = base64::decode_config("YmxvYg=", config2);
    dbg!(&decoded2);
}

Yields:

[src/main.rs:5] &decoded = Ok(
    [
        98,
        108,
        111,
        98,
    ],
)
[src/main.rs:10] &decoded2 = Ok(
    [
        98,
        108,
        111,
        98,
    ],
)

Can I configure the crate so that if fails at decoding inputs like this one?

marshallpierce commented 2 years ago

Trailing bits are a different thing -- see section 3.5. The last chunk of base64 symbols may have unused bits depending on input size, and that config setting controls whether or not having those bits be non-zero is reported as an error or not. (Other buggy impls may set those bits.)

Padding has no utility. It should never have been part of the spec, IMO, but either way it is not needed to decode. The spec leaves open room for implementations to not pad, which is widely done in practice, so this crate doesn't check for length % 4 == 0. If you want to check that, you'll need to do it yourself, but there really isn't much point in doing so, as leaving off padding does no harm whatsoever in practice.

david-perez commented 2 years ago

The spec leaves open room for implementations to not pad

The part of the spec you link to is for base64url, which "should not be regarded as the same as the 'base64' encoding and should not be referred to as only 'base64'. On the other hand, the main spec for base64 says "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise."

which is widely done in practice

Do you have data to back up this claim?

david-perez commented 2 years ago

as leaving off padding does no harm whatsoever in practice

My use case is in trying to decide whether a specification for a service interface modelling language should reject unpadded base64-encoded data. The spec is implemented in several programming langauges, and Rust seems to be alone in its handling here, so it's a source of possible behavior deviations among the different language implementations.

david-perez commented 2 years ago

I acknowledge that padding is generally useless nowadays in most use cases, but my interpretation of the spec is that by default, unpadded base64-encoded data should be rejected unless you have special knowledge of the protocol/transport. I think this should translate, in the context of a generic base64 library like this one, into rejecting things like YmxvYg=, unless the user explicitly opts into more lenient behavior.

That interpretation would be a breaking change for this crate. What do you think of including an option in Config to opt into "strict decoding" requiring padding?

david-perez commented 2 years ago

and Rust seems to be alone in its handling here

Here is some data to back up this claim. This is how some major programming languages' standard libraries (or canonical/foundational base64 library implementations) handle unpadded base64-encoded data like YmxvYg= by default. As you can see, everyone except C++ rejects.

Python

Link to playground: https://replit.com/@dazedviper/Base64-Python#main.py

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    print(base64.b64decode("YmxvYg="))
  File "/nix/store/p21fdyxqb3yqflpim7g8s1mymgpnqiv7-python3-3.8.12/lib/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

JavaScript

This is in the Firefox browser:

>> atob("YmxvYg=");
Uncaught DOMException: String contains an invalid character

This is in Chromium browser:

>> atob("YmxvYg=");
Uncaught DOMException: Failed to execute 'atob' on 'Window': The string to be decoded is not correctly encoded.
    at <anonymous>:1:1

Node's Buffer.from seems to accept anything you throw at it though: https://replit.com/@dazedviper/Base64-Node#index.js

Haskell's `base64` library

Link to playground: https://replit.com/@dazedviper/Base64-Haskell#Main.hs

Running Cabal-example...
Left "Base64-encoded bytestring requires padding"

In fact, this library offers a decodeBase64Lenient function to decode unpadded data, noting that it is not RFC 4648-compliant.

Java (and other JVM-based languages)

Link to playground: https://replit.com/@dazedviper/Base64-Java#Main.java

Exception in thread "main" java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
    at java.base/java.util.Base64$Decoder.decode0(Base64.java:837)
    at java.base/java.util.Base64$Decoder.decode(Base64.java:566)
    at java.base/java.util.Base64$Decoder.decode(Base64.java:589)
    at Main.main(Main.java:6)

C++'s boost library

Link to playground: https://godbolt.org/z/o4r9Kv439

Oblivious to padding; it accepts anything.

marshallpierce commented 2 years ago

OK, I'm convinced it's worth detecting for you weirdos who wish base64 was always canonical. :) As it happens, https://github.com/marshallpierce/rust-base64/issues/182 just popped up today, which would be addressed by same thing.

And no, I don't have data re: absence of padding in practice -- it's just something I've noticed in my travels because, as you can imagine, I have a professional interest in base64. ;)

david-perez commented 2 years ago

Oh wow, I didn't know about that paper nor had I considered using this behavior for attacks. I am now entirely convinced that rejecting unpadded encoded data should be the default behavior.

What are the odds that we both report this within days... 🤯

marshallpierce commented 1 year ago

See if #198 addresses your use case.

marshallpierce commented 1 year ago

Released in 0.20.0.

marshallpierce / rust-base64