ia0 / data-encoding

Efficient and customizable data-encoding functions in Rust
https://data-encoding.rs/
MIT License

check_trailing_bits still errors at certain lengths for base32 #28

Closed: robyoder closed this issue 5 years ago

robyoder commented 5 years ago

When using spec.check_trailing_bits = false with the base32 spec, I would expect any string of base32 characters to decode successfully, with any extra bits ignored. But this is not the case: for strings whose length is congruent to 1, 3, or 6 (mod 8), a DecodeError is returned.

use data_encoding::{BASE32_NOPAD, DecodeError};

fn main() {
    println!("{:?}", decode(b"77777777"));
    println!("{:?}", decode(b"777777777"));
    println!("{:?}", decode(b"7777777777"));
}

fn decode(bytes: &[u8]) -> Result<Vec<u8>, DecodeError> {
    // Relax the base32 spec so non-zero trailing bits in the last
    // symbol are ignored instead of rejected.
    let mut spec = BASE32_NOPAD.specification();
    spec.check_trailing_bits = false;

    spec.encoding().unwrap().decode(bytes)
}

The above code produces the following output:

Ok([255, 255, 255, 255, 255])
Err(DecodeError { position: 8, kind: Length })
Ok([255, 255, 255, 255, 255, 255])

Playground link

I understand that by the canonical definition in the RFC, the second input is invalid base32, but so is the third, and I would expect them to be treated the same when it comes to check_trailing_bits. So for this example, I would expect the second line in the output to be identical to the first.

Is the current behavior in this circumstance intentional, or is it simply a matter of a length check being done first without concern for check_trailing_bits?

For reference, Google Authenticator for Android follows my expectation in this regard, as does Authy for iOS (in my own testing). In the linked comment you can see that a "string of sixteen 7s ("7...7") and seventeen 7s both decode to the same byte array", which is not the case here.
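
A minimal sketch of that expectation ('7' is the all-ones symbol in the RFC 4648 base32 alphabet):

use data_encoding::BASE32_NOPAD;

fn main() {
    // Sixteen '7's have a valid length (16 % 8 == 0) and decode to ten
    // 0xFF bytes with no trailing bits left over.
    let sixteen = BASE32_NOPAD.decode(b"7777777777777777").unwrap();
    assert_eq!(sixteen, vec![0xFF; 10]);

    // Seventeen '7's have no valid base32 length (17 % 8 == 1); the
    // expectation is that a lenient decoder would drop the extra symbol
    // and yield the same ten bytes.
    let seventeen: &[u8] = b"77777777777777777";
    assert_eq!(BASE32_NOPAD.decode(&seventeen[..16]).unwrap(), sixteen);
}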

ia0 commented 5 years ago

Thank you @robyoder for the report.

This is actually expected behavior. The check_trailing_bits option configures the behavior for the trailing bits of the last symbol (for inputs of valid length), not for trailing bytes. There is currently no convenience option to accept inputs of invalid length. Maybe it would be useful to add one; please open an issue if you would benefit from such a feature.
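
To make the distinction concrete, here is a minimal sketch (the input b"MZ" is my own illustration): a valid-length input whose last symbol carries non-zero trailing bits is rejected by the strict encoding but accepted by the relaxed one.

use data_encoding::BASE32_NOPAD;

fn main() {
    // b"MZ" has a valid length (2 % 8 == 2) but its last symbol carries
    // non-zero trailing bits, so the strict decoder rejects it.
    assert!(BASE32_NOPAD.decode(b"MZ").is_err());

    // With check_trailing_bits = false the trailing bits are ignored
    // and the input decodes to b"f".
    let mut spec = BASE32_NOPAD.specification();
    spec.check_trailing_bits = false;
    assert_eq!(spec.encoding().unwrap().decode(b"MZ").unwrap(), b"f");
}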

However, there is a way to decode inputs of invalid length by using the position of the Length error returned by decode (see the documentation). This has very small overhead and can be wrapped in a convenience truncate_decode function:

use data_encoding::{DecodeError, DecodeKind, Encoding};

fn truncate_decode(encoding: &Encoding, input: &[u8]) -> Result<Vec<u8>, DecodeError> {
    match encoding.decode(input) {
        // On a Length error, `position` is the longest valid input length,
        // so decoding that prefix simply drops the trailing symbols.
        Err(DecodeError {
            position,
            kind: DecodeKind::Length,
        }) => encoding.decode(&input[..position]),
        output => output,
    }
}

Playground link
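
For example, here is a minimal usage sketch combining the relaxed spec from your report with truncate_decode; all three inputs then decode successfully:

use data_encoding::BASE32_NOPAD;

fn main() {
    let mut spec = BASE32_NOPAD.specification();
    spec.check_trailing_bits = false;
    let encoding = spec.encoding().unwrap();

    assert_eq!(truncate_decode(&encoding, b"77777777").unwrap(), vec![0xFF; 5]);
    assert_eq!(truncate_decode(&encoding, b"777777777").unwrap(), vec![0xFF; 5]);
    assert_eq!(truncate_decode(&encoding, b"7777777777").unwrap(), vec![0xFF; 6]);
}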

robyoder commented 5 years ago

Sure, we can work around it and we will, but probably without running the decode function twice.

My question is this: what is the benefit of having check_trailing_bits for only the cases it currently covers? If some algorithm is generating base32 (or some other base) with trailing bits that are not zeros, isn't it plausible that it would also be generating base32 with trailing characters?

One such algorithm would be a naive secret generator that selects characters at random from the base32 alphabet. The final character in the string may represent 1-4 trailing bits, or a full 5 extra bits. I'm curious what kinds of algorithms check_trailing_bits is designed to accommodate if not one like this.
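
For concreteness, such a naive generator might look like the following sketch (assuming the rand crate for sampling):

use rand::Rng;

// The RFC 4648 base32 alphabet.
const ALPHABET: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

// Picks `len` symbols uniformly at random. For len % 8 in {1, 3, 6} the
// result has no valid base32 length, and for most other lengths the last
// symbol carries non-zero trailing bits with high probability.
fn naive_secret(len: usize) -> Vec<u8> {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| ALPHABET[rng.gen_range(0..32)]).collect()
}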

ia0 commented 5 years ago

You're not running the decode function twice; you only call it twice on invalid input, and one of those calls returns immediately. The truncate_decode function has very low overhead, as I said (a few nanoseconds), because the length check is done before everything else: the content of the input is not read at all. This is actually one of the reasons why checking the length is a separate check from checking trailing bits. One doesn't look at the data, while the other does.

The customization provided by the library is meant to configure the implementation, not to adapt to use-cases. This is because the number of possible use-cases is unbounded, while the number of decisions taken in the implementation is finite. So the right design, and the one the library currently takes, is to provide fine-grained control over the implementation. It is the user's job to define and use the specification adapted to their use-case.

The current configuration options were added to replicate the behavior of other implementations (like the GNU base64 and base32 programs) for compatibility reasons. I wasn't aware that some implementations accept inputs of invalid length and that some users rely on that. I'll add this as a configuration option (see #29).

The algorithm you describe doesn't need this configuration to be implemented efficiently, because encode_len, which returns the number of base32 characters needed to encode a given number of bytes, never returns an invalid length. Something like the following would work fine:

use data_encoding::Encoding;

fn generate(base: &Encoding, len: usize) -> Vec<u8> {
    // Randomize content; assumes check_trailing_bits = false as above.
    let input = vec![b'F'; base.encode_len(len)];
    base.decode(&input).unwrap()
}

Playground link
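
A minimal usage sketch (reusing the relaxed spec from the report):

use data_encoding::BASE32_NOPAD;

fn main() {
    // encode_len(10) == 16 for base32, and sixteen symbols decode to
    // exactly ten bytes, so generate always yields `len` bytes.
    let mut spec = BASE32_NOPAD.specification();
    spec.check_trailing_bits = false;
    let secret = generate(&spec.encoding().unwrap(), 10);
    assert_eq!(secret.len(), 10);
}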