pyca / cryptography

cryptography is a package designed to expose cryptographic primitives and recipes to Python developers.
https://cryptography.io

Cannot load DER certificates when trailing data is present #7340

Closed vlaci closed 2 years ago

vlaci commented 2 years ago

I found this behavior surprising, as OpenSSL (and hence pyOpenSSL) doesn't impose this restriction.

My use case is that I parse certificates from more-or-less unstructured data, and I thought I wouldn't need an ASN.1 parser myself, since cryptography parses the structures anyway. I tried to patch this behavior myself, but it was also surprising that I found no easy way to use the asn1 crate in a way that allows extra data at the end.

My suggestion would be to extend the asn1 crate with a function that slurps any remaining data, and to change the Rust binding to (maybe optionally?) use this relaxed behavior. I am not sure that this very strict parsing behavior is beneficial in any way, as it goes against the robustness principle: "be conservative in what you do, be liberal in what you accept from others".

A dirty change describing the requested behavior:

```diff
diff --git a/src/rust/src/x509/certificate.rs b/src/rust/src/x509/certificate.rs
index 4ab8d3702..1b3df189a 100644
--- a/src/rust/src/x509/certificate.rs
+++ b/src/rust/src/x509/certificate.rs
@@ -348,7 +348,16 @@ fn load_pem_x509_certificate(py: pyo3::Python<'_>, data: &[u8]) -> PyAsn1Result<
 #[pyo3::prelude::pyfunction]
 fn load_der_x509_certificate(py: pyo3::Python<'_>, data: &[u8]) -> PyAsn1Result {
-    let raw = OwnedRawCertificate::try_new(Arc::from(data), |data| asn1::parse_single(data))?;
+    let raw = OwnedRawCertificate::try_new(Arc::from(data), |data| {
+        // Initialize to a dummy error as the compiler cannot prove that read_element will be called at all
+        let cert = std::cell::RefCell::new(Err(asn1::ParseError::new(asn1::ParseErrorKind::ShortData)));
+        asn1::parse::<(), asn1::ParseError, _>(data, |parser| {
+            // At this point we can replace the dummy result with whatever the result of the parsing was
+            cert.replace(parser.read_element()).err();
+            Ok(())
+        }).unwrap_or(()); // We are discarding the possible ExtraData error here
+        cert.into_inner()
+    })?;
     // Parse cert version immediately so we can raise error on parse if it is invalid.
     cert_version(py, raw.borrow_value().tbs_cert.version)?;
     // determine if the serial is negative and raise a warning if it is. We want to drop support
```
kissgyorgy commented 2 years ago

Also, every other tool I tried (dumpasn1, pyasn1, openssl x509) can handle these kinds of certificates / random data; only cryptography cannot, so I find its behavior inconsistent with the rest of the "cert parsing ecosystem".

alex commented 2 years ago

So, the robustness principle is bad :-) https://datatracker.ietf.org/doc/html/draft-iab-protocol-maintenance does a good job of laying out why: in the short term you might get interoperability, but in the long term your ecosystem is no longer meaningfully guided by the specification; instead it's a victim of "how the largest implementations behave".

Can you explain your actual use case here in more detail? I don't understand the situation where such functionality would be required.

kissgyorgy commented 2 years ago

> Can you explain your actual use case here in more detail

We are doing Software Composition Analysis and File Enumeration, searching for certificates and private keys in the process wherever we can find them. They can be in an XML file, embedded in a PDF document, or in even weirder places 😄 We can't tell beforehand where they are, so there is always some garbage at the end of the certificates.

We can't parse them with cryptography because of this restriction. I think I understand why this library is so restrictive (it wants to be the de-facto standard for security purposes, providing easy-to-use APIs even for beginners), and I agree that it should have strict defaults, but I don't see why it can't be a bit more flexible to support more complex use cases like ours, and I disagree with behavior that is completely different from every other cert-parsing tool out there.
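
The scanning workflow described above can be sketched with a simple heuristic (a hypothetical helper, not part of cryptography's API): X.509 certificates are DER SEQUENCEs, so they start with tag byte `0x30`, and real-world certificates are almost always longer than 255 bytes, so the length field uses the two-byte long form marker `0x82`.

```python
def candidate_der_offsets(blob: bytes) -> list[int]:
    """Return offsets in `blob` where a DER certificate *may* start.

    Heuristic sketch: look for SEQUENCE tag (0x30) followed by a
    two-byte long-form length (0x82). Candidates still need to be
    validated by actually parsing them.
    """
    offsets, start = [], 0
    while (i := blob.find(b"\x30\x82", start)) != -1:
        offsets.append(i)
        start = i + 1
    return offsets
```

Each candidate offset would then be handed to a real parser; false positives simply fail to parse.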

alex commented 2 years ago

My immediate recommendation for you would be to use some other DER parser (maybe even a tiny hand-written one) to find the length of the outer TLV, and simply pass exactly that many bytes to cryptography, so as to avoid this issue.
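
A minimal version of that hand-written length parser might look like this in Python (a sketch; `der_tlv_length` is a hypothetical helper, and it only handles definite-length encodings, which is all DER permits):

```python
def der_tlv_length(data: bytes, offset: int = 0) -> int:
    """Return the total size (header + content) of the DER TLV at `offset`."""
    if len(data) - offset < 2:
        raise ValueError("truncated TLV")
    length_byte = data[offset + 1]
    if length_byte < 0x80:
        # Short form: the length fits in a single byte.
        return 2 + length_byte
    num_len_bytes = length_byte & 0x7F
    if num_len_bytes == 0 or len(data) - offset < 2 + num_len_bytes:
        raise ValueError("indefinite or truncated length (not valid DER)")
    content_len = int.from_bytes(data[offset + 2 : offset + 2 + num_len_bytes], "big")
    return 2 + num_len_bytes + content_len
```

The caller can then slice the blob to exactly the certificate, e.g. `blob[off : off + der_tlv_length(blob, off)]`, and pass that slice to `x509.load_der_x509_certificate()`.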

We try pretty hard to avoid adding additional levels of configuration, particularly in things as fundamental as parsing, so it's unlikely we'd change this behavior ourselves.

vlaci commented 2 years ago

I'd like to know more about your position on this specific issue.

My proposition was that a pointer to the start of a certificate in a file/memory is sufficient to unambiguously load it, so having to know the certificate's length in the library user's code is redundant, and in some cases (albeit ours may not be a mainstream one, I admit) requires a partial understanding of the ASN.1 structures that wouldn't otherwise be needed. There are cases like digitally signed files, where additional data appended to the end of arbitrary file formats doesn't tend to break them. It may be ignorance on my part, but I don't see any issue with accepting surplus data.

On the other hand, this is in no way a blocker issue for us; I just wanted to shed light on this library's different (and to me surprising) behavior. At this point it's more about curiosity for me; I don't have a strongly held opinion on whether it is good or bad.

alex commented 2 years ago

Sure, happy to explain why we think this behavior makes sense:

It gives us symmetry between parsing the top-level TLV and inner TLVs. When parsing a nested SEQUENCE in ASN.1 it's very important not to allow trailing data at the end of a TLV, else you get various signature forgery attacks. For this reason rust-asn1 (the underlying library we use) always enforces that all data in any TLV is consumed. The top-level TLV is not different from any inner TLV in this respect, so it's simply a natural consequence of the implementation.
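
A toy illustration of that forgery concern, in Python rather than rust-asn1: if a parser ignores leftover bytes inside a SEQUENCE, two different byte strings (and therefore two different signed messages) decode to the same value. The parser below is a deliberately simplified sketch (short-form lengths only, single INTEGER element), not the library's actual code:

```python
def parse_first_int(seq: bytes, strict: bool = True) -> int:
    """Parse SEQUENCE { INTEGER }, optionally rejecting trailing content.

    A signature covers the *bytes*, not the decoded value, so a lenient
    parser that skips leftover bytes lets distinct encodings collide on
    one decoded value -- the root of several forgery attacks.
    """
    if seq[0] != 0x30:
        raise ValueError("expected SEQUENCE")
    content = seq[2 : 2 + seq[1]]          # short-form length only (toy)
    if content[0] != 0x02:
        raise ValueError("expected INTEGER")
    int_len = content[1]
    value = int.from_bytes(content[2 : 2 + int_len], "big")
    leftover = content[2 + int_len :]
    if strict and leftover:
        raise ValueError("trailing data inside SEQUENCE")
    return value
```

With `strict=False`, both `30 03 02 01 05` and `30 04 02 01 05 00` decode to 5 even though they are different byte strings; the strict parser accepts only the first.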

The fact that this falls naturally out of the implementation helps elucidate an important point: we don't know where the data passed to load_der_x509_certificate() comes from, and in fact it could simply be a TLV from another protocol!

For this reason, this behavior is necessary to fail closed.

If we did want to support parsing data with a TLV + additional trailing data, the API would need to be something with a return value of (Certificate, bytes) so that the caller could know what trailing data was present, as that will be necessary for any correct handling of a protocol that relied on this.

kissgyorgy commented 2 years ago

As @vlaci mentioned, attaching random data (notes, explanations, etc.) to a certificate is very common. I just tried it out, and x509.load_pem_x509_certificate() parses a PEM cert with extra data without a problem. Is PEM parsing different because it isn't subject to the mentioned signature forgery attack?
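
The intuition behind PEM's tolerance can be sketched as follows: PEM delimits the payload with BEGIN/END armor markers, so surrounding garbage is ignored by construction, whereas DER has no markers, only lengths. This regex-based extractor is a hypothetical illustration, not cryptography's implementation:

```python
import re

# The armor markers bound the payload, so anything outside them is
# naturally skipped -- unlike DER, which has no such delimiters.
_PEM_RE = re.compile(
    rb"-----BEGIN CERTIFICATE-----\r?\n(.+?)\r?\n-----END CERTIFICATE-----",
    re.DOTALL,
)

def extract_pem_blocks(blob: bytes) -> list[bytes]:
    """Return each PEM certificate block found anywhere in `blob`."""
    return [m.group(0) for m in _PEM_RE.finditer(blob)]
```

The base64 payload between the markers is still strict DER once decoded, so the forgery argument applies to the decoded bytes either way.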

> the API would need to be something with a return value of (Certificate, bytes) so that the caller could know what trailing data was present

Alternatively, the extra bytes (or maybe even just their position and length, since the caller fed them in and already has them) could be attached to the current ValueError as properties.

tiran commented 2 years ago

FYI, OpenSSL's trusted certificates have an AUX ASN.1 structure appended to the certificate data. The AUX section contains user-defined information such as overrides for trust settings.

vlaci commented 2 years ago

> It gives us symmetry between parsing the top-level TLV and inner TLVs. When parsing a nested SEQUENCE in ASN.1 it's very important not to allow trailing data at the end of a TLV, else you get various signature forgery attacks. For this reason rust-asn1 (the underlying library we use) always enforces that all data in any TLV is consumed. The top-level TLV is not different from any inner TLV in this respect, so it's simply a natural consequence of the implementation.

Ohh, I was under the impression that the parser could only trip on the is_empty check in the finish call at the end of parse; hence I thought this behavior could only arise at the outermost layer. I haven't checked how nested structures are handled. I guess adding a consume_remaining_data method on the parser, called only on the outermost parser, seems possible, according to my limited understanding.

Other than that, we are going forward with parsing the size from the outer DER TLV ourselves. It was nice discussing this issue with you. Thanks for your helpful insights. Wish you all the best!