wbond / asn1crypto

Python ASN.1 library with a focus on performance and a pythonic API
MIT License
331 stars 140 forks source link

Misbehaving URIs #195

Open mttcpr opened 3 years ago

mttcpr commented 3 years ago

It seems that when a certificate contains a URI that lacks a hostname (e.g. ldap:///cn=xyz) there is some misbehavior decoding it. Specifically, the .native ends up with "ldap:/cn=xyz" - two of the three slashes were removed. calling tbs_cert.dump(force=True) also results in the URI being encoded with only one slash. I spent a little time stepping into the code, but quickly became frustrated trying to figure out where it was coming from. I'm hoping you'll read this and say 'Aha I know exactly why that is happening' and be able to easily patch it.

joernheissler commented 3 years ago

Sounds related to #33. https://github.com/wbond/asn1crypto/blob/master/asn1crypto/_iri.py should contain the code which modifies your URIs.

mttcpr commented 3 years ago

I tracked it down to use of urlunsplit from urllib/parse.py. If I add 'ldap' to its uses_netloc list, then it produces the correct output. urllib has no specific support for an ldap uri. As much as I would love to kick ldap to the curb myself, this seems somewhat problematic given ldap is definitely still in use for PKI, particularly PKI from Redmond.

I haven't followed these changes, but do we really need this level of data manipulation (i.e. breaking every URI into all its components and reassembling them) to populate the .native field? Perhaps it's not unreasonable to assume URIs should be processed using whatever library is responsible for handling the scheme outside of the certificate implementation.

At this point the only resolution I can see is getting support for ldap in urllib, or, not using this functionality from urllib, at least not in the way it's being used today.

wbond commented 3 years ago

I haven't followed these changes, but do we really need this level of data manipulation (i.e. breaking every URI into all its components and reassembling them) to populate the .native field?

The .native property is designed explicitly to take ASN.1 binary data and produce native Python data types. In the case of a URI, the most native Python representation is a unicode string, in normalized form. There is data in the standard about how URIs are compared for equality (linked in the code for the URI class): https://tools.ietf.org/html/rfc5280#section-7.4.

I would imagine we should be able to fix the issue with the ldap URIs without too much work.

Perhaps it's not unreasonable to assume URIs should be processed using whatever library is responsible for handling the scheme outside of the certificate implementation.

In this case, I would say the .native field isn't what you want, and you should probably ask for the raw binary data (.contents) and use it how you see fit. However, as I mentioned above, it isn't super trivial to make sure you are actually comparing URIs properly if you treat them as binary data.

mttcpr commented 3 years ago

Thanks Will, using .contents is in fact exactly what I did on the read side. The reason this behavior became more problematic for me was the tbs.dump(force=True) resulted in an invalid URL in the output. For now I have sidestepped the observed problem by adding urllib.parse.uses_netloc.append('ldap') to my own __init__, but I am a little worried there may be other cases related to ldap and urlunsplit that haven't presented themselves.

wbond commented 3 years ago

If you've got a fix and the time to submit it, I'm sure @joernheissler or myself can review it.

More robust tests for ldap is certainly welcome if you've got some fixtures you can provide.

joernheissler commented 3 years ago

A full test case, or even just a certificate, would help too.

mttcpr commented 3 years ago

I do have that band-aide.. not sure runtime modification of urllib's behavior rises to the level of fix? All I did was add this:

import urllib.parse

if 'ldap' not in urllib.parse.uses_netloc:
    urllib.parse.uses_netloc.append('ldap')

Attached is a sample domain controller certificate with affected URIs in both CRLDP and AIA. dc-cert.zip

I'm afraid I don't have time to dedicate to adding tests right now, but I may be able to carve out some time over the holidays.

zito commented 1 year ago

Hi, I have a simple script to dump all user certificates from Active Directory and create their sha1 fingerprint list for usage in Postfix (check_ccert_access) using asn1crypto. Today I have found a bad fingerprint of an user causing mail access failed for him. An openssl fingerprint and a fingerprint obtained from certificates sha1_fingerprint atribute differs. I tried to dump a parsed certificate back to file and found out the problem this bugreport is about. For this time I have to switch the script to execute openssl -fingerprint for every certificate in the list. It is time and resource expensive. I will be happy to return back to asn1parse. Original DER encoded certificate is jsm.cer, parsed/dumped by asn1crypto is jsm.out.cer:

zito@bobekpc:~/tmp/asn1crypt-bug$ diff -u <(openssl x509 -text -noout -inform der -in jsm.cer) <(openssl x509 -text -noout -inform der -in jsm.out.cer)

--- /dev/fd/63  2022-12-21 15:54:43.323228438 +0100
+++ /dev/fd/62  2022-12-21 15:54:43.323228438 +0100
@@ -68,10 +68,10 @@
                 C0:F4:23:F4:3F:D2:46:D8:FD:4A:54:F7:56:41:AB:8A:E1:96:00:A7
             X509v3 CRL Distribution Points: 
                 Full Name:
-                  URI:ldap:///CN=ICZ%20a.s.%20Issuing%20CA%202,CN=SPRG018,CN=CDP,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=ad,DC=i,DC=cz?certificateRevocationList?base?objectClass=cRLDistributionPoint
+                  URI:ldap:/CN=ICZ%20a.s.%20Issuing%20CA%202,CN=SPRG018,CN=CDP,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=ad,DC=i,DC=cz?certificateRevocationList?base?objectClass=cRLDistributionPoint
                   URI:http://www.i.cz/ca/ICZ%20a.s.%20Issuing%20CA%202.crl
             Authority Information Access: 
-                CA Issuers - URI:ldap:///CN=ICZ%20a.s.%20Issuing%20CA%202,CN=AIA,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=ad,DC=i,DC=cz?cACertificate?base?objectClass=certificationAuthority
+                CA Issuers - URI:ldap:/CN=ICZ%20a.s.%20Issuing%20CA%202,CN=AIA,CN=Public%20Key%20Services,CN=Services,CN=Configuration,DC=ad,DC=i,DC=cz?cACertificate?base?objectClass=certificationAuthority
                 CA Issuers - URI:http://www.i.cz/ca/ICZ%20a.s.%20Issuing%20CA%202.crt
             1.3.6.1.4.1.311.25.2: 
                 0@.>.

jsm.zip

wbond commented 1 year ago

I’m happy to review a PR for this if it includes tests.

MatthiasValvekens commented 1 year ago

Incidentally, there's a separate but related issue that will cause capitalisation in certain parts of an URI to be normalised when .dump(force=True) is called---I'd have to dig around to find a good example, but bear with me. Since this type of reencoding arguably goes beyond the requirements of DER, I think it's fair to call that a bug too.

So, while I agree that the parsing issue needs to be fixed, I'd potentially approach the reencoding part of the story a little more generally: IMO the behaviour of .dump(force=True) should be to encode the original string as DER, but it shouldn't apply any further normalisation other than what is required by the letter of the spec.

What do you think?

zito commented 1 year ago

After some investigation today I found a bug, that causes exhibition of this URI LDAP bug. So maybe another issue should be created for it. This problem with ldap slashes should not occur in normal situation after loading of a X509 certificate, because it is DER encoded and there is no reason for its reencoding. I have 851 certificates in AD and failure occurs for 6 of them. By tracing of parsing X509 certificates affected and not affected by the bug a difference is on the lines

        # If the length is indefinite, force the re-encoding
        if self._header is not None and self._header[-1:] == b'\x80':
            force = True

The above code is in core.py on more places. A condition self._header[-1:] == b'\x80' is not correct. Indefinite encoding can't be determined such easily. All my problematic X509 certificates starts with header

zito@bobekpc:~/tmp/asn1crypt-bug$ head --bytes 4 jsm.cer | xxd
00000000: 3082 0880                                0...

The contents length is 2176 (0x880) octets. The condition above tries probably detect indefinite encoding, but indefinite encoding has value 0x80 at the first and only length octet, not last octet if I comprehend correctly BER.

wbond commented 1 year ago

That sounds like a possible bug, but the core bug here is that the DER re-encoding is changing the URL due to a bug in the Python URL functionality in the stdlib.

zito commented 1 year ago

Yes the issue is already created #249 ...