internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Record SSL certificates? #13

Open jcushman opened 9 years ago

jcushman commented 9 years ago

I'm curious about teaching warcprox to record SSL certificates. Has there been any internal work or discussion you can share? Do any other crawlers (Heritrix) currently record certificates?

Cf. this discussion of the WARC spec: https://github.com/iipc/warc-specifications/issues/12

nlevitt commented 9 years ago

There hasn't been any internal work or discussion, nor does Heritrix currently record certificates. I think it's a great idea though.

The standard Python ssl library has getpeercert(), but doesn't have a way to get the full cert chain. It looks like that can be done with the pyOpenSSL library, which warcprox already requires. But it might be enough to record only the peer cert for now.
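For reference, a minimal sketch of the peer-cert-only option using just the standard library. The helper names `der_to_pem` and `fetch_peer_cert_pem` are made up for illustration and are not part of warcprox. Verification is switched off here on the assumption that the goal is to record whatever the server presents rather than to judge it.

```python
import socket
import ssl


def der_to_pem(der_bytes: bytes) -> str:
    """Convert a DER-encoded certificate to PEM text."""
    return ssl.DER_cert_to_PEM_cert(der_bytes)


def fetch_peer_cert_pem(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port and return the leaf certificate as PEM.

    Only the peer (leaf) certificate is returned, not the full chain.
    """
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            # binary_form=True yields the DER bytes even without validation.
            der_bytes = tls_sock.getpeercert(binary_form=True)
    return der_to_pem(der_bytes)
```

Calling `fetch_peer_cert_pem("example.com")` needs network access; `der_to_pem` can be exercised offline.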

The main question is how best to record the information in the WARC. There are a number of ways we could do it. My inclination at the moment is to write a special metadata record concurrent to the first URL recorded with a given certificate. Subsequent URLs from the same host could reference that metadata record. The content of the record could be the cert only, in PEM format. Even better would be something that looks like the output of "openssl x509 -text ..." (which includes the cert in PEM format).
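A rough sketch of that "write once, reference afterwards" idea, using only the standard library and building the record bytes by hand. Everything named here is an assumption for illustration: the `CertRecordIndex` class, its `record_for` method, and the `application/x-pem-file` content type are placeholders, not anything warcprox or the WARC spec defines. `WARC-Concurrent-To` is an existing WARC field.

```python
import hashlib
import uuid
from datetime import datetime, timezone


class CertRecordIndex:
    """Remembers which certificates already have a metadata record."""

    def __init__(self):
        self._record_id_by_fingerprint = {}

    def record_for(self, first_url: str, concurrent_to: str, cert_pem: str):
        """Return (record_id, record_bytes).

        record_bytes is None when this certificate was already written,
        in which case the caller only needs to reference record_id.
        """
        block = cert_pem.encode("ascii")
        fingerprint = hashlib.sha256(block).hexdigest()
        known_id = self._record_id_by_fingerprint.get(fingerprint)
        if known_id is not None:
            return known_id, None

        record_id = f"<urn:uuid:{uuid.uuid4()}>"
        self._record_id_by_fingerprint[fingerprint] = record_id
        warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        headers = [
            "WARC/1.0",
            "WARC-Type: metadata",
            f"WARC-Record-ID: {record_id}",
            f"WARC-Date: {warc_date}",
            f"WARC-Target-URI: {first_url}",
            f"WARC-Concurrent-To: {concurrent_to}",
            "Content-Type: application/x-pem-file",  # placeholder type
            f"Content-Length: {len(block)}",
        ]
        # A WARC record is: header lines, blank line, block, two CRLFs.
        head = "\r\n".join(headers).encode("utf-8")
        return record_id, head + b"\r\n\r\n" + block + b"\r\n\r\n"
```

The index is keyed on a hash of the certificate rather than on the host, so a certificate shared by several hosts would still be written only once.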

What do you think?

JustAnotherArchivist commented 4 years ago

After having wanted to investigate this for a long time, I just finally looked a bit into it since I'd like to add TLS certificate records to other WARC-writing tools written in Python and thought I'd share the key findings here.

While pyOpenSSL does have the get_peer_cert_chain function, it also has (at least) one big issue: certificate validation using the system's certificate trust store will basically only work reliably on common Linux distributions. So something like certifi would be needed in addition, and even with regular updates of that package, the trusted certificates could still differ between Python and all applications using the system store, making things very messy (e.g. "why can't <WARC-writing tool> archive this site? it works in my browser!!!"). So I don't think this is a viable route. The proper solution would be https://bugs.python.org/issue18233, but that issue has been open for 6.5 years and doesn't look like it will be resolved soon.

Regarding the actual WARC records, I basically arrived at the same idea as you did. On every newly established TLS connection, a metadata record containing the certificate would be written (optionally deduped against previous such records using a revisit), and then all the requests/responses going through that connection would refer to the record using a new WARC header field. That's for the certificate itself. But I think that the other connection parameters should be stored as well, namely the TLS protocol version and the cipher suite. For this, I'd also use a metadata record on every new TLS connection which, like the warcinfo record, contains simply two fields Protocol and Cipher or similar. It could also refer to the certificate record in another field, and then the requests/responses would refer to this "TLS connection" record. I'm not sure what the URI on this record should be exactly, but something like tls://host:port would make sense to me; not https since it's a different layer and should be generic enough to also be used for other TLS-based connections like FTPS.
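To make the "TLS connection" record a bit more concrete, here is a small sketch of its target URI and its warcinfo-style body. The function names and the `Certificate-Record` field name are invented for illustration; only `Protocol`, `Cipher` and the `tls://host:port` form come from the description above.

```python
def tls_connection_uri(host: str, port: int) -> str:
    """Build the tls://host:port URI for a TLS connection record."""
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"  # bracket IPv6 literals so the port stays unambiguous
    return f"tls://{host}:{port}"


def tls_connection_fields(protocol: str, cipher: str, certificate_record_id=None) -> bytes:
    """Build a "name: value" body, in the style of a warcinfo record."""
    fields = [("Protocol", protocol), ("Cipher", cipher)]
    if certificate_record_id is not None:
        # Hypothetical field name pointing at the certificate record.
        fields.append(("Certificate-Record", certificate_record_id))
    return "".join(f"{name}: {value}\r\n" for name, value in fields).encode("utf-8")
```

Requests and responses on the connection would then point at this record's ID through whatever new header field gets agreed on.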

This seems like the wrong place for discussion about the details (since it's not only about warcprox), but the warc-specifications issue was basically closed with "implementations first please", so I don't know where it would be appropriate.

nlevitt commented 4 years ago

Hey thanks for the research @JustAnotherArchivist.

> While pyOpenSSL does have the get_peer_cert_chain function, it also has (at least) one big issue: certificate validation using the system's certificate trust store will basically only work reliably on common Linux distributions.

Why does this come into play? Warcprox doesn't validate certificates currently. https://github.com/internetarchive/warcprox/blob/f77c152037/warcprox/mitmproxy.py#L303 Does get_peer_cert_chain only work properly when validating certs?

> For this, I'd also use a metadata record on every new TLS connection which, like the warcinfo record, contains simply two fields Protocol and Cipher or similar.

I like the idea of recording TLS protocol and cipher info. I think it can be expressed with enough brevity to fit in a single warc header on each warc record though. curl -v logs something like this:

* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256

Drawing inspiration from that, we could do something like

WARC-TLS-Connection: TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
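A minimal sketch of producing that header value from the tuple that Python's `SSLSocket.cipher()` returns. The function name `tls_connection_header` is hypothetical, and `WARC-TLS-Connection` is only the header floated in this thread, not an adopted field.

```python
def tls_connection_header(cipher_info):
    """Format the proposed header from an SSLSocket.cipher() result.

    cipher_info is the (cipher_name, protocol_version, secret_bits) tuple,
    or None when no connection has been established.
    """
    if cipher_info is None:
        return None
    cipher_name, protocol_version, _secret_bits = cipher_info
    return ("WARC-TLS-Connection", f"{protocol_version} / {cipher_name}")
```

The value simply echoes whatever names the local TLS library reports.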

> This seems like the wrong place for discussion about the details (since it's not only about warcprox), but the warc-specifications issue was basically closed with "implementations first please", so I don't know where it would be appropriate.

I think this is a fine place for this discussion. Also the #warc channel on IIPC slack. Having trouble finding the warc-specifications issue, link? To me that seems like a good place too, especially if framed as discussion / WIP.

ato commented 4 years ago

> I like the idea of recording TLS protocol and cipher info.

I do too.

There seems to be some variation in how cipher suites are named. The ECDHE-RSA-AES128-GCM-SHA256 format may be OpenSSL-specific, whereas the IANA registration and RFC 5289 call it TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256.

| Software | Displayed cipher suite |
| --- | --- |
| Chrome | "using TLS 1.2, ECDHE_RSA with P-256, and AES_128_GCM." |
| curl | TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256 |
| Firefox | TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, 128 bit keys, TLS 1.2 |
| Java HttpsUrlConnection.getCipherSuite() | TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 |
| Python ssl_socket.cipher() | ('ECDHE-RSA-AES128-GCM-SHA256', 'TLSv1.2', 128) |

Also noting for discussion there's an existing header proposal (WARC-Protocol) which includes the TLS version but not ciphersuite: https://github.com/iipc/warc-specifications/issues/42

nlevitt commented 4 years ago

Thanks @ato, that's really useful information. I suppose we should try to use the IANA names.

Interestingly I get an IANA name from ssl_socket.cipher() on my mac with python 3.7:

>>> ssl_socket = ssl.wrap_socket(socket.create_connection(('example.com', 443)))
>>> ssl_socket.cipher()
('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)

AFAICT it is using openssl, not some other library. On linux, python 3.5, I see what you see

>>> ssl_socket = ssl.wrap_socket(socket.create_connection(('example.com', 443)))
>>> ssl_socket.cipher()
('ECDHE-RSA-AES128-GCM-SHA256', 'TLSv1/SSLv3', 128)

We might have to hardcode a mapping in warcprox. I found these relevant resources: https://testssl.sh/openssl-iana.mapping.html https://gist.github.com/Chion82/dafc1b209eb94b90f4bf090c6ae694e5

> Also noting for discussion there's an existing header proposal (WARC-Protocol) which includes the TLS version but not ciphersuite: iipc/warc-specifications#42

Would you propose we put this info in the WARC-Protocol header?

ato commented 4 years ago

> Interestingly I get an IANA name from ssl_socket.cipher() on my mac with python 3.7

Hmm. Looks like OpenSSL uses the IANA names for TLS 1.3 ciphers but its own names for older ciphers. This is on Fedora with openssl 1.1.1d:

>>> ssl.wrap_socket(socket.create_connection(('example.com', 443))).cipher()
('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
>>> ssl.wrap_socket(socket.create_connection(('nla.gov.au', 443))).cipher()
('ECDHE-RSA-AES128-GCM-SHA256', 'TLSv1.2', 128)

> We might have to hardcode a mapping in warcprox.

The command openssl ciphers -stdname seems to print a mapping, but I don't see any obvious way to access the information via the Python module. Grepping Python's source code, I don't see any calls to the C function SSL_CIPHER_standard_name(). I suppose we could call it with ctypes, but indeed it's probably simpler and more portable to hardcode a mapping.
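A sketch of what such a hardcoded mapping could look like. The table below is deliberately tiny and only covers a few ECDHE GCM suites; a real one would need to be generated from a complete source such as the mapping pages linked earlier. The names `OPENSSL_TO_IANA` and `iana_cipher_name` are made up for this example. It assumes, per the observation above, that names already starting with `TLS_` are IANA names and can pass through unchanged.

```python
# Partial, illustrative table: OpenSSL cipher name -> IANA cipher suite name.
OPENSSL_TO_IANA = {
    "ECDHE-RSA-AES128-GCM-SHA256": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "ECDHE-RSA-AES256-GCM-SHA384": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256": "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "ECDHE-ECDSA-AES256-GCM-SHA384": "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
}


def iana_cipher_name(openssl_name: str):
    """Return the IANA name for an OpenSSL cipher name, or None if unknown."""
    if openssl_name.startswith("TLS_"):
        # TLS 1.3 suites are already reported under their IANA names.
        return openssl_name
    return OPENSSL_TO_IANA.get(openssl_name)
```

Returning None for unmapped names leaves it to the caller to decide whether to fall back to the OpenSSL name or omit the header.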

There's a PEP for a unified Python TLS API which mentions the name problem but it's still a draft: https://www.python.org/dev/peps/pep-0543/#proposed-interface

> there's an existing header proposal (WARC-Protocol) which includes the TLS version but not ciphersuite
>
> Would you propose we put this info in the WARC-Protocol header?

Nah, adding protocol-specific details would probably make WARC-Protocol unnecessarily complex to parse. I think we should either amend the WARC-Protocol proposal not to cover TLS or use two separate headers like:

WARC-Protocol: tls/1.2 # or TLSv1.2 .. still not sure what's best to do here
WARC-TLS-Cipher-Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256

I don't have a strong preference either way. I originally included TLS in WARC-Protocol for completeness and to try to represent protocol layering but it may instead be simpler to treat TLS separately as it's not the application layer protocol.

Edit 2023-12-19: See the WARC-Cipher-Suite proposal in https://github.com/iipc/warc-specifications/issues/86