iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC-Cipher-Suite field proposal #86

Open acidus99 opened 1 year ago

acidus99 commented 1 year ago

This field was previously discussed by @ato @nlevitt and @JustAnotherArchivist on an issue in a different repository. That discussion intermixed many topics, such as the proposed WARC-Protocol field and storing X.509 certificates in metadata records. I'm adding this issue so the idea can be properly discussed and tracked for WARC 1.1+.

Proposal

The WARC-Cipher-Suite field is the TLS cipher suite which was used to retrieve any included content. The TLS cipher suite shall be written as the IANA TLS Cipher Suites Value (e.g. TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384).

WARC-Cipher-Suite = "WARC-Cipher-Suite" ":" (cipher)
cipher          = <TLS cipher suite value per IANA's TLS Parameters>

The WARC-Cipher-Suite field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’ records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
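As a rough illustration of the definition and record-type rule above, here is a minimal Python sketch that formats the header line. The function name `cipher_suite_header` and the loose name pattern are assumptions made for this example; the pattern is only a syntactic sanity check, not a lookup against the IANA registry, and it does not cover the SSL case raised later in this thread.

```python
import re

# Record types on which the proposal allows the field.
ALLOWED_TYPES = {"response", "resource", "request", "metadata", "revisit"}

# Assumption: IANA cipher suite names start with "TLS_" and contain only
# ASCII letters, digits and underscores (e.g. TLS_DH_anon_WITH_AES_128_CBC_SHA).
CIPHER_NAME = re.compile(r"^TLS_[A-Za-z0-9_]+$")


def cipher_suite_header(record_type: str, cipher: str) -> str:
    """Return a WARC-Cipher-Suite header line, or raise ValueError."""
    if record_type not in ALLOWED_TYPES:
        # Covers 'warcinfo', 'conversion' and 'continuation', and (as a
        # conservative choice of this sketch) any other unlisted type.
        raise ValueError(f"WARC-Cipher-Suite not permitted on {record_type!r} records")
    if not CIPHER_NAME.match(cipher):
        raise ValueError(f"not a plausible IANA cipher suite name: {cipher!r}")
    return f"WARC-Cipher-Suite: {cipher}"


print(cipher_suite_header("response", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"))
```

Rejecting every unlisted record type is stricter than the proposal text requires; a writer that supports extension record types might only reject the three explicitly forbidden ones.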

Motivation

Storing the TLS parameters used to retrieve content is valuable for many use cases (research, archival/posterity, troubleshooting). For example, it could provide context for why a request doesn't have a corresponding response record. The proposed WARC-Protocol field is used to record the protocol version. The WARC-Cipher-Suite field augments this by recording which cipher suite was used. As a bonus, IANA already defines and standardizes the values of these cipher suites, and those values are already used internally by many tools (especially for more modern ciphers).

Background

Per this thread, @nlevitt and @ato both liked the idea of recording TLS protocol and cipher info in a WARC file. @nlevitt originally proposed a single custom field that would include both the TLS protocol version and the cipher suite that were negotiated. However, given that the WARC-Protocol field was being planned separately, @ato recommended using WARC-Protocol to record the TLS protocol version and a new field to record the cipher.

Questions

Edited 2023-12-19 by @ato Renamed from WARC-TLS-Cipher-Suite to WARC-Cipher-Suite as implemented by @Arkiver2 in Wget-AT and agreed to by @acidus99

ikreymer commented 1 year ago

Not specifically opposed to this, but is the cipher suite alone actually useful / actionable? The original issue was around storing the full SSL cert, which arguably has more value. What is the actual problem being solved? For example, do many tools actually store a request record without a corresponding response record in the case of an error? Our (Webrecorder) tools generally don't, one of the reasons being that this is somewhat ambiguous: is the response missing because of a TLS error, a DNS error, other connectivity issues, or was there just an error writing the WARC record? Given just a missing response, I think most tools/users might just assume it was a serialization issue and ignore it. If the intent is to record such errors, perhaps a record type / convention should be created for that purpose specifically.

When crawling via a Chromium-based browser, it's possible to get full info about the cert, for example using: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-SecurityDetails One thing we've currently been doing is storing a generic metadata header field like this, based on that info:

WARC-JSON-Metadata: {"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}

which conveys two key properties: the issuer of the cert, and whether Chrome thinks it is Certificate Transparency-compliant. This could be used to distinguish MITM certs, for example. We could also add other data there / standardize on a format like this with optional properties, etc.
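A minimal Python sketch of how such a header value could be serialized is below. The helper name `cert_metadata_header` is hypothetical, and the `ctc` value is passed through as an opaque string because the thread does not spell out its encoding; only the field name and JSON shape are taken from the example above.

```python
import json


def cert_metadata_header(issuer: str, ctc: str) -> str:
    """Build a single-line WARC-JSON-Metadata header (hypothetical helper)."""
    payload = {"cert": {"issuer": issuer, "ctc": ctc}}
    # Compact separators keep the value on one line without extra spaces;
    # json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
    # so the header value stays plain ASCII.
    return "WARC-JSON-Metadata: " + json.dumps(payload, separators=(",", ":"))


print(cert_metadata_header("GTS CA 1C3", "0"))
```

Optional properties could be added as further keys under `cert` without changing the header name, which is one way to read the "optional properties" idea above.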

acidus99 commented 1 year ago

I think separating certificates from TLS cipher info makes sense for several small reasons:

However, the big reason to separate them is that the number of questions and nuances around how to efficiently store X.509 certs in WARCs would overwhelm the questions about cipher suites if the issues were combined.

A lot of that is obviously up to the WARC creator, things like revisits are optional, and other things are extreme edge cases (e.g. client-side certificates). But there are a lot of decisions and a lot of complexity involved in storing the (often multiple) KB of certificates associated with web resources. Compared with that, this is a WARC field whose value is smaller than a base-16 SHA-256 Content Digest field, so I think it's helpful to separate them. 😀
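The size comparison can be checked with a few lines of Python. This is only an illustration: the payload bytes are made up, and the `sha256:` label prefix on the digest is an assumption about how the digest value is written.

```python
import hashlib

# Example cipher suite name taken from the proposal text.
cipher_value = "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"

# Base-16 SHA-256 digest of an arbitrary payload; the "sha256:" label is an
# assumption of this sketch.
digest_value = "sha256:" + hashlib.sha256(b"example payload").hexdigest()

print(len(cipher_value), len(digest_value))
```

The hex digest alone is always 64 characters, so the example cipher suite name is shorter with or without the label.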

(Personally, I do want to hear thoughts on how to store certificates and whether things have changed since @JustAnotherArchivist's work a few years ago, but I thought it made sense to do that in a separate issue.)

Arkiver2 commented 11 months ago

The new release of Wget-AT at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.21.3-at.20231213.01 now implements this WARC header. The decision was made to use WARC-Cipher-Suite instead of WARC-TLS-Cipher-Suite, for reasons outlined in the release notes. The allowed values for TLS and SSL certificates are outlined in the release notes as well.

Next to the WARC-Cipher-Suite header, the WARC-Protocol header is implemented as well according to the proposed definition at https://github.com/iipc/warc-specifications/issues/42.

Wget-AT is used for the Archive Team Warrior projects. As this new Wget-AT version is rolled out to all Warrior projects, the WARC-Cipher-Suite and WARC-Protocol WARC headers will start appearing on hundreds of millions of WARC records that are created every day (which are available at https://archive.org/details/archiveteam).

This release is the first of several releases to improve SSL/TLS recording in WARC records. The two new headers are seen as a 'minimal' representation of the details of the SSL/TLS session.

acidus99 commented 11 months ago

@Arkiver2 nice work! I like your logic behind the WARC-Cipher-Suite naming vs WARC-TLS-Cipher-Suite. @ato I suggest if (and hopefully when) this proposal gets adopted, it uses the WARC-Cipher-Suite field name.

ato commented 11 months ago

Since @acidus99 supports the name change, and the only software I could find using the original name WARC-TLS-Cipher-Suite is acidus99/Kennedy, I've edited the proposal to use the new name WARC-Cipher-Suite. I've left the text of the definition as is, but suggestions for how to update the wording to cover the case of SSL would be welcome.

acidus99 commented 11 months ago

I updated Kennedy to use the new WARC-Cipher-Suite field name:

https://github.com/acidus99/Kennedy/commit/2aaf7ace55a62053d35e201c066c0af72a9bdc32