Open acidus99 opened 1 year ago
Not specifically opposed to this, but is the cipher suite alone actually useful / actionable? The original issue was around storing the full SSL cert, which arguably has more value. What is the actual problem being solved?
For example, do many tools actually store a request record without a corresponding response record in the case of an error? Our (Webrecorder) tools generally don't, one of the reasons being that this is somewhat ambiguous: is the response missing because of a TLS error, a DNS error, other connectivity issues, or was there just an error writing the WARC record? Given just a missing response, I think most tools/users might just assume it was a serialization issue and ignore it. If the intent is to record such errors, perhaps a record type / convention should be created for that purpose specifically.
When crawling via a Chromium-based browser, it's possible to get full info about the cert, for example using: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-SecurityDetails One thing we've currently been doing is storing a generic metadata header field like this, based on this info:
WARC-JSON-Metadata: {"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}
which conveys two key properties: the issuer of the cert, and whether Chrome considers it Certificate Transparency-compliant. This could be used to distinguish MITM certs, for example. We could also add other data there / standardize on a format like this with optional properties, etc.
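Reading that field back could look roughly like the sketch below. WARC-JSON-Metadata is a non-standard field, the helper name cert_summary is made up for illustration, and the "ctc" value is returned as-is because its encoding is not defined in this thread.

```python
import json


def cert_summary(header_value):
    """Return (issuer, ctc) from a WARC-JSON-Metadata value, or None.

    Hypothetical helper: the field layout is assumed from the single
    example above, not from any specification.
    """
    try:
        data = json.loads(header_value)
    except json.JSONDecodeError:
        return None
    cert = data.get("cert") if isinstance(data, dict) else None
    if not isinstance(cert, dict):
        return None
    # Both properties are treated as optional, so either may be None.
    return cert.get("issuer"), cert.get("ctc")


value = '{"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}'
print(cert_summary(value))
```

A consumer could then compare the issuer against whatever set of issuers it expects for the crawled host, which is one way to flag a possible MITM cert.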
I think separating certificates from TLS cipher info makes sense for several small reasons:
However, the big reason to separate them is that the number of questions and nuances around how to efficiently store X.509 certs in WARCs overwhelms the questions for cipher suites if the issues are combined.
openssl x509 -noout -text-style formatted output like @nlevitt suggested? Or just put it in a PEM and let people parse it? What about multiple certs? Is that multiple metadata records, or a single metadata record with multiple PEMs? (FWIW, I'm storing them as PEM in a metadata record with an application/x-pem-file Content-Type.)
Refer-To-ing to a request record is a server cert or client-side cert? Hopefully the WARC creator stored enough of the certificate to include extended key OIDs so the program can, but now it would need to parse the payload to know.
revisit record on every request/response?
metadata record, since the metadata record itself doesn't really have a Target-URI.
A lot of that is obviously up to the WARC creator, and things like revisits are optional, and other things are extreme edge cases (e.g. client-side certificates). But there are a lot of decisions and complexity with storing the (often multiple) KB of certificates associated with web resources. Compared with a WARC field whose value is smaller than a base-16 SHA-256 Content Digest field, I think it's helpful to separate them. 😀
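To illustrate the "single metadata record with multiple PEMs" option, here is a minimal sketch of how a reader might split such a payload back into individual certificates. The helper names are hypothetical, and the demo bundle uses placeholder bytes rather than real certificates.

```python
import re
import ssl

# Matches one PEM-armoured certificate, including its BEGIN/END lines.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(payload):
    """Return every PEM certificate block found in a text payload."""
    return PEM_BLOCK.findall(payload)


def to_der(pem):
    """Decode one PEM block to DER bytes using the standard library."""
    return ssl.PEM_cert_to_DER_cert(pem)


# Placeholder "certificates": arbitrary bytes wrapped in PEM armour,
# only to exercise the splitting logic.
bundle = ssl.DER_cert_to_PEM_cert(b"leaf") + ssl.DER_cert_to_PEM_cert(b"intermediate")
parts = split_pem_bundle(bundle)
print(len(parts), [to_der(p) for p in parts])
```

Telling a server cert from a client-side cert would still require parsing the DER with an X.509 library, which is the extra payload parsing mentioned above.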
(Personally I do want to hear thoughts on how to store certificates and if things have changed since @JustAnotherArchivist's work a few years ago, but thought it made sense to do that in a separate issue)
The new release of Wget-AT at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.21.3-at.20231213.01 now implements this WARC header. The decision was made to use WARC-Cipher-Suite instead of WARC-TLS-Cipher-Suite, for reasons outlined in the release notes. The allowed values for TLS and SSL certificates are outlined in the release notes as well.
Next to the WARC-Cipher-Suite header, the WARC-Protocol header is implemented as well, according to the proposed definition at https://github.com/iipc/warc-specifications/issues/42.
Wget-AT is used for the Archive Team Warrior projects. As this new Wget-AT version is rolled out to all Warrior projects, the WARC-Cipher-Suite and WARC-Protocol WARC headers will start appearing on hundreds of millions of WARC records that are created every day (which are available at https://archive.org/details/archiveteam).
This release is the first of several releases to improve SSL/TLS recording in WARC records. The two new headers are seen as a 'minimal' representation of the details of the SSL/TLS session.
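For anyone wanting to look for these headers in downloaded WARCs, a rough sketch of pulling them out of a record's header block follows. The sample header values are illustrative placeholders, not copied from Wget-AT output, and the parser ignores folded continuation lines for brevity.

```python
def parse_warc_headers(block):
    """Parse a WARC header block into {lowercased name: [values]}.

    Values are collected in lists because a named field may appear
    more than once in a record.
    """
    fields = {}
    lines = block.strip().splitlines()
    for line in lines[1:]:  # skip the version line, e.g. "WARC/1.1"
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields


# Hypothetical header block for a response record.
sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Protocol: http/1.1\r\n"
    "WARC-Protocol: tls/1.3\r\n"
    "WARC-Cipher-Suite: TLS_AES_256_GCM_SHA384\r\n"
)
headers = parse_warc_headers(sample)
print(headers.get("warc-protocol"))
print(headers.get("warc-cipher-suite"))
```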
@Arkiver2 nice work! I like your logic behind the WARC-Cipher-Suite naming vs WARC-TLS-Cipher-Suite. @ato I suggest if (and hopefully when) this proposal gets adopted, it uses the WARC-Cipher-Suite field name.
Since @acidus99 supports the name change and the only software I could find using the original name WARC-TLS-Cipher-Suite is acidus99/Kennedy, I've edited the proposal to the new name WARC-Cipher-Suite. I've left the text of the definition as is, but suggestions for how to update the wording to cover the case of SSL would be welcome.
I updated Kennedy to use the new WARC-Cipher-Suite field name:
https://github.com/acidus99/Kennedy/commit/2aaf7ace55a62053d35e201c066c0af72a9bdc32
This field was previously discussed by @ato, @nlevitt, and @JustAnotherArchivist on an issue in a different repository. That discussion intermixed many topics, like the proposed WARC-Protocol field as well as storing X.509 certificates in metadata records. Adding this issue so the idea can be properly discussed and tracked for WARC 1.1+.
Proposal
The WARC-Cipher-Suite field is the TLS cipher suite which was used to retrieve any included content. The TLS cipher suite shall be written as the IANA TLS Cipher Suites Value (e.g. TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384).
The WARC-Cipher-Suite field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’ records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
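The two rules above could be checked by a validator along these lines. This is a minimal sketch: the function name is hypothetical, and the name check only tests the general shape of an IANA TLS cipher suite name (it does not consult the IANA registry or cover any SSL-era values).

```python
import re

# Record types on which the proposal permits the field.
ALLOWED_TYPES = {"response", "resource", "request", "metadata", "revisit"}

# Loose shape check only (assumption): IANA names look like TLS_<UPPERCASE_WORDS>.
NAME_SHAPE = re.compile(r"TLS_[A-Z0-9_]+")


def check_cipher_suite_field(record_type, value):
    """Return a list of problems with a WARC-Cipher-Suite field (empty if none)."""
    problems = []
    if record_type not in ALLOWED_TYPES:
        problems.append(
            "WARC-Cipher-Suite not permitted on '%s' records" % record_type
        )
    if not NAME_SHAPE.fullmatch(value):
        problems.append("'%s' does not look like an IANA cipher suite name" % value)
    return problems


print(check_cipher_suite_field("response", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"))
print(check_cipher_suite_field("warcinfo", "TLS_AES_128_GCM_SHA256"))
print(check_cipher_suite_field("request", "ECDHE-RSA-AES256-GCM-SHA384"))
```

The last call uses an OpenSSL-style name, which is the kind of value a tool would need to translate to the IANA name before writing the field.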
Motivation
Storing the TLS parameters used to retrieve content is valuable for many use cases (research, archival/posterity, troubleshooting). For example, it could provide context for why a request doesn't have a corresponding response record. The proposed WARC-Protocol field is used to record the protocol version. The WARC-Cipher-Suite field augments this by including what cipher suite was used. As a bonus, the IANA already defines and standardizes the values of these cipher suites, and those values are already used internally by many tools (especially for more modern ciphers).
Background
Per this thread, @nlevitt and @ato both liked the idea of recording TLS protocol and cipher info in a WARC file. @nlevitt originally proposed a single custom field that would include both the TLS protocol version and cipher suite that were negotiated. However, given that the WARC-Protocol field was being planned separately, @ato recommended using WARC-Protocol to record the TLS protocol version and a new field to record the cipher.
Questions
WARC-Cipher-Suite to future-proof for other uses beyond TLS? The WARC-Protocol field defines what protocol is used (FTP, TLS, or even a successor). This cipher suite field is an additional/optional field, applicable only when used with a WARC-Protocol value that supports encryption, recording what cipher suite was used. Baking "TLS" into the field name may cause a problem in the future. (I can't help but think of software and standards that still use the "SSL Certificate" or "SSL connection" terminology 🤮)
Edited 2023-12-19 by @ato: Renamed from WARC-TLS-Cipher-Suite to WARC-Cipher-Suite as implemented by @Arkiver2 in Wget-AT and agreed to by @acidus99