Feature Request - Machine readable version of /certificates

jvanasco commented 3 years ago

I believe it would be useful to have a machine readable version of the information in /certificates. That would allow for client developers and integrators to quickly check if anything has changed, and aid in automatically tracking certificate lineage.

The data could be maintained in a json file, similar to those in /data and it could serve two purposes:

- It could be published statically.
- The data file could be used for templating all the language translations of "/certificates". The Let's Encrypt staff would only have to update one JSON file to create updates for all the translated pages.

tdelmas commented 3 years ago

I agree that using a model similar to the one used for CT would be interesting (https://github.com/letsencrypt/website/blob/master/data/transparency.json)

aarongable commented 3 years ago

This is a good idea! I don't expect to get to it very soon, since I'm pretty focused on both the upcoming chain switch and ECDSA issuance, but will keep this in mind (or would be happy to review if someone else tackled it!).

GriffinSoftware commented 3 years ago

The numerous PRs I've made recently against the Chain of Trust page have been directed towards supplementing and structuring the information therein with an eye towards a PR to overhaul the entire page to streamline the presentation and remove redundancies. What say you all towards construction of a JSON tree structure corresponding to the diagram (that @aarongable is hopefully going to commit soon ; ) )?

The initial vision I have is that of an array of root certificate objects (including DST Root CA X3) with each having amongst its properties an array of "signed" intermediate certificate objects.

aarongable commented 3 years ago

I'd recommend a structure more like this:

[
  {
    "displayname": "ISRG Root X1",
    "algorithm": "RSA 4096",
    "o": "Internet Security Research Group",
    "cn": "ISRG Root X1",
    "type": "root",
    "status": "active",
    "certificates": [
      {
        "displayname": "Self-signed by ISRG Root X1",
        "crtsh": <url>,
        "txt": <url>,
        "pem": <url>,
        "der": <url>
      },
      { <repeat for cross-sign> }
    ]
  },
  { <repeat for other certs> }
]

The types would be root and intermediate; the statuses would be active, upcoming, backup, and retired.

I suggest this format because the json files in the data directory are consumed by Hugo, and Hugo's templating isn't going to be good at arbitrary-depth descent through a tree of nested roots and intermediates. Of course, you could reflect the structure of the current page more strongly by hauling type and status up to be dictionary keys, but then updating requires moving entire sections rather than simply changing a single value.
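As a rough illustration of why a single "status" value is convenient for consumers, here is a minimal Python sketch that filters a list in the shape above. The sample data, the placeholder URL, and the `entries_with` helper are all hypothetical and not part of the proposal.

```python
import json

# Hypothetical sample in the proposed shape; the URL is a placeholder.
SAMPLE = """
[
  {
    "displayname": "ISRG Root X1",
    "algorithm": "RSA 4096",
    "o": "Internet Security Research Group",
    "cn": "ISRG Root X1",
    "type": "root",
    "status": "active",
    "certificates": [
      {"displayname": "Self-signed by ISRG Root X1",
       "pem": "https://example.invalid/root.pem"}
    ]
  },
  {
    "displayname": "Example Intermediate",
    "algorithm": "RSA 2048",
    "o": "Example Org",
    "cn": "Example Intermediate",
    "type": "intermediate",
    "status": "retired",
    "certificates": []
  }
]
"""


def entries_with(entries, type_=None, status=None):
    """Return entries matching the given type and/or status (None matches anything)."""
    return [
        e for e in entries
        if (type_ is None or e["type"] == type_)
        and (status is None or e["status"] == status)
    ]


entries = json.loads(SAMPLE)
active_roots = entries_with(entries, type_="root", status="active")
print([e["displayname"] for e in active_roots])
```

Retiring an entry in this layout means changing one "status" string; no section of the file has to move.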

petercooperjr commented 3 years ago

Just a couple random thoughts:

- If "status" is going to be something machine-understandable, it may make sense to split "retired" into "this intermediate no longer signs certificates" and "all certificates signed by this intermediate have already expired".
- Does it make sense to somehow integrate with or get data from CCADB? I guess I just like the idea of there only being one authoritative place for information to be, so maybe this should extract data from CCADB or this should feed its data into CCADB or something like that?

Just brainstorming; these may be terrible ideas.

jvanasco commented 3 years ago

I have been tracking this stuff manually for a while. I am far from a final form, but I do have a small preference from what I've learned: a flat structure appeared to work better when dealing with the cross-signs, with the intermediates then listing the roots that signed them. I also list the IdenTrust (TrustID/DST) root. This allows me to build the full chain, including the trusted root, for extensive tests.

IMHO, the payload should also have the notAfter/expiry date for each cert.
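To make the flat idea concrete, here is a small Python sketch of building every chain up to a trusted root when each record only lists its signers. The record ids, the "signed_by" list layout, and the `chains` helper are hypothetical names for illustration, not my actual tracking format.

```python
# Hypothetical flat records: each certificate lists the ids of its signers.
# A root lists itself; a cross-signed root lists itself and the other root.
CERTS = {
    "root-a": {"type": "root", "signed_by": ["root-a"]},
    "root-b": {"type": "root", "signed_by": ["root-b", "root-a"]},
    "intermediate-1": {"type": "intermediate", "signed_by": ["root-b"]},
}


def chains(cert_id, certs):
    """Return every chain from cert_id up to a self-signed root."""
    paths = []

    def walk(current, path):
        path = path + [current]
        for signer in certs[current]["signed_by"]:
            if signer == current:
                # Self-signed: this path ends at a root.
                paths.append(path)
            elif signer not in path:
                # Follow the signer, skipping anything already visited.
                walk(signer, path)

    walk(cert_id, [])
    return paths


for chain in chains("intermediate-1", CERTS):
    print(" -> ".join(chain))
```

With the cross-sign present, the intermediate yields two chains: one ending at its direct root and a longer one ending at the root that cross-signed it.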

aarongable commented 3 years ago

This page should not feed its data into CCADB; CCADB is authoritative and only a few people have the right/ability to disclose certificates into it. Having this data file be autogenerated from CCADB would be nice, but doing so requires getting someone to create a new public report (similar to https://ccadb-public.secure.force.com/mozilla/CACertificatesInFirefoxReport) listing all certificates owned by Internet Security Research Group (ISRG), so let's save that idea for future improvements.

jvanasco commented 3 years ago

I generated a quick proof-of-concept here: https://github.com/letsencrypt/website/compare/master...jvanasco:feature-machine_readable_certificates

I don't expect this to work as-is, but changes are trivial, as certificates_build.py generates the certificates.json file. The bulk of the work was generating the input data of certificates.

General overview:

I split the certificate payload out from issuer, and split the algorithm into separate type and bits fields. Why? Python is generating this data, and it has it in two fields - so it makes more sense to keep it that way. This script has the same python requirements as Certbot.

The input is a human-curated file, "_certificate_data.json". It has some basic info about the certs which cannot be derived, such as the URLs and status/type. "_name" is just for editing the input (which could be another file with a "lastmod" date).

The script checks to ensure all the urls are valid and are not duplicated within the payload. It also checks to ensure all the URLs for the certificates are online.
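A minimal sketch of the offline half of those checks, written here for illustration rather than copied from certificates_build.py. The `url_problems` helper and the https-only rule are assumptions, "signed_by" is deliberately skipped because it repeats a "pem" URL by design, and the online check is omitted because it needs network access.

```python
from collections import Counter
from urllib.parse import urlparse

# Fields expected to hold a distinct URL per entry.
URL_FIELDS = ("crtsh", "txt", "pem", "der")


def url_problems(entries):
    """Return a list of human-readable problems found in the entries' URLs."""
    problems = []
    seen = Counter()
    for entry in entries:
        name = entry.get("_name")
        for field in URL_FIELDS:
            url = entry.get(field)
            if url is None:
                problems.append(f"{name}: missing {field}")
                continue
            parsed = urlparse(url)
            # Assumption: every URL must be https and have a host.
            if parsed.scheme != "https" or not parsed.netloc:
                problems.append(f"{name}: bad {field} URL {url!r}")
            seen[url] += 1
    problems += [f"duplicate URL {u!r}" for u, n in seen.items() if n > 1]
    return problems


good = {
    "_name": "ISRG Root X1",
    "crtsh": "https://crt.sh/?id=9314791",
    "txt": "https://letsencrypt.org/certs/isrgrootx1.txt",
    "pem": "https://letsencrypt.org/certs/isrgrootx1.pem",
    "der": "https://letsencrypt.org/certs/isrgrootx1.der",
}
print(url_problems([good]))        # no problems expected
print(url_problems([good, good]))  # every URL now appears twice
```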

It derives data from the PEM version. It could also check the versions against one another. "type" and "status" are copied over. "signed_by" is used to track the issuer and is pegged to the issuer's "pem". If there is a cross-signed version, that is tracked too. The URL of the PEM is used as a unique identifier to link certificates together.

Why the flat, non-nested approach?

I keep thinking about how I, and others, would use this data. Keeping it flat seems easier and more database-like.

The workflow I envision is that a Let's Encrypt staff member could just alter a minimal input file and run a script, which then generates a machine-readable version whose data has been checked and tested.

In any event, I'd be happy to submit a PR for this if Let's Encrypt wants to take it over for the reformatting. Otherwise, people can feel free to fork and work on it. If keeping a flat structure, the real customization will be in the output template (lines 193+).

input (one entry; "signed_by" matches the entry's own "pem" because this root is self-signed):

    {
      "_name": "ISRG Root X1",
      "type": "root",
      "status": "active",
      "crtsh": "https://crt.sh/?id=9314791",
      "txt": "https://letsencrypt.org/certs/isrgrootx1.txt",
      "pem": "https://letsencrypt.org/certs/isrgrootx1.pem",
      "der": "https://letsencrypt.org/certs/isrgrootx1.der",
      "signed_by": "https://letsencrypt.org/certs/isrgrootx1.pem"
    },

output:

    {
      "certificate": {
        "algorithm": "RSA", 
        "bits": 4096, 
        "cn": "ISRG Root X1", 
        "notAfter": "20350604110438Z", 
        "notBefore": "20150604110438Z", 
        "o": "Internet Security Research Group", 
        "selfsigned": true
      }, 
      "issuer": {
        "cn": "ISRG Root X1", 
        "o": "Internet Security Research Group", 
        "url_pem": "https://letsencrypt.org/certs/isrgrootx1.pem"
      }, 
      "status": "active", 
      "type": "root", 
      "urls": {
        "crtsh": "https://crt.sh/?id=9314791", 
        "der": "https://letsencrypt.org/certs/isrgrootx1.der", 
        "pem": "https://letsencrypt.org/certs/isrgrootx1.pem", 
        "txt": "https://letsencrypt.org/certs/isrgrootx1.txt"
      }
    },
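As an example of how a client might use the "notAfter" field, here is a short Python sketch that parses the timestamp format shown in the output above. The `parse_cert_time` and `is_expired` helpers are hypothetical, and the sketch assumes every timestamp uses the same 'YYYYMMDDHHMMSSZ' form as this sample.

```python
from datetime import datetime, timezone


def parse_cert_time(value):
    """Parse a 'YYYYMMDDHHMMSSZ' timestamp into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(record, now=None):
    """Return True if the record's notAfter is at or before `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return parse_cert_time(record["certificate"]["notAfter"]) <= now


# Only the fields needed here, taken from the sample output.
record = {
    "certificate": {
        "notAfter": "20350604110438Z",
        "notBefore": "20150604110438Z",
    }
}
print(parse_cert_time(record["certificate"]["notAfter"]).isoformat())
# -> 2035-06-04T11:04:38+00:00
```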

aarongable commented 3 years ago

That's pretty cool! A couple notes:

- Yes, the flat approach is definitely best. But IMO it should be a flat listing of public/private key pairs, and then the list of certificates corresponding to that keypair should be contained within each entry. This reflects the graph structure of the WebPKI, where nodes are keypairs and edges are certificates representing trust relationships between keypairs.
- If we want to incorporate this or something like it into the website repo itself, it should be Go, rather than Python.
- I'm pretty opposed to checking in auto-generated files. It would be best if the only file actually checked in to the repo is the human-editable input file, and the rest of the generation happens at Hugo build-time.
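To show what the keypair-centric listing might look like, here is a rough Python sketch that regroups flat per-certificate records (in the proof-of-concept's output shape) into one entry per keypair. It is Python only to match the proof-of-concept. The `group_by_keypair` helper, the example names, and the placeholder URLs are hypothetical, and using the subject (o, cn) as a stand-in for the keypair is an assumption, since the sample output carries no key identifier.

```python
def group_by_keypair(records):
    """Group flat certificate records into one entry per keypair."""
    keypairs = {}
    for rec in records:
        cert = rec["certificate"]
        # Assumption: the subject (o, cn) stands in for the keypair.
        key = (cert["o"], cert["cn"])
        entry = keypairs.setdefault(key, {
            "o": cert["o"],
            "cn": cert["cn"],
            "type": rec["type"],
            "status": rec["status"],
            "certificates": [],
        })
        # Each certificate is an edge from the issuer's keypair to this one.
        entry["certificates"].append({
            "issuer_cn": rec["issuer"]["cn"],
            "selfsigned": cert["selfsigned"],
            "pem": rec["urls"]["pem"],
        })
    return list(keypairs.values())


def make_record(cn, issuer_cn, pem):
    """Build a hypothetical flat record with only the fields used above."""
    return {
        "certificate": {"o": "Example Org", "cn": cn, "selfsigned": cn == issuer_cn},
        "issuer": {"o": "Example Org", "cn": issuer_cn},
        "status": "active",
        "type": "root",
        "urls": {"pem": pem},
    }


flat = [
    make_record("Example Root A", "Example Root A", "https://example.invalid/a.pem"),
    make_record("Example Root A", "Example Root B", "https://example.invalid/a-cross.pem"),
    make_record("Example Root B", "Example Root B", "https://example.invalid/b.pem"),
]
for keypair in group_by_keypair(flat):
    print(keypair["cn"], [c["issuer_cn"] for c in keypair["certificates"]])
```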

letsencrypt / website

Feature Request - Machine readable version of /certificates #1162