Domain Discovery / Minimal Certificate Transparency Log solution

bwbroersma commented 1 year ago

Currently crt.sh is unreliable (it regularly returns 500 errors), which means we have to push the lookup to a background task and cannot show the direct impact of adding the CT log subdomains.

Are there solutions to monitor a CT log server and just log the domain names (not all the crypto etc.)?

Cloning a CT log server would be 1+TB, so that's a bit large (reference: https://letsencrypt.org/2019/11/20/how-le-runs-ct-logs.html#database).

Some links for how the CT log API works: https://security.stackexchange.com/a/167373

E.g.:
https://oak.ct.letsencrypt.org/2023/ct/v1/get-sth
https://oak.ct.letsencrypt.org/2023/ct/v1/get-entries?start=1000&end=1014
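For reference, the two endpoints above could be called from Python roughly as follows. This is a minimal sketch using only the standard library; the helper names (`sth_url`, `entries_url`, `fetch_json`) are made up for illustration and there is no error handling or rate limiting.

```python
import json
import urllib.parse
import urllib.request


def sth_url(log_url: str) -> str:
    """URL of the signed tree head (its JSON contains tree_size)."""
    return log_url.rstrip("/") + "/ct/v1/get-sth"


def entries_url(log_url: str, start: int, end: int) -> str:
    """URL for entries start..end (both indexes are inclusive)."""
    query = urllib.parse.urlencode({"start": start, "end": end})
    return log_url.rstrip("/") + "/ct/v1/get-entries?" + query


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """GET a URL and decode the JSON body (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Something like `fetch_json(sth_url(log))["tree_size"]` would then give the current log size, and `fetch_json(entries_url(log, 1000, 1014))["entries"]` the list of entries, possibly fewer than requested if the log caps its batch size.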

Some CT log servers (every log in Google's log list that is not marked as retired):

curl -sSfA '' 'https://www.gstatic.com/ct/log_list/v3/log_list.json' | jq '.operators[].logs[]|select(.state.retired==null)|.url' -r
https://ct.googleapis.com/logs/argon2023/
https://ct.googleapis.com/logs/us1/argon2024/
https://ct.googleapis.com/logs/xenon2023/
https://ct.googleapis.com/logs/eu1/xenon2024/
https://ct.cloudflare.com/logs/nimbus2023/
https://ct.cloudflare.com/logs/nimbus2024/
https://yeti2024.ct.digicert.com/log/
https://yeti2025.ct.digicert.com/log/
https://nessie2023.ct.digicert.com/log/
https://nessie2025.ct.digicert.com/log/
https://sabre.ct.comodo.com/
https://oak.ct.letsencrypt.org/2023/
https://oak.ct.letsencrypt.org/2024h1/
https://oak.ct.letsencrypt.org/2024h2/
https://ct.trustasia.com/log2023/
https://ct2024.trustasia.com/log2024/

E.g. something like:

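# Fetch three aligned batches (curl expands the [from-to:step] range in start=),
# pick the DER certificate out of every entry and print its DNS names, reversed.
# The oversized end= relies on the log capping the response at its batch size.
# CTLOG must be set as a separate statement: with 'CTLOG=... curl "${CTLOG}..."'
# the variable is expanded before the assignment takes effect.
CTLOG='https://oak.ct.letsencrypt.org/2023/'
curl -sSfA '' "${CTLOG}ct/v1/get-entries?start=[256000000-256000512:256]&end=9999999999" \
| jq -r \
'.entries[]'\
'|(.leaf_input[12:16]|@base64d[1:3]) as $t'\
'| if $t=="\u0000\u0000" then .leaf_input[20:]'\
'  elif $t=="\u0000\u0001" then .extra_data[4:]'\
'  else empty end'\
| while read -r cert; \
  do \
    echo "$cert" \
    | base64 -d \
    | openssl x509 -inform DER -noout -ext subjectAltName \
    | sed -rn 's/ *IP Address:[0-9:.]+,? ?//g;s/ *DNS:([^ ,]+),? ?/\1;/g;s/;(.)/\n\1/g;s/;$//gp'\
    | awk -F. '{for(i=NF;i>0;i--)printf "%s.",$i;print""}'\
    | grep -v '^org\.letsencrypt\.testing\.woodpecker\.';\
  done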

I queried from 256000000 to 256256000, so 1000 requests and 256000 entries. This resulted in 541100 domains (13.28MiB/3.32MiB) and 402560 unique entries (9.84MiB/2.33MiB). The main bottleneck was CPU time spent in jq!
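The sed/awk/grep tail of the pipeline above could also be written in Python. This is only a sketch of that post-processing step (it does not parse certificates itself and still expects the text printed by `openssl x509 -ext subjectAltName`); the function names are made up for illustration.

```python
import re

# Monitoring certificates to skip, in reversed form like the grep above.
IGNORED_PREFIXES = ("org.letsencrypt.testing.woodpecker.",)


def dns_names(san_text: str) -> list[str]:
    """Pull the DNS entries out of the subjectAltName text; IP entries are dropped."""
    return re.findall(r"DNS:([^\s,]+)", san_text)


def reverse_labels(name: str) -> str:
    """'www.example.org' -> 'org.example.www.' (same shape as the awk step)."""
    return ".".join(reversed(name.split("."))) + "."


def reversed_unique(san_texts) -> list[str]:
    """Reversed, de-duplicated and sorted names from many SAN outputs."""
    seen = set()
    for text in san_texts:
        for name in dns_names(text):
            rev = reverse_labels(name)
            if not rev.startswith(IGNORED_PREFIXES):
                seen.add(rev)
    return sorted(seen)
```

Sorting the reversed names puts all names under the same parent domain next to each other, which is also what makes the output compress well.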

'It seems to work' for some sample cases, although this should not be used in production*.

Other tools:

- Mostly use other APIs
- Don't properly align the batch requests
- Don't parse pre-certificates
- Use a lot of data/storage

Todo:

[ ] Are there already tools to transform CT logs into some parseable data stream?
- https://github.com/CaliDog/Axeman
- https://github.com/google/certificate-transparency-go/blob/master/scanner/fetcher.go

[x] ~_Find out max records get-entries supports per CT log (certificate-transparency groups 2020 discussion)_~

curl -sSfA '' 'https://www.gstatic.com/ct/log_list/v3/log_list.json' \
| jq '.operators[].logs[]|select(.state.retired==null)|.url' -r \
| while read -r CTLOG; \
  do \
    echo "$CTLOG"|sed -r 's@.*(\.|/)([a-z0-9-]+\.[a-z]+)/.*@\2@g' \
    | tr '\n' '\t'; \
    curl -GsSfA '' "${CTLOG}ct/v1/get-entries" --data-urlencode 'start=0' --data-urlencode "end=$(curl -sSfA '' "${CTLOG}ct/v1/get-sth" | jq .tree_size)" | jq '.entries|length'; \
  done \
| sort \
| uniq \
| sort -k2,2n -k1,1

CT Log	batch size
googleapis.com	32
comodo.com	256
digicert.com	256
letsencrypt.org	256
trustasia.com	256
cloudflare.com	1024

Requests also need to be aligned to the batch size (each request should start on a multiple of it), see: https://community.letsencrypt.org/t/enabling-coerced-get-entries/114436

[ ] Check retry/fail logic
[ ] Find out monitoring certificates to ignore
- ct-woodpecker
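The alignment item in the todo list above comes down to a bit of integer arithmetic. A minimal Python sketch, assuming the batch sizes from the table (they were measured once and may change) and a made-up helper name `aligned_ranges`:

```python
from typing import Iterator, Tuple

# Batch sizes as measured in the table above (may change over time).
BATCH_SIZE = {
    "googleapis.com": 32,
    "comodo.com": 256,
    "digicert.com": 256,
    "letsencrypt.org": 256,
    "trustasia.com": 256,
    "cloudflare.com": 1024,
}


def aligned_ranges(start: int, end: int, batch: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) index pairs covering start..end.

    Every pair stays inside one batch-sized window, so all requests
    after the first one begin on a multiple of `batch`.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    first = start
    while first <= end:
        # Last index of the batch-sized window that contains `first`.
        window_end = (first // batch + 1) * batch - 1
        last = min(window_end, end)
        yield first, last
        first = last + 1
```

For example, entries 1000..1300 with a batch size of 256 are split into 1000..1023, 1024..1279 and 1280..1300, so only the first and last request are partial.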
  
* Note that this is quite hacky code, since jq is not the best tool for handling binary data (chars != bytes, because jq strings are Unicode). The leaf_input is a MerkleTreeLeaf structure:

- byte 0 is the version;
- byte 1 is the MerkleLeafType;
- bytes 2..9 are the timestamp;
- bytes 10..11 are the LogEntryType, which is \x00\x00 for an x509_entry and \x00\x01 for a precert_entry.

For an x509_entry the certificate follows in leaf_input, prefixed by a 3-byte length field. For a precert_entry the leaf_input only holds the issuer key hash and the TBSCertificate, so the script takes the precertificate from extra_data instead, which also starts with a 3-byte length field. These length fields can simply be skipped. Because the offsets are 15 bytes (12 + 3) and 3 bytes, and every 3 bytes map to 4 base64 characters (15×8/6 = 20 and 3×8/6 = 4), we can slice the base64 string directly to skip these bytes. One can also decode first and use dd bs=4096 skip=15 iflag=skip_bytes status=none for the X.509 entries and dd bs=4096 skip=3 iflag=skip_bytes status=none for the precertificates. For debugging: openssl asn1parse -inform der -i.
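The base64 offset trick from this note can be checked with a few lines of Python. The leaf below is synthetic (a zeroed 12-byte header and a fake 6-byte "certificate"), just to show that slicing 20 base64 characters equals skipping 15 bytes.

```python
import base64


def b64_chars_for(n_bytes: int) -> int:
    """Base64 characters that encode exactly n_bytes (whole 3-byte groups only)."""
    if n_bytes % 3:
        raise ValueError("only multiples of 3 bytes map to whole base64 characters")
    return n_bytes * 8 // 6


# Synthetic leaf: 12 header bytes + 3-byte length + fake certificate bytes.
header = bytes(12)
cert = b"\x30\x04\x02\x02\x01\x00"
leaf = header + len(cert).to_bytes(3, "big") + cert
leaf_b64 = base64.b64encode(leaf).decode("ascii")

skip = b64_chars_for(15)                      # header + length field
sliced = base64.b64decode(leaf_b64[skip:])    # decode only the remainder
```

Here `sliced` is identical to `cert`. The trick only works because both offsets (15 and 3) are multiples of 3; any other offset would cut through a base64 character group.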

See https://datatracker.ietf.org/doc/html/rfc6962#section-3.4 for the structure of the Merkle Tree input:

       enum { x509_entry(0), precert_entry(1), (65535) } LogEntryType;

       enum { timestamped_entry(0), (255) }
         MerkleLeafType;

       struct {
           uint64 timestamp;
           LogEntryType entry_type;
           select(entry_type) {
               case x509_entry: ASN.1Cert;
               case precert_entry: PreCert;
           } signed_entry;
           CtExtensions extensions;
       } TimestampedEntry;

       struct {
           Version version;
           MerkleLeafType leaf_type;
           select (leaf_type) {
               case timestamped_entry: TimestampedEntry;
           }
       } MerkleTreeLeaf;
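Outside of jq, the structures above can be decoded properly instead of skipping fixed offsets. A minimal Python sketch, assuming one item of the `entries` array from get-entries as input; the names (`LeafHeader`, `certificate_der`, ...) are illustrative and the CtExtensions after the certificate are ignored.

```python
import base64
import struct
from dataclasses import dataclass
from typing import Optional

X509_ENTRY = 0
PRECERT_ENTRY = 1


@dataclass
class LeafHeader:
    version: int
    leaf_type: int
    timestamp_ms: int
    entry_type: int


def parse_leaf_header(leaf: bytes) -> LeafHeader:
    """Decode the fixed 12-byte prefix: 1 + 1 + 8 + 2 bytes, big-endian."""
    return LeafHeader(*struct.unpack_from(">BBQH", leaf, 0))


def read_opaque24(data: bytes, offset: int) -> bytes:
    """Read a value that is prefixed by a 3-byte big-endian length."""
    length = int.from_bytes(data[offset:offset + 3], "big")
    body = data[offset + 3:offset + 3 + length]
    if len(body) != length:
        raise ValueError("truncated entry")
    return body


def certificate_der(entry: dict) -> Optional[bytes]:
    """Return the DER (pre)certificate of one get-entries item, or None."""
    leaf = base64.b64decode(entry["leaf_input"])
    header = parse_leaf_header(leaf)
    if header.entry_type == X509_ENTRY:
        return read_opaque24(leaf, 12)   # certificate sits in the leaf itself
    if header.entry_type == PRECERT_ENTRY:
        extra = base64.b64decode(entry["extra_data"])
        return read_opaque24(extra, 0)   # precertificate comes first in extra_data
    return None
```

Reading the length field rather than skipping it also gives the exact end of the certificate, so no trailing bytes are handed to the X.509 parser.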

Ideally we would just have a compressed trie / suffix trie data structure keyed on the reversed fully qualified domain name (FQDN), so that it follows the hierarchy of labels in the name.
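To illustrate the idea, here is a small Python sketch of a trie keyed on reversed labels. It is a plain (uncompressed) in-memory trie with one node per label, so it shows the lookup pattern but not the storage savings a compressed structure would give; the class name `DomainTrie` is made up.

```python
class DomainTrie:
    """Trie keyed on DNS labels in reversed order (TLD first)."""

    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored name ends at this node

    def add(self, fqdn: str) -> None:
        node = self
        for label in reversed(fqdn.rstrip(".").lower().split(".")):
            node = node.children.setdefault(label, DomainTrie())
        node.terminal = True

    def under(self, suffix: str) -> list[str]:
        """All stored names equal to or below `suffix`, sorted."""
        labels = suffix.rstrip(".").lower().split(".")
        node = self
        for label in reversed(labels):
            node = node.children.get(label)
            if node is None:
                return []
        found = []
        stack = [(node, labels)]
        while stack:
            current, name = stack.pop()
            if current.terminal:
                found.append(".".join(name))
            for label, child in current.children.items():
                stack.append((child, [label] + name))
        return sorted(found)
```

After adding `www.internet.nl` and `dashboard.internet.nl`, a call like `under("internet.nl")` walks down `nl` and `internet` once and then returns everything stored below that node.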

baknu commented 1 year ago

Just for documentation purposes: scanning CT logs is a huge step forward. However, note that this way the dashboard will not discover:

- subdomains that do not have an A/AAAA record, but that are used for other purposes (e.g. subdomains with an MX record and no A/AAAA record);
- subdomains that do have an A/AAAA record but do not have a certificate (i.e. HTTP-only subdomains);
- subdomains that are covered by a wildcard ("*") certificate.

bwbroersma commented 1 year ago

Thanks @baknu, very true, this mainly benefits the web test, not the mail test. Subdomains with only an MX record, or mail-only domains without a TLS certificate, will not be found.

stitch commented 8 months ago

An issue called 'Limit max domains via certificate transparency' can be merged into this one, as it is something to keep in mind when working on this. Getting 1000 subdomains for each of the 1000 domains in your list is fun, but not supported, and would require several other optimizations. Knowing in advance how many subdomains might be found (or to how many they are limited), or allowing users to cherry-pick subdomains, would make this feature more 'workable'.
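One possible shape for such a limit, as a rough Python sketch: group the names found in CT logs under the domains a user already has in their list, report the total per domain and only return a capped selection. The function name, the return shape and the cap value are all assumptions for illustration, not how the dashboard does it.

```python
from collections import defaultdict


def suggest_subdomains(found, monitored, *, max_per_domain):
    """Group found names under monitored domains and cap each group.

    Returns {domain: (total_found, capped_sorted_list)}.
    """
    # Longest domains first, so a name is assigned to its closest parent.
    by_length = sorted((d.lower().rstrip(".") for d in monitored), key=len, reverse=True)
    groups = defaultdict(set)
    for name in found:
        name = name.lower().rstrip(".")
        for domain in by_length:
            if name.endswith("." + domain):  # strict subdomains only
                groups[domain].add(name)
                break
    return {
        domain: (len(groups[domain]), sorted(groups[domain])[:max_per_domain])
        for domain in by_length
    }
```

Returning the total next to the capped list lets the interface say "showing 100 of 1000" and leave the cherry-picking to the user.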

bwbroersma commented 6 months ago

Extra note about crt.sh: https://crt.sh/atom is pretty nice since it is XML instead of HTML. The only issue is the response code 429 / rate limiting.

bwbroersma commented 3 months ago

New notes about crt.sh: they allow direct PostgreSQL read access to their database¹. The access is:

$ psql -h crt.sh -p 5432 -U guest certwatch

The schema can be found here: https://github.com/crtsh/certwatch_db/. There is also a showSQL=Y query parameter to show the SQL that is executed, e.g. see: https://crt.sh/?q=internet.nl&showSQL=Y&exclude=expired

Some rate limits apply: it's limited to 5 connections per IP and still regularly gives:

ERROR:  canceling statement due to statement timeout

Therefore it's probably best to create a daily dump of newly seen precertificates and leaf certificates (note: see some stats about the crt.sh fill ratio of known certificate serials; because of this, both should be parsed). So one idea would be to have a daily job run the psql command with -t -A -F"," -c "SELECT ...;" to output the data in CSV format, which a separate program can then compress into an efficient structure.
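A daily job along these lines could wrap the psql call in a few lines of Python. This is a sketch only: the query itself is left to the caller (it is not spelled out here), the timeout is an arbitrary choice, and the helper names are made up. The psql flags are the ones mentioned above.

```python
import subprocess


def psql_command(query: str) -> list[str]:
    """argv for a one-off, tuples-only, unaligned, comma-separated query."""
    return [
        "psql", "-h", "crt.sh", "-p", "5432", "-U", "guest",
        "-t", "-A", "-F", ",", "-c", query, "certwatch",
    ]


def parse_rows(output: str) -> list[list[str]]:
    """Split psql -t -A -F',' output into rows of fields.

    psql does not quote fields in this mode, so a comma inside a value
    would split it; select only columns that cannot contain commas.
    """
    return [line.split(",") for line in output.splitlines() if line.strip()]


def run_query(query: str, timeout: float = 600.0) -> list[list[str]]:
    """Run the query against crt.sh (needs network access and psql installed)."""
    done = subprocess.run(psql_command(query), capture_output=True,
                          text=True, timeout=timeout, check=True)
    return parse_rows(done.stdout)
```

Passing the command as a list avoids shell quoting problems with the SQL text.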

¹ it seems to be a hot-standby, because of the errors (see stack overflow):

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.
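Since both errors shown here look transient, a job querying crt.sh would probably want to retry with some backoff. A generic Python sketch, assuming the caller's `action` raises an exception whose message contains the server's error text; the attempt count and delays are arbitrary.

```python
import time

# Error texts quoted above that are worth retrying.
TRANSIENT_MARKERS = (
    "canceling statement due to statement timeout",
    "canceling statement due to conflict with recovery",
)


def is_transient(message: str) -> bool:
    return any(marker in message for marker in TRANSIENT_MARKERS)


def with_retries(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or fails with a non-transient error.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:
            last_try = attempt == attempts - 1
            if last_try or not is_transient(str(error)):
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is only there so the waiting can be replaced in tests; with the limit of 5 connections per IP in mind, retries should not run in parallel.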

Why didn't I know about this? It has been available for ages (at least more than 5 years)... maybe I would have discovered it earlier if I port scanned the hostnames I visit by default ;)

stitch commented 5 days ago

Added a first version to the dashboard. Pending infrastructure changes to get this running on the server.

See: https://github.com/internetstandards/Internet.nl-ct-log-subdomain-suggestions-api

internetstandards / Internet.nl-dashboard

Domain Discovery / Minimal Certificate Transparency Log solution #434