internetstandards / Internet.nl-dashboard

Application that creates a dashboard for scans using the Internet.nl API.
Apache License 2.0

Domain Discovery / Minimal Certificate Transparency Log solution #434

Closed: bwbroersma closed this 5 days ago

bwbroersma commented 1 year ago

Currently crt.sh is too unstable to use (it returns 500 errors), which means we have to push the lookup to the background and cannot show the direct impact of adding the CT log subdomains.

Are there solutions to monitor a CT log server and just log the domain names (not all the crypto etc.)?

Cloning a CT log server would be 1+TB, so that's a bit large (reference: https://letsencrypt.org/2019/11/20/how-le-runs-ct-logs.html#database).

Some links on how the CT log API works: https://security.stackexchange.com/a/167373

E.g.:

https://oak.ct.letsencrypt.org/2023/ct/v1/get-sth
https://oak.ct.letsencrypt.org/2023/ct/v1/get-entries?start=1000&end=1014
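To walk a log with these two endpoints, the tree_size from get-sth has to be cut into get-entries windows. A minimal sketch of that bookkeeping, assuming a batch size of 256 (the window used further down in this issue; other logs may cap differently) and the inclusive `end` index from RFC 6962. The function names are made up for illustration:

```python
from typing import Iterator, Tuple


def entry_ranges(first: int, tree_size: int, batch: int = 256) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) index pairs covering entries first..tree_size-1."""
    for start in range(first, tree_size, batch):
        # get-entries uses an inclusive end index, hence the -1.
        yield start, min(start + batch, tree_size) - 1


def get_entries_url(log_url: str, start: int, end: int) -> str:
    """Build a get-entries URL; log_url is expected to end with a slash."""
    return f"{log_url}ct/v1/get-entries?start={start}&end={end}"


if __name__ == "__main__":
    # tree_size would normally come from the get-sth response.
    for start, end in entry_ranges(0, 600):
        print(get_entries_url("https://oak.ct.letsencrypt.org/2023/", start, end))
```

A log may return fewer entries than requested, so a real client should continue from the last index it actually received rather than trust these windows blindly.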

Some CT log servers:

curl -sSfA '' 'https://www.gstatic.com/ct/log_list/v3/log_list.json' | jq '.operators[].logs[]|select(.state.retired==null)|.url' -r
https://ct.googleapis.com/logs/argon2023/
https://ct.googleapis.com/logs/us1/argon2024/
https://ct.googleapis.com/logs/xenon2023/
https://ct.googleapis.com/logs/eu1/xenon2024/
https://ct.cloudflare.com/logs/nimbus2023/
https://ct.cloudflare.com/logs/nimbus2024/
https://yeti2024.ct.digicert.com/log/
https://yeti2025.ct.digicert.com/log/
https://nessie2023.ct.digicert.com/log/
https://nessie2025.ct.digicert.com/log/
https://sabre.ct.comodo.com/
https://oak.ct.letsencrypt.org/2023/
https://oak.ct.letsencrypt.org/2024h1/
https://oak.ct.letsencrypt.org/2024h2/
https://ct.trustasia.com/log2023/
https://ct2024.trustasia.com/log2024/

E.g. something like:

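# Set CTLOG on its own line: as a prefix assignment ("CTLOG=... curl") the
# shell expands ${CTLOG} in the URL before the assignment takes effect.
CTLOG='https://oak.ct.letsencrypt.org/2023/'
# curl URL globbing: [first-last:step] issues one request per start index.
curl -sSfA '' "${CTLOG}ct/v1/get-entries?start=[256000000-256000512:256]&end=9999999999" \
| jq -r '
  .entries[]
  # bytes 10-11 of the MerkleTreeLeaf hold the LogEntryType
  | (.leaf_input[12:16]|@base64d[1:3]) as $t
  | if   $t=="\u0000\u0000" then .leaf_input[20:]  # x509_entry: skip header and 3-byte length
    elif $t=="\u0000\u0001" then .extra_data[4:]   # precert_entry: precertificate is in extra_data
    else empty end' \
| while read -r cert
  do
    # Print the DNS SANs one per line with reversed labels (www.example.nl -> nl.example.www.)
    echo "$cert" \
    | base64 -d \
    | openssl x509 -inform DER -noout -ext=subjectAltName \
    | sed -rn 's/ *IP Address:[0-9:.]+,? ?//g;s/ *DNS:([^ ,]+),? ?/\1;/g;s/;(.)/\n\1/g;s/;$//gp' \
    | awk -F. '{for(i=NF;i>0;i--)printf "%s.",$i;print""}' \
    | grep -v '^org\.letsencrypt\.testing\.woodpecker\.'
  done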

I queried entries 256000000 to 256256000, so 1000 requests and 256000 entries. This resulted in 541100 domains (13.28MiB/3.32MiB) and 402560 unique entries (9.84MiB/2.33MiB). My main issue was CPU usage in jq!

'It seems to work' for some sample cases, although this should not be used in production*.

Other tools:

Todo:

See https://datatracker.ietf.org/doc/html/rfc6962#section-3.4 Structure of the Merkle Tree input:

       enum { x509_entry(0), precert_entry(1), (65535) } LogEntryType;

       enum { timestamped_entry(0), (255) }
         MerkleLeafType;

       struct {
           uint64 timestamp;
           LogEntryType entry_type;
           select(entry_type) {
               case x509_entry: ASN.1Cert;
               case precert_entry: PreCert;
           } signed_entry;
           CtExtensions extensions;
       } TimestampedEntry;

       struct {
           Version version;
           MerkleLeafType leaf_type;
           select (leaf_type) {
               case timestamped_entry: TimestampedEntry;
           }
       } MerkleTreeLeaf;
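Since jq was the CPU bottleneck, the same header parsing could be done on the raw bytes. A minimal sketch in Python that follows the structs above, assuming the base64 `leaf_input` value from a get-entries response as input (the function name is made up for illustration):

```python
import base64
import struct
from typing import Optional, Tuple

X509_ENTRY = 0
PRECERT_ENTRY = 1


def parse_merkle_tree_leaf(leaf_input_b64: str) -> Tuple[int, int, Optional[bytes]]:
    """Return (timestamp, entry_type, signed_entry body) for a v1 MerkleTreeLeaf."""
    raw = base64.b64decode(leaf_input_b64)
    # Version (1 byte), MerkleLeafType (1), timestamp (8), LogEntryType (2), big-endian.
    version, leaf_type, timestamp, entry_type = struct.unpack(">BBQH", raw[:12])
    if version != 0 or leaf_type != 0:
        raise ValueError("not a v1 timestamped_entry leaf")
    if entry_type == X509_ENTRY:
        offset = 12  # ASN.1Cert: 3-byte length, then the DER certificate
    elif entry_type == PRECERT_ENTRY:
        offset = 12 + 32  # PreCert: 32-byte issuer_key_hash, then the TBSCertificate
    else:
        return timestamp, entry_type, None
    length = int.from_bytes(raw[offset:offset + 3], "big")
    # For precert entries this is only the TBSCertificate; the full
    # precertificate is in extra_data, which is what the shell pipeline reads.
    return timestamp, entry_type, raw[offset + 3:offset + 3 + length]
```

The returned bytes still need an X.509 parser to get at the subjectAltName; that part is left out here.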

Ideally we would just have a compressed / suffix trie data structure keyed on the reversed labels of the fully qualified domain name (FQDN), following the hierarchy of labels in the name.
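A minimal sketch of such a structure, as a plain (uncompressed) trie of nested dicts keyed on reversed labels. The class and method names are made up for illustration, and a real implementation would want a more compact on-disk form:

```python
from typing import List

_END = None  # terminal marker key; can never collide with a label string


class DomainTrie:
    """Trie keyed on reversed DNS labels: www.example.nl -> nl, example, www."""

    def __init__(self) -> None:
        self.root: dict = {}

    @staticmethod
    def _labels(fqdn: str) -> List[str]:
        return list(reversed(fqdn.lower().rstrip(".").split(".")))

    def add(self, fqdn: str) -> None:
        node = self.root
        for label in self._labels(fqdn):
            node = node.setdefault(label, {})
        node[_END] = True

    def subdomains(self, domain: str) -> List[str]:
        """Return every stored name at or below domain, sorted."""
        labels = self._labels(domain)
        node = self.root
        for label in labels:
            node = node.get(label)
            if node is None:
                return []
        found = []
        stack = [(node, labels)]
        while stack:
            current, path = stack.pop()
            for label, child in current.items():
                if label is _END:
                    found.append(".".join(reversed(path)))
                else:
                    stack.append((child, path + [label]))
        return sorted(found)
```

Usage: after `trie.add("www.example.nl")` and `trie.add("mail.example.nl")`, `trie.subdomains("example.nl")` walks only the `nl` → `example` branch, so the cost depends on the number of names under that domain, not on the total size of the log.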

baknu commented 1 year ago

Just for documentation purposes: Scanning CT logs is a huge step forward. However, note that this way the dashboard will not discover:

  1. subdomains that do not have an A/AAAA record, but that are used for other purposes (e.g. subdomains with an MX record and no A/AAAA record);
  2. subdomains that do have an A/AAAA record but do not have a certificate (i.e. http-subdomains);
  3. subdomains that are covered by a wildcard ("*") certificate.

bwbroersma commented 1 year ago

Thanks @baknu, very true, this mainly benefits the web test, not the mail test. Subdomains with only an MX record, or mail-only domains without a TLS certificate, will not be found.

stitch commented 8 months ago

An issue called 'Limit max domains via certificate transparency' can be merged into this issue, as it is something to keep in mind when working on this. Getting 1000 subdomains for 1000 domains in your list is fun, but not supported and requires several other optimizations. So knowing in advance how many subdomains might be found, limiting them, or allowing users to cherry-pick subdomains would make this feature more 'workable'.
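A minimal sketch of such a limit, assuming the CT lookup yields a mapping from each listed domain to the subdomains found for it. The function name and the default of 25 are placeholders, not something the dashboard defines:

```python
from typing import Dict, List, Tuple


def cap_suggestions(
    found: Dict[str, List[str]], max_per_domain: int = 25
) -> Dict[str, Tuple[List[str], int]]:
    """Return, per domain, at most max_per_domain unique subdomains plus the total found."""
    capped = {}
    for domain, subdomains in found.items():
        unique = sorted(set(subdomains))
        # Keep the total so the UI can say "showing 25 of 1000".
        capped[domain] = (unique[:max_per_domain], len(unique))
    return capped
```

Returning the total next to the capped list lets the user see in advance how many subdomains exist before deciding to add or cherry-pick them.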

bwbroersma commented 6 months ago

Extra note about crt.sh: https://crt.sh/atom is pretty nice since it is XML instead of HTML. The only issue is the response code 429 / rate limiting.

bwbroersma commented 3 months ago

New notes about crt.sh: they allow direct PostgreSQL read access to their database¹. The access is:

$ psql -h crt.sh -p 5432 -U guest certwatch

The schema can be found here: https://github.com/crtsh/certwatch_db/. There is also a showSQL=Y query parameter to show the SQL executed, e.g. see: https://crt.sh/?q=internet.nl&showSQL=Y&exclude=expired

Some rate limits apply: it's limited to 5 connections per IP and still regularly gives:

ERROR:  canceling statement due to statement timeout

Therefore it's probably best to create a daily dump of newly seen precertificates and leaf certificates (note: see some stats about the crt.sh fill ratio of known certificate serials; because of this, both should be parsed). So maybe an idea would be to have a daily job execute the psql command with -t -A -F"," -c "SELECT ...;" to output the data in CSV format, which a separate program can then compress into an efficient structure.
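One possible "efficient structure" for that second step is front coding over sorted, label-reversed names. A minimal sketch, assuming the dump has already been reduced to one DNS name per line (the query itself is left open above, so the input format is an assumption, and the function names are made up for illustration):

```python
from typing import Iterable, List, Tuple


def reverse_labels(fqdn: str) -> str:
    """www.example.nl -> nl.example.www, so names sharing a parent sort together."""
    return ".".join(reversed(fqdn.strip().lower().rstrip(".").split(".")))


def front_code(names: Iterable[str]) -> List[Tuple[int, str]]:
    """Encode unique reversed names as (shared prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for name in sorted({reverse_labels(n) for n in names if n.strip()}):
        shared = 0
        limit = min(len(previous), len(name))
        while shared < limit and previous[shared] == name[shared]:
            shared += 1
        encoded.append((shared, name[shared:]))
        previous = name
    return encoded


def front_decode(encoded: Iterable[Tuple[int, str]]) -> List[str]:
    """Inverse of front_code: rebuild the sorted reversed names."""
    names = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        names.append(previous)
    return names
```

Because the reversed names share long prefixes, most entries shrink to a small integer plus a short suffix, and a general-purpose compressor can be applied on top.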


¹ It seems to be a hot standby, judging by these errors (see Stack Overflow):

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

Why did I not know about this? It has been available like forever (at least more than 5 years). Maybe I would have discovered it earlier if I port scanned the hostnames I visit by default ;)

stitch commented 5 days ago

Added a first version to the dashboard. Pending infrastructure changes to get this running on the server.

See: https://github.com/internetstandards/Internet.nl-ct-log-subdomain-suggestions-api