smallstep / certificates

🛡️ A private certificate authority (X.509 & SSH) & ACME server for secure automated certificate management, so you can use TLS everywhere & SSO for SSH.
https://smallstep.com/certificates
Apache License 2.0

Database cleanup procedure for ACME data #473

Open hlobit opened 3 years ago

hlobit commented 3 years ago

What would you like to be added

Wouldn't it be a great addition if there were a documented procedure or a CLI command to clean up old ACME data from the DB?

I would be happy for any pointers on how you would deal with DB data growth.

Why this is needed

We have been using step-ca in production for more than a year now, and our disk holds a Badger database with ~40 GB of data. When using ACME challenges over the long run the database grows, and I found no relevant information about what could be done to get rid of old data (say, all data for certificates issued more than 3 months ago). I found the link to the export gist in the documentation as a starting point, but why would I stop the server and write Go code for such a task?

I could migrate the DB from Badger to the MySQL backend for easier maintenance, but even in that case, it seems to me there is no easy way to migrate without writing Go code...

maraino commented 3 years ago

Hi @hlobit, I don't think there's currently an alternative to doing it yourself. But I can tell you that we've started to change how ACME stores its data: you will be able to switch to a PostgreSQL or MySQL DB, using actual columns rather than just blobs of data, and you will be able to do DELETE FROM .... I can't tell you for sure if there will be some kind of migration tool, but I think we should build one.

An alternative right now would be to start from scratch. The main problems that you can find are:

I can't think of anything else, @dopey @mmalone: Is there any other problem that I'm forgetting?

dopey commented 3 years ago

@maraino You hit all the points that I wanted to suggest.

Mainly that starting from scratch is not a bad option if you're not heavily relying on passive revocation. In the future we should have better tooling around this, but for now we kind of have to kick the can down the road.

jodygilbert commented 3 years ago

Am I doing something wrong? I've started fresh a couple of times, and each time client renewal fails because the existing accounts don't exist. I fix this by deleting the account config on the clients; we only have a few, so it's manageable, but it's not something I'd like to keep doing as more clients are added.

hlobit commented 3 years ago

@maraino, @dopey thanks for your answers. I see there are improvements in the pipeline for this; it will be nice to use a Postgres DB in the near future to deal with ACME data.

> Am I doing something wrong? I've started from fresh a couple of times, each time the client renewal fails as the existing accounts don't exist. I fix this by deleting the account config on the clients, we only have a few so it's manageable, but it's not something I'd like to do as more clients as added.

Exactly my case. I tried to drop the database, but I had to restore a backup because renewals were failing with the error urn:ietf:params:acme:error:accountDoesNotExist. Unfortunately I cannot really afford to go through all the servers and delete the account config.

Perhaps one option I have for now is to:

  1. stop step-ca
  2. extract the accounts from badger database with the script
  3. delete the DB
  4. start step-ca - it will create a new badger database
  5. stop step-ca
  6. add the accounts from the extract back into it (this is the step for which I still have to write code)
  7. start step-ca

Does that sound good? Is there something I should know before trying such a thing?

dopey commented 3 years ago

Hey @hlobit, while that process may work, hold on before you go into it. I think it may be easier to just go in and delete the tables you don't need. Let me see if I can put something quick and dirty together and send it out. I'll test it locally to verify it doesn't kill my system.

I forgot that most ACME users will be reusing the same account (or a few accounts). Our own CLI's ACME client generates a new ACME account for each order; wasteful, but it hasn't caused an issue for our use case. Because of this, I forget that most users reuse accounts when they use the more popular clients.

dopey commented 3 years ago

// Quick-and-dirty cleanup: delete the step-ca tables that only hold
// regenerable state, leaving the ACME account tables intact.
package main

import (
    "fmt"
    "os"

    "github.com/smallstep/nosql"
)

var tables = []string{
    "used_ott",
    "revoked_x509_certs",
    "x509_certs",
    //"acme_accounts",
    //"acme_keyID_accountID_index",
    "acme_authzs",
    "acme_challenges",
    "nonces",
    "acme_orders",
    "acme_account_orders_index",
    "acme_certs",
}

func main() {
    db, err := nosql.New("badgerV2", os.Args[1])
    if err != nil {
        panic(err)
    }
    defer db.Close()

    for _, table := range tables {
        fmt.Printf("Deleting table %s ... ", table)
        if err = db.DeleteTable([]byte(table)); err != nil {
            panic(err)
        }
        fmt.Printf("DONE\n")
    }
}

I've run this locally and confirmed that afterwards I'm able to create ACME certs using the same account (the same as I had been using before running the script). I tested with certbot and acme.sh clients.

My go.mod file for reference:

module github.com/smallstep/analyzedb

go 1.15

require github.com/smallstep/nosql v0.3.6
hlobit commented 3 years ago

Hi @dopey, thanks for this one, I was just trying on our side.

This is the output I get on first run:

$ go run main.go badger /var/lib/step-ca/db # NOTE: added db type as first argument
badger 2021/02/25 13:38:09 INFO: All 2 tables opened in 1.826s
badger 2021/02/25 13:38:09 INFO: Replaying file id: 36 at offset: 115580105
badger 2021/02/25 13:38:09 INFO: Replay took: 2.831µs
badger 2021/02/25 13:38:09 DEBUG: Value log discard stats empty
Deleting table used_ott ... DONE
Deleting table revoked_x509_certs ... DONE
Deleting table x509_certs ... DONE
Deleting table acme_authzs ... badger 2021/02/25 13:38:10 DEBUG: Storing value log head: {Fid:36 Len:48 Offset:124766054}
badger 2021/02/25 13:38:10 INFO: Got compaction priority: {level:0 score:1.73 dropPrefixes:[]}
badger 2021/02/25 13:38:10 INFO: Running for level: 0
badger 2021/02/25 13:38:11 DEBUG: LOG Compact. Added 788646 keys. Skipped 1383475 keys. Iteration took: 674.4041ms
badger 2021/02/25 13:38:12 DEBUG: Discard stats: map[20:6115498 31:5093738 12:6946515 36:2760233 27:5288313 8:8183012 0:33758043 17:6640883 7:8505598 26:5472171 2:13484581 24:5592385 25:5480210 34:4893918 3:11670059 21:6032593 9:7771137 16:6878233 10:7369091 6:8821142 5:9363227 14:6997917 30:5154220 33:5050093 1:17398098 19:6246176 15:6868270 28:5398966 22:5844049 32:5079518 23:5727909 18:6494691 4:10372913 29:5218942 11:7130651 13:6849187 35:6428219]
badger 2021/02/25 13:38:12 INFO: LOG Compact 0->1, del 3 tables, add 1 tables, took 1.797434034s
badger 2021/02/25 13:38:12 INFO: Compaction for level: 0 DONE
badger 2021/02/25 13:38:12 INFO: Force compaction on level 0 done
panic: table acme_authzs does not exist: not found

goroutine 1 [running]:
main.main()
        /root/smallstep-cleanup/main.go:34 +0x2a5
exit status 2

I tried with all the other tables; it looks like all the other acme_* tables and nonces did not exist in our database. Then I was fiddling with the nosql repo, trying to get a list of the tables in the database, with no luck.

Maybe a different version of step-ca?

$ dpkg -l step-certificates
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name              Version      Architecture Description
+++-=================-============-============-=================================
ii  step-certificates 0.13.3       amd64        Smallstep Certificate Authority

OK, I finally managed to get the former list of ACME tables from 0.13.3.

I restarted with the remaining tables:

$ go run main.go badger /var/lib/step-ca/db.bak
badger 2021/02/25 14:19:27 INFO: All 1 tables opened in 325ms
badger 2021/02/25 14:19:27 INFO: Replaying file id: 36 at offset: 124766102
badger 2021/02/25 14:19:27 INFO: Replay took: 2.502µs
badger 2021/02/25 14:19:27 DEBUG: Value log discard stats empty
Deleting table acme-authzs ... DONE
Deleting table acme-challenges ... DONE
Deleting table nonce-table ... DONE
Deleting table acme-orders ... DONE
Deleting table acme-account-orders-index ... DONE
Deleting table acme-certs ... badger 2021/02/25 14:19:34 DEBUG: Flushing memtable, mt.size=67125166 size of flushChan: 0
badger 2021/02/25 14:19:34 DEBUG: Storing value log head: {Fid:36 Len:48 Offset:184597285}
DONE
badger 2021/02/25 14:19:35 DEBUG: Storing value log head: {Fid:36 Len:48 Offset:188783165}
badger 2021/02/25 14:19:36 INFO: Got compaction priority: {level:0 score:1.73 dropPrefixes:[]}
badger 2021/02/25 14:19:36 INFO: Running for level: 0
badger 2021/02/25 14:19:36 DEBUG: LOG Compact. Added 356 keys. Skipped 1576666 keys. Iteration took: 392.750991ms
badger 2021/02/25 14:19:36 DEBUG: Discard stats: map[4:29756929 8:23444705 24:15967058 30:14699701 22:16681690 6:25294544 14:20033863 1:50167797 25:15656813 33:14346321 7:24391716 3:33456461 19:17890392 27:15106872 20:17515141 34:13964930 5:26900990 23:16346451 13:19611197 2:38864074 35:13916825 36:5265953 26:15641569 31:14544522 28:15368901 18:18598615 9:22226782 29:14905923 17:19009101 21:17268389 0:97621498 32:14509760 16:19702017 10:21109069 12:19895864 15:19663978 11:20426744]
badger 2021/02/25 14:19:36 INFO: LOG Compact 0->1, del 3 tables, add 1 tables, took 426.902648ms
badger 2021/02/25 14:19:36 INFO: Compaction for level: 0 DONE
badger 2021/02/25 14:19:36 INFO: Force compaction on level 0 done

Fine! I restarted step-ca and renewals still work; however, it seems that the DB folder size hasn't changed...

$ du -h /var/lib/step-ca/db
37G     /var/lib/step-ca/db
dopey commented 3 years ago

Hey @hlobit, what version of Badger are you running? It looks like it may be v2. If so, you should be able to install the badger command-line tool and try out some operations like flatten / garbage collection. I'm thinking the space you freed just needs to be reclaimed via value-log garbage collection.

hlobit commented 3 years ago

@dopey thanks for the clue.

Yeah, I'm not familiar with Badger. Running something like this multiple times after the tables were deleted did the trick:

package main

import (
        "fmt"
        "os"

        "github.com/dgraph-io/badger"
)

func main() {
        // os.Args[1] is the path to the Badger database directory.
        opts := badger.DefaultOptions(os.Args[1])
        opts.ValueDir = os.Args[1]
        db, err := badger.Open(opts)
        if err != nil {
                panic(err)
        }
        defer db.Close()

again:
        // Rewrite a value-log file when at least 70% of it is stale data;
        // loop until RunValueLogGC reports nothing left to reclaim.
        if err := db.RunValueLogGC(0.7); err == nil {
                fmt.Println("success")
                goto again
        } else {
                fmt.Println(err)
        }
}

Now I have the expected outcome. I'm interested in a way to automate it:

  1. according to the Badger docs, RunValueLogGC can be run while the DB is online, so it seems OK to automate it periodically (note this means from within the process that holds the DB open; Badger takes an exclusive directory lock, so a separate process cannot open the DB while step-ca is running).
  2. I guess deleting tables is not something that can be done without stopping the step-ca service. If this is true, what options are available to trigger a kind of cleanup without stopping the service?
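Regarding scheduling: since Badger only lets one process hold the DB at a time, an external GC job has to briefly stop the CA. As a sketch (unit names, binary path, and DB path are all illustrative; badger-gc is a compiled build of the GC program above), a systemd service plus timer could look like:

```ini
# /etc/systemd/system/step-ca-db-gc.service (illustrative)
[Unit]
Description=Badger value-log GC for the step-ca database

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl stop step-ca
ExecStart=/usr/local/bin/badger-gc /var/lib/step-ca/db
ExecStart=/usr/bin/systemctl start step-ca

# /etc/systemd/system/step-ca-db-gc.timer (illustrative)
[Unit]
Description=Weekly Badger GC for step-ca

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now step-ca-db-gc.timer`; Persistent=true makes a missed run fire at the next boot.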

Thanks for your help :+1:

tashian commented 1 year ago

My hunch is that even for higher-volume CAs, a GC routine that runs periodically and by default would have a negligible impact on load. (And, at the volume where it does have a big impact, the recommendation is simple: Switch to mysql or postgres. We should consider making this our recommendation for high-volume installs anyway.)

However, migrating existing folks to a periodic GC—for CAs that have been running for a while and have a lot of GC backlog—is where trouble could arise, and we’d just have to get the comms right around that and give people a migration process. Maybe we say, shut down the CA and run a program that does full GC in this case.

Ultimately, because GC is an online operation for Badger, and because high-volume CAs have mysql and postgres as an alternative option, I think Badger GC should be on by default and not be a separate URL that they have to hit.

maraino commented 1 year ago

FYI: Badger already provides a CLI to do a GC, backup, restore, etc.

The CA should be stopped before running, for example, a GC, and the tool may need to be the one specific to the version of Badger used.