Closed: simonw closed this issue 2 years ago
There are two database files involved here.

`index.db` is the database of raw OCR data pulled from `s3-ocr index`. This can be generated from scratch by pulling content from the sfms-history bucket, but that takes around 10 minutes as it needs to pull multiple GBs of JSON. The `s3-ocr index` command is smart enough to fetch only new data, so it's OK to run provided the previous `index.db` file is available.
The script generates `sfms.db`, which then gets deployed.
I could cache `index.db` in the GitHub Actions cache, but that expires. I'm going to store it in the sfms-history S3 bucket instead, since that already exists and I already have credentials for it.
Steps to add:

- Fetch `index.db` from S3
- `s3-ocr index sfms-history index.db` to update it with new OCR
- `./build-db.sh`
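The steps above could be sketched as a script along these lines. The `s3-credentials get-object`/`put-object` argument order and the `-o` flag are my assumptions about that tool's CLI; the `run` wrapper just prints each command so the sequence can be reviewed without S3 access (drop the `echo` to execute for real):

```shell
#!/bin/bash
# Dry-run sketch of the build pipeline. run() echoes each command
# instead of executing it; remove the echo to run for real.
run() { echo "+ $*"; }

# 1. Fetch the previous index.db from the sfms-history bucket
run s3-credentials get-object sfms-history index.db -o index.db

# 2. Incrementally update it with any new OCR results
run s3-ocr index sfms-history index.db

# 3. Build sfms.db for deployment
run ./build-db.sh

# 4. Push the updated index.db back for the next run to reuse
run s3-credentials put-object sfms-history index.db index.db
```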
I'll need to add these to `requirements.txt`:

- `sqlite-utils`
- `s3-ocr`
- `s3-credentials` (for the `put-object` and `get-object` commands)

I'm adding two secrets to this repo:

- `S3_ACCESS_KEY`
- `S3_SECRET_KEY`
I created a valid blank SQLite database file with `sqlite-utils create-database index.db` and uploaded that to the root of the sfms-history bucket.
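As an aside, a blank-but-valid SQLite file can also be produced with just the Python standard library, which is handy where `sqlite-utils` isn't installed. The `VACUUM` trick below is my addition, not something from this issue:

```shell
# The issue used: sqlite-utils create-database index.db
# Stdlib-only equivalent: VACUUM forces SQLite to write the header
# page, producing a valid empty database file.
python3 -c "import sqlite3; sqlite3.connect('index.db').execute('vacuum')"
```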
This was tough because of a bug, which has since been fixed.
But... the deploy has gone out now, and `/index.db` in the S3 bucket is an 18.5MB database!
https://sfms-history.vercel.app/docs now just lists the documents that we want to be there.
Currently it just downloads a previously built one; I want to run the `build-dbs.sh` script in GitHub Actions instead.

Current: https://github.com/simonw/sfms-history/blob/547e277e7614a2c7ca4aa244afcf175715fede44/.github/workflows/deploy.yml#L29-L33
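The linked workflow lines could be replaced by a build step along these lines. This is a hypothetical sketch: the step name, and the assumption that `s3-ocr` and `s3-credentials` pick up credentials from the standard AWS environment variables, are mine, not taken from the actual workflow:

```yaml
# Hypothetical replacement for the deploy.yml step that currently
# downloads a pre-built database (names and env wiring are assumptions)
- name: Build databases
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.S3_ACCESS_KEY }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.S3_SECRET_KEY }}
  run: |
    s3-credentials get-object sfms-history index.db -o index.db
    s3-ocr index sfms-history index.db
    ./build-dbs.sh
    s3-credentials put-object sfms-history index.db index.db
```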