simonw closed this 2 years ago
I synced Google Drive to S3 again (using the method I'll use for #15), then ran s3-ocr dedupe sfms-history
and s3-ocr start sfms-history --all
to ensure everything had been OCR'd.
Now running this: https://github.com/simonw/sfms-history/blob/cc4d3675d58cbdf6ae676dc1794dfc9033e73269/build-db.sh#L3-L4
s3-ocr index sfms-history index.db
Will take a while because it needs to suck down all of that JSON.
Still need to exclude stuff like:
INTAKE/SFMShistory_intake_2022.03.13/scans_2022.03.13_membership
New decision: publish everything in INTAKE and PUBLIC, but exclude the PROCESSED INTAKE DOCUMENTS subfolder.
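A rough sketch of how that rule could be applied to S3 keys when building the database. The function name should_publish is hypothetical, and I'm assuming the excluded subfolder can appear at any depth beneath INTAKE or PUBLIC; this is not code from build-db.sh or s3-ocr.

```python
from pathlib import PurePosixPath

# Top-level folders to publish, per the decision above
PUBLISHED_ROOTS = {"INTAKE", "PUBLIC"}
# Subfolder to skip (assumption: it may sit at any depth under a published root)
EXCLUDED_FOLDER = "PROCESSED INTAKE DOCUMENTS"


def should_publish(key: str) -> bool:
    """Return True if an S3 key falls under the publish rule (hypothetical helper)."""
    parts = PurePosixPath(key).parts
    # Only keys under INTAKE/ or PUBLIC/ are candidates
    if not parts or parts[0] not in PUBLISHED_ROOTS:
        return False
    # Skip anything with the excluded folder among its directory components
    return EXCLUDED_FOLDER not in parts[:-1]
```

The check compares whole path components rather than substrings, so a file that merely has the folder name inside its filename is not excluded.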