kamatsuoka / goodtags

a tag app, but good

[Offline search] Add a script to periodically fetch offline DB, host on GitHub Pages #7

Closed: FuegoFro closed this 7 months ago

FuegoFro commented 7 months ago

This is the first phase of modifying goodtags for ~instant offline search. At a high level, the cumulative set of changes looks like this:

  1. Have something that periodically snapshots the canonical barbershoptags.com database into a SQL database suitable for use by the app, and host it somewhere the app can download it (this PR).
  2. Update the app to bundle the latest (at compile time) database and to fetch the latest database on startup, only downloading it if it's actually newer than what the app already has (see the freshness-check sketch after this list).
  3. Update the app to add a parallel code path that directs searches to the database rather than the API.
  4. Once this has been validated, rip out the old code path and rely solely on the local database for searches.
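
For step 2, the check can be as simple as comparing the hosted manifest (described below) against the copy bundled with the app. A rough sketch, written in Python for consistency with this PR's script (the actual app-side code would look different); the manifest shape and hosting URL here are illustrative assumptions, not the real format:

```python
"""Sketch of the step-2 freshness check: only download the remote DB if the
hosted manifest says it's newer than what the app was compiled with."""
import json
import urllib.request

# Hypothetical GH Pages URL; the real one depends on where gh-pages is hosted.
MANIFEST_URL = "https://example.github.io/goodtags-db/manifest.json"


def remote_db_is_newer(local_manifest_path: str) -> bool:
    with open(local_manifest_path) as f:
        local = json.load(f)
    with urllib.request.urlopen(MANIFEST_URL, timeout=10) as resp:
        remote = json.load(resp)
    # Assumed manifest fields: a content hash plus an ISO-8601 UTC timestamp
    # (ISO strings in the same zone compare chronologically as plain strings).
    return (
        remote["sha256"] != local["sha256"]
        and remote["generated_at"] > local["generated_at"]
    )
```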

This PR adds a Python script that regularly (currently once a week) pulls down the entirety of the Barbershop Tags database and puts it into a SQL database suitable for consumption by the app (which, as mentioned, comes in a subsequent PR). By fetching in batches, only weekly, and in the middle of a weeknight (US time), we shouldn't put too much strain on their database, though that makes a number of assumptions, e.g. that their peak load is during the US daytime. Note that I'd considered fetching only what's new since the last run, but because download-count increments mark a row as "new", almost half of the rows end up "new" again after a week anyway, so the added complexity of an incremental fetch didn't seem worth it.
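
Roughly, the fetch loop looks like the sketch below. It assumes the api.php endpoint with `n`/`index` paging parameters; the table schema and field names are illustrative stand-ins rather than the script's actual schema:

```python
"""Sketch of the batched full-database fetch into SQLite."""
import sqlite3
import time
import xml.etree.ElementTree as ET

import requests

API_URL = "https://www.barbershoptags.com/api.php"
BATCH_SIZE = 100  # assumed page size; tune to whatever the API allows


def fetch_all_tags(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags "
        "(id INTEGER PRIMARY KEY, title TEXT, downloaded INTEGER)"
    )
    index = 1  # 1-based offset in this sketch
    while True:
        # Pull one batch of results as XML.
        resp = requests.get(
            API_URL, params={"n": BATCH_SIZE, "index": index}, timeout=30
        )
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        tags = root.findall("tag")
        if not tags:
            break  # past the end of the collection
        for tag in tags:
            conn.execute(
                "INSERT OR REPLACE INTO tags (id, title, downloaded) "
                "VALUES (?, ?, ?)",
                (
                    int(tag.findtext("id")),
                    tag.findtext("Title"),
                    int(tag.findtext("Downloaded") or 0),
                ),
            )
        conn.commit()
        index += len(tags)
        time.sleep(1)  # be gentle with the upstream server between batches


if __name__ == "__main__":
    fetch_all_tags("tags.db")
```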

I do understand that it's a bit hard to verify the actual contents of the database as a reviewer, since this PR completely lacks the code that consumes it, but it's based on my local implementation of points 2 and 3 above. If GitHub had good cross-repo stacked PRs, I'd put up the subsequent PRs at roughly the same time so it'd be easier to see them all; as it is, it's probably easiest to assume the actual SQL contents/schema are reasonable and/or can easily change before they're widely in use.

The script then pushes the SQL database and a manifest (which the app can use to determine whether the remote database is actually newer) to a gh-pages branch, which is then used to host a GitHub Pages, well, page. GH Pages is used instead of raw.githubusercontent.com because it sits behind a CDN, which should give us better robustness and latency, and because it supports `Accept-Encoding: gzip`, which should transparently and meaningfully cut down on transfer sizes, since the DB compresses quite well (albeit only for certain file extensions, hence the semi-hack of appending .otf to the database file name). Another option would be to compress with something like Brotli directly and decompress manually on the client, but that seemed like more work (though it did reduce the compressed size from ~915k to ~600k).
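
For reference, the publish step amounts to something like the following sketch. The manifest fields and output layout are illustrative assumptions (the real manifest may differ); the .otf rename is the extension hack mentioned above:

```python
"""Sketch of staging the DB and manifest for the gh-pages branch."""
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def publish(db_path: str, out_dir: str = "pages") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Rename to a .otf extension purely so GH Pages serves it gzip-encoded.
    db_out = out / "tags.db.otf"
    shutil.copyfile(db_path, db_out)
    manifest = {
        # Assumed fields, matching the freshness-check sketch above.
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(db_out.read_bytes()).hexdigest(),
        "size_bytes": db_out.stat().st_size,
    }
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    # The workflow would then commit `out` to the gh-pages branch, e.g. via
    # plain git commands or an action like peaceiris/actions-gh-pages.


if __name__ == "__main__":
    publish("tags.db")
```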

Additionally, this creates a GitHub Actions workflow that triggers the script once a week (and allows running it manually, for testing).

Example: