pulibrary / dpul-collections

An inspiring environment for global communities to engage with diverse digital collections

Swap in a new full index #104

Open hackartisan opened 1 month ago

hackartisan commented 1 month ago

Sometimes we will need to build a new index from scratch.

Scenario 1: We discover a bug that has caused us to index records that shouldn't be publicly viewable. The records themselves haven't changed, but when we fix the bug we need to create a new clean index.

Scenario 2: We need to migrate to a new major version of Solr, which requires us to create a new clean index.

We'd like to be able to accommodate these scenarios without downtime. Solr provides some tooling for this (e.g. the Solr API, aliases), or we can add configuration to the local application.

Acceptance criteria

First Step

Look at the scripts in pdc_describe - they do a process like this.

Also play around with it.

hackartisan commented 2 days ago

The indexing pipeline needs to know its cache version and the name of the collection it's writing to. We'll use the collection alias for reads. This way we can write to two different collections at once (using their actual names) while allowing reads to switch via a task (which moves the alias via the Solr API).
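A minimal sketch of the alias handling described above, written in Python purely for illustration (the application itself is not Python). It only builds Collections API URLs and parses a response, so nothing here talks to Solr. `CREATEALIAS` and `LISTALIASES` are real Collections API actions; the function names, the alias `dpulc`, the collection names, and the base URL are all hypothetical.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, **params):
    """Build a Solr Collections API URL from a base such as http://host:8983/solr."""
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


def move_alias_url(base_url, alias, collection):
    # CREATEALIAS replaces an existing alias of the same name,
    # so the same call serves for both creating and moving the read alias.
    return collections_api_url(
        base_url, action="CREATEALIAS", name=alias, collections=collection
    )


def list_aliases_url(base_url):
    return collections_api_url(base_url, action="LISTALIASES")


def resolve_alias(list_aliases_response, alias):
    """Return the collection names behind `alias`, given parsed LISTALIASES JSON.

    The response maps each alias to a comma-separated string of collections.
    Returns an empty list when the alias does not exist.
    """
    target = list_aliases_response.get("aliases", {}).get(alias)
    return target.split(",") if target else []
```

Writers would keep using the real collection names (for example `dpulc_v1` and `dpulc_v2`), while readers only ever query the alias.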

So starting a new indexing pipeline with a new cache version will mean updating configuration and deploying. Application code should start a Broadway pipeline for each entry in a list of {cache version, index collection} tuples, or something similar. The first step in starting a pipeline should be to create the collection if it doesn't exist.* If code ever needs to know which collection is active, it can ask Solr to resolve the alias (it will likely need this when checking whether to swap). Code can automate swapping the index by periodically checking whether to swap to the collection fed by the pipeline with the highest cache_version.
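The decision logic in that paragraph could be sketched roughly as follows. This is Python for illustration only, with no Solr or Broadway calls; `Pipeline`, `collections_to_create`, `swap_target`, and the `is_ready` check are hypothetical names. In particular, the comment does not say what condition gates a swap, so `is_ready` is a placeholder for whatever "the new index is caught up" ends up meaning.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class Pipeline(NamedTuple):
    """One configured {cache version, index collection} entry."""
    cache_version: int
    collection: str


def collections_to_create(
    pipelines: Iterable[Pipeline], existing: Iterable[str]
) -> List[str]:
    """Collections that must be created before their pipelines start."""
    existing = set(existing)
    return [p.collection for p in pipelines if p.collection not in existing]


def swap_target(
    pipelines: List[Pipeline],
    active_collection: Optional[str],
    is_ready: Callable[[Pipeline], bool] = lambda p: True,  # assumed readiness check
) -> Optional[str]:
    """Return the collection the alias should move to, or None to leave it alone."""
    newest = max(pipelines, key=lambda p: p.cache_version)
    if newest.collection != active_collection and is_ready(newest):
        return newest.collection
    return None
```

A periodic task would resolve the alias to get `active_collection`, call `swap_target`, and move the alias only when the result is not `None`.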

Cleanup would then be driven by a human: update configs to remove the old pipeline / collection name, then maybe run a task that deletes the old collection (and the database entries for the old pipeline as well?). Code can deduce old collections / cache entries by comparing what exists against the configured cache version values and collection names.
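The "deduce old collections / cache entries" step amounts to a set difference between what exists and what is configured. A small sketch in Python (illustrative only; `stale_resources` and its inputs are hypothetical, and it assumes every collection in the Solr cluster belongs to this application, which may not hold on a shared cluster):

```python
def stale_resources(configured, existing_collections, cached_versions):
    """Find collections and cache versions that no configured pipeline uses.

    configured: iterable of (cache_version, collection) tuples from config.
    existing_collections: collection names reported by Solr.
    cached_versions: cache versions present in the database.
    """
    configured = list(configured)
    keep_versions = {version for version, _ in configured}
    keep_collections = {collection for _, collection in configured}
    return {
        # Assumption: anything not configured is safe to treat as old.
        "collections": sorted(set(existing_collections) - keep_collections),
        "cache_versions": sorted(set(cached_versions) - keep_versions),
    }
```

A cleanup task could print this result for a human to confirm before deleting anything.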

* Maybe even create the alias if it doesn't exist yet, to bootstrap an entirely new environment.