The final step will be the actual migration into this new system: a completely new upload of all datasets into the new named database.
It is probably possible to do an efficient migration, moving from the current public schema into the new schemas, but I think I would prefer to exercise the whole upload system (especially now that upload has changed slightly, to use S3 source files).
This was worked on in issue #358.
To finish this off I am adding an interactive uploader that shows what database credentials are available and dataset sources (local or S3), so that we can do this in a more organized way.
The re-upload is complete and I tested the basics on each dataset. The migration worked. The job count query and the variant broken down by dataset both work.
This issue is about doing a minor refactoring of the database organization and access pattern, to make it easier to retrieve the current "workload" across all datasets (the size of the computation job queues).
Background
Previously, and somewhat recently, we started using separate named databases for each dataset. This made it significantly easier, from the point of view of maintenance command complexity, as well as significantly faster, to drop and replace one dataset without interfering with the other datasets, including active queries involving them.
However, after this change, it was no longer possible to do simple queries over all datasets. That was fine, but now we have one simple such cross-database query that we really want to do: counting up the task queues, for the purpose of informing autoscalers how to allocate resources.
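For illustration, a cross-database count under the current arrangement would need something like a foreign connection per dataset database. This sketch uses the standard `dblink` extension; the database and table names are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS dblink;

-- One foreign connection and one remote query per dataset database.
SELECT jobs
FROM dblink('dbname=dataset_abc', 'SELECT count(*) FROM job_queue')
     AS t(jobs bigint);
-- ... repeated (or generated dynamically) for every other dataset database.
```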
It turns out that PostgreSQL has a feature that addresses this situation well: schemas.
These schemas are a unit of database organization just below the "named database" level (a single database in the standard SQL sense, itself contained in the "cluster", i.e. the postgres process). Schemas are contained in a named database, but act in some ways like logical units separate from one another.

The proposal
The proposal is to make a new schema for each dataset, placing them all in the same named database once again.
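A minimal sketch of the layout, with hypothetical dataset and table names:

```sql
-- One schema per dataset, all within a single named database.
CREATE SCHEMA dataset_abc;
CREATE SCHEMA dataset_xyz;

-- Each dataset's tables are created inside its own schema.
CREATE TABLE dataset_abc.job_queue (
    job_id serial PRIMARY KEY,
    status text
);
```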
`SELECT` queries against all the datasets can be performed entirely in the normal way, as long as schema-qualified names are used for the tables.

We would then place the `study_lookup` table in the `public` schema, available without qualification. We should perhaps change it slightly to reflect this change: `database_name` => `schema_name`.
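The workload count then becomes one ordinary query over one connection. A sketch, assuming hypothetical schema names and a `job_queue` table with a `status` column; in practice the per-schema branches could be generated from `public.study_lookup`:

```sql
-- Which schemas exist, from the always-visible lookup table.
SELECT schema_name FROM public.study_lookup;

-- Count pending jobs across all datasets in a single query.
SELECT sum(pending) AS total_workload
FROM (
    SELECT count(*) AS pending FROM dataset_abc.job_queue WHERE status = 'pending'
    UNION ALL
    SELECT count(*) AS pending FROM dataset_xyz.job_queue WHERE status = 'pending'
) AS per_dataset;
```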
This will require:
I am not 100% sure that the benefits we previously gained -- fast, non-interfering whole-dataset drops and replacements -- will be retained, but I suspect they will be.
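If schemas do retain this property, the drop-and-replace operation would look something like the following sketch (names hypothetical; the locking and non-interference behavior is exactly what would need verifying):

```sql
-- Drop one dataset's entire schema without touching the others.
DROP SCHEMA dataset_abc CASCADE;

-- Re-create a fresh schema of the same name and re-upload into it.
CREATE SCHEMA dataset_abc;
```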
A side benefit of this change will be that cross-cutting queries will be possible again, for multi-dataset statistics.