nadeemlab / SPT

Spatial profiling toolbox for spatial characterization of tumor immune microenvironment in multiplex images
https://oncopathtk.org

PostgreSQL schemas (allow simple all-datasets queries) #358

Closed. jimmymathews closed this issue 1 month ago.

jimmymathews commented 1 month ago

This issue is about doing a minor refactoring of the database organization and access pattern, to make it easier to retrieve the current "workload" across all datasets (the size of the computation job queues).

Background

Previously, and somewhat recently, we started using separate named databases for each dataset. This made it significantly easier, from the point of view of maintenance command complexity, as well as significantly faster, to drop and replace one dataset without interfering with the other datasets, including active queries involving them.

However, after this change, it was no longer possible to do simple queries over all datasets. That was acceptable at the time, but now there is one simple cross-database query that we really want to do: counting up the task queues, for the purpose of informing autoscalers how to allocate resources.

It turns out that PostgreSQL has a feature that addresses this situation well: schemas.

A schema is a unit of database organization one level below the "named database" (a single database in the standard SQL sense), which in turn sits below the "cluster" (the postgres process). Schemas are contained in a named database, but act in some ways like logical units separate from one another.

The proposal

The proposal is to make a new schema for each dataset, placing them all in the same named database once again. SELECT queries against all the datasets can then be performed entirely in the normal way, as long as the schema-qualified names are used for the tables.
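
A minimal sketch of the idea in SQL (the schema and table names here are hypothetical, not SPT's actual ones):

```sql
-- Each dataset gets its own schema inside one named database.
CREATE SCHEMA dataset_1;
CREATE SCHEMA dataset_2;

-- Hypothetical job-queue table, repeated per dataset schema.
CREATE TABLE dataset_1.quantification_job (
    id serial PRIMARY KEY,
    status text NOT NULL
);
CREATE TABLE dataset_2.quantification_job (
    id serial PRIMARY KEY,
    status text NOT NULL
);

-- Ordinary SELECTs work, as long as names are schema-qualified.
SELECT count(*) FROM dataset_1.quantification_job WHERE status = 'queued';
```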

We would then place the study_lookup table in the "public" schema, available without qualification. We should perhaps change it slightly to reflect this, renaming the database_name column to schema_name.
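
For concreteness, something like the following, where the column set is an assumption and only the database_name => schema_name rename is from the proposal:

```sql
-- Hypothetical sketch: the lookup lives in the default "public" schema,
-- mapping each study to the schema that holds its dataset.
CREATE TABLE public.study_lookup (
    study text PRIMARY KEY,
    schema_name text NOT NULL  -- formerly database_name
);
```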

This will require:

I am not 100% sure that the benefits we previously gained -- fast, non-interfering whole-dataset drops and replacements -- will be retained, but I suspect they will be.
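
If they are retained, replacing a whole dataset would reduce to something like the following (hypothetical schema name; CASCADE drops all objects the schema contains):

```sql
DROP SCHEMA dataset_1 CASCADE;  -- remove the dataset and all of its tables
CREATE SCHEMA dataset_1;        -- recreate empty, then re-upload into it
```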

A side benefit of this change is that cross-cutting queries, for multi-dataset statistics, will be possible again.
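
For example, the motivating job-queue count becomes an ordinary single-database query; a sketch, again with hypothetical table and column names:

```sql
-- Per-dataset queue sizes via plain UNION ALL over schema-qualified tables;
-- wrapping this in SELECT sum(...) gives the total workload.
SELECT 'dataset_1' AS dataset, count(*) AS queued
    FROM dataset_1.quantification_job WHERE status = 'queued'
UNION ALL
SELECT 'dataset_2', count(*)
    FROM dataset_2.quantification_job WHERE status = 'queued';
```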

jimmymathews commented 1 month ago

The final step will be the actual migration into this new system: a completely new upload of all datasets into the new named database.

It is probably possible to do an efficient migration, moving tables from the current public schemas into the new dataset schemas, but I think I would prefer to exercise the whole upload system (especially now that upload has changed slightly, to use S3 source files).

jimmymathews commented 1 month ago

This is being worked on in the issue358 branch.

jimmymathews commented 1 month ago

To finish this off I am adding an interactive uploader that shows which database credentials and dataset sources (local or S3) are available, so that we can do this in a more organized way.

jimmymathews commented 1 month ago

The re-upload is complete and I tested the basics on each dataset. The migration worked. The job count query and the variant broken down by dataset both work.