microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/

Derive the postgresql schema more directly from the NMDC schema #302

Open · cmungall opened this issue 2 years ago

cmungall commented 2 years ago

Note: this is currently more of a discussion than a specific piece of work with a short term deliverable. This is something that we will likely have to do in the renewal.

The overall goal here is to have a single source of truth (SOT), to avoid implicit manual translations between different representations (e.g. Mongo and RDBMSs), and to be more agile. For example, we want to be able to add new sample fields to the schema and have everything be driven by metadata about those fields (see the sketch after this list):

  1. data harmonizer
  2. API
  3. portal facets
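
To make that concrete, here is a minimal sketch (not a deliverable) of reading that per-slot metadata with linkml-runtime's SchemaView; the schema path and the use of Biosample here are assumptions for illustration:

```python
# Minimal sketch: every downstream artifact reads the same per-slot
# metadata instead of hard-coding field lists. The schema path and
# class name are assumptions for illustration.
from linkml_runtime.utils.schemaview import SchemaView

sv = SchemaView("src/schema/nmdc.yaml")  # path is an assumption

# Each induced slot carries range, multivalued, description, etc. --
# enough metadata to drive the harmonizer, the API, and portal facets.
for slot in sv.class_induced_slots("Biosample"):
    print(slot.name, slot.range, slot.multivalued, slot.description)
```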

I think we are already there with 1 and with @dwinston's schema for 2.

3 is harder: we may need denormalizations and other divergences from the nmdc schema for the purposes of driving a website (though the premature-optimization dictum should apply). I think there are some principled ways to do this.

There are some cases where we may want to simplify the schema, e.g. not every sample slot needs to be an AttributeValue; some things could be scalars.
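
A hedged sketch of what that simplification could look like, using SQLAlchemy with illustrative column names (nothing here is an agreed-on mapping):

```python
# Sketch only: a slot that is always a plain number could become a
# scalar column, while a slot that needs the full AttributeValue
# structure keeps its extra fields. All names are illustrative.
from sqlalchemy import Column, Float, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Biosample(Base):
    __tablename__ = "biosample"
    id = Column(String, primary_key=True)
    # Simplified: scalar instead of a full QuantityValue object
    depth_meters = Column(Float)
    # AttributeValue-style structure kept where it earns its keep
    # (raw value plus parsed value plus unit)
    temp_raw_value = Column(String)
    temp_numeric_value = Column(Float)
    temp_unit = Column(String)
```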

Beyond this, we can imagine a few mappings for sample slots. For single-valued slots (both options are sketched after this list):

  1. normalized - each AttributeValue object is stored in a separate table, and the sample table has a FK (my preference is natural keys). E.g. sample_env_broad takes an ENVO ID and is a FK to an OntologyTerm table keyed by that ID, following the OntologyTerm class in the nmdc schema
  2. denormalized - each AttributeValue slot is lifted up to the sample table and prefixed, e.g. sample_env_broad_id, sample_env_broad_label
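
A sketch of both options for env_broad_scale, again with SQLAlchemy and illustrative names (the two sample classes are alternatives, not a proposal to keep both):

```python
# Sketch of mappings 1 and 2 for a single-valued slot. Table and
# column names are illustrative only.
from sqlalchemy import Column, ForeignKey, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class OntologyTerm(Base):
    # Mirrors the OntologyTerm class in the nmdc schema, keyed by
    # the natural key (the CURIE itself)
    __tablename__ = "ontology_term"
    id = Column(String, primary_key=True)  # e.g. "ENVO:00000446"
    label = Column(String)

class BiosampleNormalized(Base):
    # Option 1 (normalized): the sample row holds a FK to the term
    __tablename__ = "biosample_normalized"
    id = Column(String, primary_key=True)
    env_broad_scale_id = Column(String, ForeignKey("ontology_term.id"))

class BiosampleDenormalized(Base):
    # Option 2 (denormalized): the AttributeValue fields are lifted
    # onto the sample table and prefixed
    __tablename__ = "biosample_denormalized"
    id = Column(String, primary_key=True)
    env_broad_scale_id = Column(String)
    env_broad_scale_label = Column(String)
```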

All this assumes that we continue with an RDBMS longer term; that is a bigger issue outside the scope of this ticket. cc Shreyas

jeffbaumes commented 2 years ago

I like where this thought process is going as a way to figure out normalization across NMDC tools, thanks @cmungall.

I do think that a performant and intuitive portal could be in conflict with this goal, given the denormalizations and special cases you mention. Translating a complex interlocking schema into the database joins and indices needed to get the portal functioning as it is today was not trivial, so automating that seems fraught with dangers like long-lasting regressions as we spin up a new architecture. But there are likely several concrete steps we can take toward normalization of terms, results, UI labels, etc. For the portal, schema normalization has not been a priority so far in favor of a nicely functioning portal and UI, but that's not to say we can't shift priorities.
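
For a sense of what "correct joins and indices" means in practice, a facet count over the normalized layout sketched above would look something like this (a sketch; table and column names are illustrative, not the portal's actual schema):

```python
# Sketch: the kind of facet query the portal issues constantly --
# count samples per env_broad_scale term. Assumes the normalized
# layout sketched earlier; all names are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///nmdc")  # placeholder DSN

FACET_SQL = text("""
    SELECT t.label, COUNT(*) AS n
    FROM biosample_normalized b
    JOIN ontology_term t ON t.id = b.env_broad_scale_id
    GROUP BY t.label
    ORDER BY n DESC
""")

# Without an index on biosample_normalized.env_broad_scale_id this
# degrades quickly at portal scale.
with engine.connect() as conn:
    for label, n in conn.execute(FACET_SQL):
        print(label, n)
```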

@subdavis and others have at times considered whether something like Elasticsearch could handle faceted search more naturally, though as I understand it, Elasticsearch would probably require significant denormalization as well. It's also unclear how the "millions of tiny objects" case of KEGG terms would function in that environment and whether it would be performant enough. Postgres is likely the technology we can tune the most for performance, as compared to more generic stores like NoSQL or graph databases, though someone may prove me wrong.
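
To illustrate the denormalization point: an Elasticsearch document for a sample would probably have to flatten everything onto one record, roughly like this (purely hypothetical field names and values, not a proposed mapping):

```python
# Purely illustrative denormalized document; every field name and
# value here is made up for the sketch.
sample_doc = {
    "id": "nmdc:bsm-00-000001",  # hypothetical identifier
    "env_broad_scale_id": "ENVO:00000446",
    "env_broad_scale_label": "terrestrial biome",
    "depth_meters": 0.1,
    # The "millions of tiny objects" concern: flattening per-gene KEGG
    # annotations onto sample documents multiplies index size.
    "kegg_terms": ["KEGG.ORTHOLOGY:K00001", "KEGG.ORTHOLOGY:K00002"],
}
```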

Glad to continue the discussion!