Open cmungall opened 2 years ago
I like where this thought process is going to help figure out normalization across NMDC tools, thanks @cmungall.
I do think that a performant and intuitive portal could be in conflict with this goal, as you state with denormalizations and special cases etc. Translating a complex interlocking schema into the correct database joins and indices that were needed to get the portal functioning as it is was not trivial, so making this more automated seems fraught with dangers like long-lasting regressions as we spin up a new architecture. But there are likely several concrete and steps we can take toward normalization of terms, results, UI labels, etc. For the portal, schema normalization has not been given priority as of yet in favor of a nicely functioning portal and UI, but that's not to say we can't shift priorities.
@subdavis and others at times have considered whether something like Elasticsearch could handle faceted search more naturally, though as I understand Elasticsearch would probably require significant denormalization as well. Also it's unclear how the "millions of tiny objects" case of KEGG terms would function in this environment and whether it would be performant enough. Postgres is likely the technology we can tune the most for performance as compared to more generic stores like NoSQL or graph databases, though someone could prove to me otherwise.
Glad to continue the discussion!
Note: this is currently more of a discussion than a specific piece of work with a short term deliverable. This is something that we will likely have to do in the renewal.
The overall goal here is to have a single SOT, avoid implicit manual translations between different representations e.g. mongo and RDBMSs, and to be more agile - for example, we want to be able to add new sample fields to the schema, and have everything be driven by metadata about those fields:
I think we are already there with 1 and with @dwinston's schema for 2.
3 is harder - this is more challenging as we may need denormalizations and other divergences from the nmdc schema for the purposes of driving a website (though the premature optimization dictum should apply). I think there are some principled ways to do this.
There are some cases where we may want to simplify the schema - e.g. not every sample slot needs to be an AttributeValue - some things could be scalars
Other than this we can imagine a few mappings for sample slots. For single-valued:
All this assumes that we continue with an RDBMS longer term - this is a bigger issue outside the scope of this ticket, cc Shreyas