microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Modify `Study` to account for data collection consortia, while reusing infrastructure #1101

Closed turbomam closed 1 year ago

turbomam commented 1 year ago

see also

cc @brynnz22

turbomam commented 1 year ago

For this issue->branch->PR: keep the name Study for the class that has modeled studies and will now also model data collection consortia. That will retain all of the external integration with MongoDB, the API, and the DataPortal.

Add a required slot that takes enumerated values that distinguish between true hypothesis-driven studies and research consortia motivated by data/sample collection.

The addition of that new required slot will require a database migration. Kitware/nmdc-server will have to start interrogating the new slot in order to determine whether to draw a study page or a consortium page, but otherwise the impact to the NMDC ecosystem should be minimal.
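
As a rough sketch of the idea above (not the actual nmdc-schema, migration, or nmdc-server code), the required slot, a migration default, and the page decision might look like this in Python. The slot name `initiative_type` and enum name `InitiativeTypeEnum` are the ones proposed in this thread; the permissible values, the `migrate` and `page_for` helpers, the page names, and the choice to default existing records to hypothesis-driven studies are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class InitiativeTypeEnum(Enum):
    # Hypothetical permissible values; the real ones are still under discussion.
    HYPOTHESIS_DRIVEN_STUDY = "hypothesis_driven_study"
    COLLECTION_CONSORTIUM = "collection_consortium"


@dataclass
class Study:
    id: str
    initiative_type: InitiativeTypeEnum  # required: there is no default value

    def __post_init__(self):
        # Coerce raw strings (as they would be stored) and reject unknown values.
        self.initiative_type = InitiativeTypeEnum(self.initiative_type)


def migrate(doc: dict) -> dict:
    """Assumed migration rule: records without the slot become studies."""
    migrated = dict(doc)
    migrated.setdefault(
        "initiative_type", InitiativeTypeEnum.HYPOTHESIS_DRIVEN_STUDY.value
    )
    return migrated


def page_for(study: Study) -> str:
    """Decide which page a server would draw (page names are made up)."""
    if study.initiative_type is InitiativeTypeEnum.COLLECTION_CONSORTIUM:
        return "consortium_page"
    return "study_page"
```

Because the slot has no default, constructing a `Study` without it fails, which is the reason existing records would need the migration step before they validate.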

Additional migration-free possibilities:

turbomam commented 1 year ago

uses slot initiative_type and enum InitiativeTypeEnum to drive the differentiation.

Let's reassess those names for the right level of specificity and also double check the preferred capitalization for enums and PermissibleValues. I always forget and have set a bad example of inconsistency.

see

We can also check the annotations and mappings for those new elements.

turbomam commented 1 year ago

Regarding naming, do we want to paint ourselves into a corner of never being able to use initiative_type and InitiativeTypeEnum in any other context?

We could name the enum ResearchInitiativeTypeEnum and then allow different enums if initiative_type needs to be used in other classes in the future.

Possible disadvantages? My proposed enum name is longer. Anything else?

turbomam commented 1 year ago

We could give some thought to what the real parent class of Study is, whether that class is currently modeled in the nmdc-schema or not yet.

Is it a process? Or a group of people? Something else?

turbomam commented 1 year ago

What slots are allowed on EnumDefinition? We don't have to use them all!

turbomam commented 1 year ago

I'd like to start micro-crediting, so that people who have contributed will get credit even if they don't make the PR.

I think this will require making an ORCID prefix. https://orcid.org/ ?

@brynnz22 , what's your ORCID? I added two in this branch that look like they may be yours.

caveat

turbomam commented 1 year ago

We should be in the habit of updating not just the valid examples, but also the invalid examples. Each invalid example should illustrate one single deviation from the requirements, and the file name should state that deviation.

As new constraints are put on classes, those shouldn't become additional deviations, but rather one new, well-named invalid data file should be created to illustrate the new constraint.
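
One way to keep that habit honest is a check that every invalid example breaks exactly the one constraint its file name states. The sketch below is only an illustration under assumptions: the `Study-<deviation>.yaml` naming pattern, the toy rules in `violations`, and the in-memory dicts standing in for example files are all hypothetical and are not the repository's actual test setup.

```python
# Hypothetical permissible values for the new slot.
ALLOWED = {"hypothesis_driven_study", "collection_consortium"}


def violations(record: dict) -> list:
    """Return the name of every toy constraint that the record breaks."""
    found = []
    if "id" not in record:
        found.append("missing_id")
    if "initiative_type" not in record:
        found.append("missing_initiative_type")
    elif record["initiative_type"] not in ALLOWED:
        found.append("unknown_initiative_type")
    return found


# File names state the single deviation; dicts stand in for file contents.
INVALID_EXAMPLES = {
    "Study-missing_id.yaml": {"initiative_type": "hypothesis_driven_study"},
    "Study-missing_initiative_type.yaml": {"id": "example-1"},
    "Study-unknown_initiative_type.yaml": {"id": "example-2", "initiative_type": "other"},
}


def check_invalid_examples(examples: dict) -> list:
    """Return the file names that do not show exactly their named deviation."""
    problems = []
    for filename, record in examples.items():
        # "Study-missing_id.yaml" -> "missing_id"
        expected = filename.split("-", 1)[1].rsplit(".", 1)[0]
        if violations(record) != [expected]:
            problems.append(filename)
    return problems
```

When a new constraint is added, an older invalid example that silently starts breaking two rules shows up in the returned list, which signals that it should be repaired and that the new constraint needs its own well-named file.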

turbomam commented 1 year ago

A limitation of this "classifying by enum slot" approach is that it implies that there will only be one axis by which Study-like things can be subClassed. If a Study and a CollectionConsortium are different things, then they really deserve their own classes, which can accommodate differences we discover in the future.

We have violated this principle in the past to some degree, especially with the Biosample class. One could say that was an acceptable short-term solution because creating Biosample subClasses for samples from each environment would have required making many more MongoDB collections.

If we chose to make a CollectionConsortium class as a sibling of Study, that would only involve making one new MongoDB collection. I actually don't like that much either, and that is why I continue to advocate for a collection-less/table-less storage like an RDF triplestore.

We also theoretically have the option of storing both studies and consortia in the one collection, as long as they are given a formal type slot, whose value would be the CURIE of the most specific class that they instantiate. That has been intended all along by giving the type slot the https://linkml.io/linkml-model/latest/docs/designates_type/ decorator, and slot_uri: rdf:type, but I haven't gotten a test case to work yet.
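
To make the intent concrete, here is a plain-Python sketch of one collection holding both kinds of record, with the `type` value selecting the class to instantiate. It does not use LinkML's designates_type machinery and is not the failing test case mentioned above; the CURIE `nmdc:CollectionConsortium`, the `CLASS_BY_CURIE` registry, and the `load` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Study:
    id: str
    type: str = "nmdc:Study"


@dataclass
class CollectionConsortium:
    # Modeled as a sibling of Study, not a subclass (an assumption).
    id: str
    type: str = "nmdc:CollectionConsortium"


# Registry from type CURIE to the most specific class it names.
CLASS_BY_CURIE = {
    "nmdc:Study": Study,
    "nmdc:CollectionConsortium": CollectionConsortium,
}


def load(doc: dict):
    """Instantiate the class named by the document's type slot."""
    cls = CLASS_BY_CURIE[doc["type"]]  # KeyError for an unknown type CURIE
    return cls(**doc)


# Both kinds of record live in the same (simulated) collection.
collection = [
    {"id": "example-1", "type": "nmdc:Study"},
    {"id": "example-2", "type": "nmdc:CollectionConsortium"},
]
objects = [load(doc) for doc in collection]
```

With this shape, adding a third Study-like class means adding one registry entry and no new collection.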

In summary, I would like to take the permissible value annotations from #1097, migrate them to class annotations in #1104, and close this issue and its PR.

If we do decide that an enumeration slot is the best way to differentiate between studies and consortia, I would like to refrain from merging that until we do an audit of other '...type' slots in the schema. It's a mess and I don't want to add to it.