microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Relax certain `required: true` slots in schema #583

Closed sujaypatil96 closed 1 year ago

sujaypatil96 commented 1 year ago

There are a few slots on the Biosample class in the NMDC Schema, most of which are set as required: true in the schema.

This issue requests relaxing the required: true constraint on certain Biosample slots to recommended: true, so that biosample records can be fetched from upstream sources such as the GOLD database.

The following slots are the ones on which we are requesting the relaxation:

  1. env_broad_scale
  2. env_local_scale
  3. env_medium
  4. alt
  5. depth
  6. subsurface_depth

See https://github.com/microbiomedata/sample-annotator/pull/113 for more details.
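To make the difference between the two constraints concrete, here is a minimal sketch in plain Python (not the LinkML validator itself). It assumes a missing required slot is treated as an error that blocks ingest, while a missing recommended slot only produces a warning. The slot lists and the `check_biosample` helper are hypothetical and chosen for illustration; the authoritative constraints live in the schema YAML.

```python
# Hypothetical slot lists for illustration only; the real constraints
# are defined in the NMDC schema, not here.
REQUIRED_SLOTS = ("env_broad_scale", "env_local_scale", "env_medium")
RECOMMENDED_SLOTS = ("depth",)


def check_biosample(record, required=REQUIRED_SLOTS, recommended=RECOMMENDED_SLOTS):
    """Return (errors, warnings) for one biosample record given as a dict."""

    def missing(slot):
        # Treat absent, None, empty-string, and empty-list values as missing.
        return record.get(slot) in (None, "", [])

    errors = [f"missing required slot: {s}" for s in required if missing(s)]
    warnings = [f"missing recommended slot: {s}" for s in recommended if missing(s)]
    return errors, warnings


if __name__ == "__main__":
    # A record as it might arrive from an upstream source, lacking two slots.
    errors, warnings = check_biosample(
        {"env_broad_scale": "some broad-scale term", "env_medium": "some medium term"}
    )
    print(errors)
    print(warnings)
```

Under this reading, moving a slot from the required list to the recommended list lets incomplete upstream records through, at the cost of a warning that someone has to follow up on later.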

mslarae13 commented 1 year ago

alt I agree, elevation is typically used for soil.

the env triad is required for data search

depth, if we're talking about soils should be required. It's a vital slot for data reuse (as is geographic location / lat lon & I'll suggest we make those required)

idk what subsurface depth is?

One thing to consider: as the class Biosample currently sits, depth, while important for soil, sediment, and water, isn't relevant for plants. We will need to think about "what is required for all biosamples" vs certain types. depth is really only required for certain types. All the BioScales plant & (maybe) rhizosphere samples won't have depth.

cmungall commented 1 year ago

Regarding the env triad, the choices in the general case are:

  1. relax schema and defer annotation, and have the samples not discoverable via certain search patterns in the interim
  2. keep schema strict and force annotation prior to ingest

1 adds additional overhead in the need to perform updates using change sheets later.

2 adds some complexity to the ingest, in that we essentially have to merge two curation streams.
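As a rough sketch of what merging two curation streams before ingest could look like under option 2: the function below fills missing triad slots from a supplementary table and then separates records that are complete from those that still need annotation. The `merge_before_ingest` name, the `id` join key, and the record shapes are assumptions made for illustration, not the actual NMDC ingest code.

```python
# The three MIxS environmental triad slots discussed in this thread.
TRIAD = ("env_broad_scale", "env_local_scale", "env_medium")


def merge_before_ingest(upstream, supplement, key="id"):
    """Fill missing triad slots from a supplementary table, then split the
    records into those ready for ingest and those still incomplete.

    `upstream` and `supplement` are lists of dicts sharing a join key
    (hypothetically named "id" here).
    """
    extra = {row[key]: row for row in supplement}
    ready, incomplete = [], []
    for rec in upstream:
        merged = dict(rec)  # copy so the upstream record is left untouched
        for slot in TRIAD:
            if not merged.get(slot):
                value = extra.get(rec[key], {}).get(slot)
                if value:
                    merged[slot] = value
        target = ready if all(merged.get(s) for s in TRIAD) else incomplete
        target.append(merged)
    return ready, incomplete


if __name__ == "__main__":
    upstream = [{"id": "a", "env_broad_scale": "b1"}, {"id": "b"}]
    supplement = [{"id": "a", "env_local_scale": "l1", "env_medium": "m1"}]
    ready, incomplete = merge_before_ingest(upstream, supplement)
    print(len(ready), "ready;", len(incomplete), "still need annotation")
```

With the schema kept strict, only the `ready` records would pass validation; the `incomplete` ones would have to be annotated before they can be loaded.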

Note that in the specific case of BioScales, we need to merge two streams anyway. Here is the spreadsheet that we got from ORNL

https://docs.google.com/spreadsheets/d/1A6bynpzssAUpnDzoAQPZ-8L5HU2y3IuWX7mRiRrasgk/edit#gid=195687079

It includes the triad. It also includes other metadata we need to load.

sujaypatil96 commented 1 year ago

Apologies, some of the slots I mentioned in the above list are not enforced as required: true; it's mostly just the envo triad. I think you said that the envo triad being required is pretty essential, so we won't make any changes to that.

ssarrafan commented 1 year ago

@sujaypatil96 moving to the next sprint but please let me know if you won't be actively working on it for the next few weeks.

sujaypatil96 commented 1 year ago

@ssarrafan we plan to address this at the metadata call today.

sujaypatil96 commented 1 year ago

After a brief discussion, @turbomam and I agree with point 2 from @cmungall: keep schema strict and force annotation prior to ingest.

This approach is only possible in this case because ORNL has provided a supplementary file.

mslarae13 commented 1 year ago

see https://github.com/microbiomedata/nmdc-schema/issues/612

mslarae13 commented 1 year ago

Leave envo required. We should be able to populate these for soil via GOLD's addition of what Stan provided. Then, for other sample types, use the envo -> GOLD mapping to fill out the envo slots.

aclum commented 1 year ago

We may have to revisit leaving all the envo slots as required true. Per Reddy envo terms don't exist for endosphere so env_broad_scale is populated but env_local_scale and env_medium are not for the bioscales endosphere samples. @mslarae13 @cmungall @emileyfadrosh

ssarrafan commented 1 year ago

@sujaypatil96 is this issue still being worked on? I'll move it to the next sprint due to the current activity, but let me know if it can be closed or if it needs to go to the backlog.

aclum commented 1 year ago

So far we've found workarounds for the environmental terms so those are still required for now.

sujaypatil96 commented 1 year ago

GOLD filled in missing values for the MIxS environmental triad for the BioScales project, so we've decided not to relax the schema, but to just leave it as is. So at the moment at least we don't need the changes from the original request of this issue so I think this issue can be closed.