microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Biosample metadata ingest pipelines should produce output compatible with submission portal #193

Open pkalita-lbl opened 1 year ago

pkalita-lbl commented 1 year ago

This is the data format required for an entry in the submission portal:

{
  // Currently this holds a string like "soil", "soil_jgi_mg", "air", etc., which is the name of a
  // class in the submission schema. There is in-progress work in the submission portal
  // that may change this to an array of class names instead.
  "template": string,

  // Information from Study Form screen in the submission portal
  "studyForm": {
    "notes": string,
    "piName": string,
    "piEmail": string,
    "piOrcid": string,
    "studyDate": null, // I believe this is not used anymore
    "studyName": string,
    "description": string,
    // roles should be from https://credit.niso.org/
    "contributors": { "name": string, "orcid": string, "roles": string[] }[],
    "linkOutWebpage": string[]
  },

  // This is the data from the DataHarmonizer view. Currently it is an array-of-arrays 
  // representing the rows and columns of data. There is in-progress work in the 
  // submission portal that *will* change the format of this in order to conform to
  // a class in the submission schema
  "sampleData": ...,

  // The option selected in the Environment Package screen in the submission portal
  "packageName": string,

  // Information from the Multiomics Data screen in the submission portal
  "multiOmicsForm": {
    "JGIStudyId": string,
    "datasetDoi": string,
    "GOLDStudyId": string,
    "studyNumber": string,
    "NCBIBioProjectId": string,
    "alternativeNames": string[],
    "NCBIBioProjectName": string,
    // corresponds to the checkboxes in the form. Valid values are: mg-jgi, mt-jgi, mb-jgi,
    // mp-emsl, mb-emsl, nom-emsl, mg, mt, mp, mb, nom
    "omicsProcessingTypes": string[] 
  }
}

turbomam commented 1 year ago

@jeffbaumes

In order to align the nmdc-schema, the modeling above and SubmissionMetadata, I would like to normalize names so that

see work in progress:

I will make explicit proposals

pkalita-lbl commented 1 year ago

I can't believe I didn't see this before. See also: https://github.com/microbiomedata/nmdc-server/blob/main/nmdc_server/schemas_submission.py

aclum commented 1 year ago

One example Mark gave was the envo triad terms: the nmdc schema doesn't enforce anything there, but the submission portal requires a regex match to ENVO.
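
To make that mismatch concrete, here is a minimal Python sketch of the kind of check a pipeline could run before handing data to the portal. The pattern is an assumption about the general "label [ENVO:id]" shape and is not copied from the submission schema; the helper name `triad_problems` is also made up. The slot names are the MIxS ones (env_broad_scale, env_local_scale, env_medium).

```python
import re

# The three environmental-context slots discussed in this thread.
TRIAD_SLOTS = ("env_broad_scale", "env_local_scale", "env_medium")

# Assumed shape: a label, a space, then a bracketed ENVO CURIE.
# The real portal pattern may be stricter or looser than this.
ENVO_TERM = re.compile(r"^.+ \[ENVO:\d+\]$")


def triad_problems(sample: dict) -> dict:
    """Hypothetical helper: map each failing slot to its offending value.

    A slot fails if it is missing, not a string, or does not match ENVO_TERM.
    """
    problems = {}
    for slot in TRIAD_SLOTS:
        value = sample.get(slot)
        if not isinstance(value, str) or not ENVO_TERM.match(value):
            problems[slot] = value
    return problems
```

A value that passes the nmdc schema but would be rejected under this assumed pattern, such as a bare `ENVO:00000011` with no label, shows up in the returned dict.
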

turbomam commented 1 year ago

Apologies for hair-splitting: in NMDC we have informally called env_broad_scale, env_local_scale and env_medium the "MIxS environmental triad". That's not formalized anywhere, but it sure is easier than saying "env_broad_scale, env_local_scale and env_medium".

We should probably formalize it as a slot_group or subset

I do not believe that MIxS or ENVO have any term to denote these three slots

Several of us have used phrases like 'envo triad terms' from time to time. We should stop doing that, especially since we are starting to anticipate the use of non-ENVO terms in these slots, for example terms from the Plant Ontology (PO).

turbomam commented 1 year ago

See