microbiomedata / issues

public repo for issues related to NMDC work

Continue checking schema examples. Strive for consistency between global definitions and slot_usages. #250

Open turbomam opened 1 year ago

turbomam commented 1 year ago

MIxS is notorious for including example values that don't match its own loosely defined Value syntax. Since the nmdc-schema actually enforces validation, it is our responsibility to check the examples.

@brynnz22 found that MIxS's source_mat_id example of 'MPI012345' doesn't match our submission-schema. In this case, 'MPI012345' is provided as an example for the global definition of source_mat_id, but not for the slot as used by DhMultiviewCommonColumnsMixin. It is the DhMultiviewCommonColumnsMixin usage that really matters for any validation rules, but it is confusing to have these two different examples.

I think I wrote sheets_and_friends such that it applies modifications in slot usages only. We can use yq to modify global definitions. But ideally we would only have one way of making the modifications.

Global definition:

source_mat_id:
  name: source_mat_id
  annotations:
    expected_value:
      tag: expected_value
      value: 'for cultures of microorganisms: identifiers for two culture collections; for other material a unique arbitrary identifer'
  description: A unique identifier assigned to a material sample (as defined by http://rs.tdwg.org/dwc/terms/materialSampleID, and as opposed to a particular digital record of a material sample) used for extracting nucleic acids, and subsequent sequencing. The identifier can refer either to the original material collected or to any derived sub-samples. The INSDC qualifiers /specimen_voucher, /bio_material, or /culture_collection may or may not share the same value as the source_mat_id field. For instance, the /specimen_voucher qualifier and source_mat_id may both contain 'UAM:Herps:14' , referring to both the specimen voucher and sampled tissue with the same identifier. However, the /culture_collection qualifier may refer to a value from an initial culture (e.g. ATCC:11775) while source_mat_id would refer to an identifier from some derived culture from which the nucleic acids were extracted (e.g. xatc123 or ark:/2154/R2).
  title: source material identifiers
  examples:
    - value: MPI012345
  from_schema: https://example.com/nmdc_submission_schema
  aliases:
    - source material identifiers
  is_a: nucleic acid sequence source field
  string_serialization: '{text}'
  slot_uri: MIXS:0000026
  multivalued: false
  range: string

As used in class DhMultiviewCommonColumnsMixin:

source_mat_id:
  name: source_mat_id
  description: A globally unique identifier assigned to the biological sample.
  title: source material identifier
  todos:
    - Currently, the comments say to use UUIDs. However, if we implement assigning NMDC identifiers with the minter we dont need to require a GUID. It can be an optional field to fill out only if they already have a resolvable ID.
  notes:
    - The source material IS the Globally Unique ID
  comments:
    - Identifiers must be prefixed. Possible FAIR prefixes are IGSNs (http://www.geosamples.org/getigsn), NCBI biosample accession numbers, ARK identifiers (https://arks.org/). These IDs enable linking to derived analytes and subsamples. If you have not assigned FAIR identifiers to your samples, you can generate UUIDs (https://www.uuidgenerator.net/).
  examples:
    - value: IGSN:AU1243
    - value: UUID:24f1467a-40f4-11ed-b878-0242ac120002
  from_schema: https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/src/schema/nmdc
  rank: 2
  is_a: nucleic acid sequence source field
  string_serialization: '{text}:{text}'
  slot_uri: MIXS:0000026
  multivalued: false
  owner: Biosample
  domain_of:
    - Biosample
  slot_group: sample_id_section
  range: string
  pattern: '[^\:\n\r]+\:[^\:\n\r]+'
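
To make the mismatch concrete, here is a minimal sketch that runs the three example values quoted above against the pattern from the DhMultiviewCommonColumnsMixin usage. It uses only Python's standard `re` module. The helper name `matches_pattern` and the choice of an unanchored search are assumptions for illustration, not necessarily how the schema validator applies the pattern.

```python
import re

# Pattern copied verbatim from the DhMultiviewCommonColumnsMixin slot_usage above.
SOURCE_MAT_ID_PATTERN = re.compile(r'[^\:\n\r]+\:[^\:\n\r]+')


def matches_pattern(value: str) -> bool:
    """Return True if the value contains a 'prefix:local' match.

    Assumption: the pattern is applied as an unanchored search.
    A validator that anchors the pattern could be stricter.
    """
    return SOURCE_MAT_ID_PATTERN.search(value) is not None


# Example values as quoted in this issue, keyed by where they are declared.
examples = {
    "global definition": ["MPI012345"],
    "slot_usage": [
        "IGSN:AU1243",
        "UUID:24f1467a-40f4-11ed-b878-0242ac120002",
    ],
}

results = {
    (origin, value): matches_pattern(value)
    for origin, values in examples.items()
    for value in values
}

for (origin, value), ok in results.items():
    print(f"{origin}: {value!r} -> {'ok' if ok else 'fails pattern'}")
```

Under these assumptions only the global-definition example fails, because it has no colon-separated prefix.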

mslarae13 commented 5 months ago

@bmeluch and @aclum have found some too and started to discuss. I think this is an nmdc-schema issue, not a submission-schema issue.

Edit: My opinion has changed. We need examples for both nmdc-schema and submission-schema. An example in submission-schema should also exist in nmdc-schema, and vice versa, to show where validation gets stricter and to confirm that the stricter submission-schema still satisfies nmdc-schema.
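
One way to picture the cross-schema check described above is to classify each example by which schema's pattern accepts it. This is a rough sketch only. `STRICT_PATTERN` is the submission-schema pattern quoted earlier in this issue, while `LOOSE_PATTERN` is a hypothetical placeholder standing in for whatever nmdc-schema actually enforces for this slot. The function name `classify` is also made up for illustration.

```python
import re

# Hypothetical placeholder for the looser nmdc-schema rule (NOT taken from nmdc-schema).
LOOSE_PATTERN = r'^\S+$'
# Pattern quoted in this issue for the submission-schema slot_usage.
STRICT_PATTERN = r'[^\:\n\r]+\:[^\:\n\r]+'


def classify(value: str, loose: str = LOOSE_PATTERN, strict: str = STRICT_PATTERN) -> str:
    """Report which of the two patterns accept the value."""
    in_loose = re.search(loose, value) is not None
    in_strict = re.search(strict, value) is not None
    if in_loose and in_strict:
        return "valid in both"
    if in_loose:
        # Shows where submission-schema validation is stricter.
        return "valid in nmdc-schema only"
    if in_strict:
        # Would mean the stricter schema accepts something the base schema rejects.
        return "valid in submission-schema only"
    return "invalid in both"


for example in ["MPI012345", "IGSN:AU1243"]:
    print(f"{example!r}: {classify(example)}")
```

An example that lands in "valid in submission-schema only" would be the case worth flagging, since it would mean the submission-schema no longer satisfies the base schema.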