microbiomedata / submission-schema

https://microbiomedata.github.io/submission-schema/
MIT License
1 stars 1 forks source link

176 review srcdata paths and remove everything besides valid and invalid if possible #180

Open turbomam opened 5 months ago

turbomam commented 5 months ago
turbomam commented 5 months ago

.../SampleData-jgi_mg_data-colon-dna_sample_name.yaml

_in src/data/unexpectedpass/

jgi_mg_data:
  - analysis_type:
      - metagenomics
      - metatranscriptomics
    dna_concentration: 1.23
    dna_cont_type: plate
    dna_cont_well: C3
    dna_container_id: xxx
    dna_dnase: "no"
    dna_isolate_meth: xxx
    dna_project_contact: xxx
    dna_samp_id: xxx
    dna_sample_format: DNAStable
    dna_sample_name: DNA:0546789 # contains colon and should not be allowed 'a-z, A-Z, 0-9, - and _ only'
    dna_seq_project: xxx
    dna_seq_project_name: xxx
    dna_seq_project_pi: xxx
    dna_volume: 0
    proposal_dna: xxx
    samp_name: xxx

The slot definition in src/nmdc_submission_schema/schema/nmdc_submission_schema.yaml mentions the legal characters for dna_sample_name, but doesn't actually define a pattern

      dna_sample_name:
        name: dna_sample_name
        description: Give the DNA sample a name that is meaningful to you. Sample names must be unique across all JGI projects and contain a-z, A-Z, 0-9, - and _ only.
        title: DNA sample name
        examples:
          - value: JGI_pond_041618
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 4
        string_serialization: '{text}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metagenomics_section
        range: string
        required: true
        recommended: true
        multivalued: false
turbomam commented 5 months ago

.../SampleData-jgi_mt_data-illegal-rna_seq_project.yaml

same logic for

jgi_mt_data:
  - analysis_type:
      - metatranscriptomics
    dnase_rna: "no"
    proposal_rna: '504000'
    rna_concentration: 100.3
    rna_cont_type: plate
    rna_cont_well: H7
    rna_container_id: cat_ht_1999
    rna_isolate_meth: phenol/cloroform extraction
    rna_project_contact: Leslie Ann Levine
    rna_samp_id: '123456'
    rna_sample_format: MDA reaction buffer
    rna_sample_name: JGI_lagoon_14343
    rna_seq_project: An RNA Sequencing Project # Values will be prefilled by NMDC. Appears to accept any string.
    rna_seq_project_name: JGI Lagoon metatranscritpomics
    rna_seq_project_pi: Patty Smith
    rna_volume: 25.1
    samp_name: sample name
    source_mat_id: MPI:012345

But the schema doesn't actually prevent users from entering data in any of those slots. I believe there may be some separate issue about "locking" some slots/columns, but the schema doesn't know about that.

      rna_seq_project:
        name: rna_seq_project
        title: RNA seq project ID
        comments:
          - Do not edit these values. A template will be provided by NMDC in which these values have been pre-filled.
        examples:
          - value: '1191234'
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 1
        string_serialization: '{text}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metatranscriptomics_section
        range: string
        required: true
        recommended: true
        multivalued: false
turbomam commented 5 months ago

@bmeluch this build on something we were discussion recently

.../SampleData-jgi_mt_data-too-small-value-rna_volume.yaml

jgi_mt_data:
  analysis_type:
    - metatranscriptomics
  dnase_rna: "no"
  proposal_rna: '504000'
  rna_concentration: 100.3
  rna_cont_type: plate
  rna_cont_well: H7
  rna_container_id: cat_ht_1999
  rna_isolate_meth: phenol/cloroform extraction
  rna_project_contact: Leslie Ann Levine
  rna_samp_id: '123456'
  rna_sample_format: MDA reaction buffer
  rna_sample_name: JGI_lagoon_14343
  rna_seq_project: '123456789'
  rna_seq_project_name: JGI Lagoon metatranscritpomics
  rna_seq_project_pi: Patty Smith
  rna_volume: 24 # According to documention: Values <25 only permitted by special permission.
  samp_name: sample name
  source_mat_id: MPI:012345

the schema says:

      rna_volume:
        name: rna_volume
        title: RNA volume in ul
        comments:
          - Units must be in uL. Enter the numerical part only. Value must be 0-1000. This form accepts values < 25, but JGI may refuse to process them unless permission has been granted by a project manager
        examples:
          - value: '25'
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 6
        string_serialization: '{float}'
        owner: Biosample
        domain_of:
          - Biosample
        slot_group: jgi_metatranscriptomics_section
        range: float
        required: true
        recommended: true
        minimum_value: 0
        maximum_value: 1000

24 isn't below the minimum_value so is valid. This example data file could be moved to the valid directory or deleted.

turbomam commented 5 months ago

.../SampleData-soil-data-illegal-env_package.yaml

soil_data:
  - elev: 123
    analysis_type:
      - metagenomics
      - metatranscriptomics
    collection_date: "2022-01-15"
    depth: "0 - 1"
    env_broad_scale: "______urban biome [ENVO:01000249]"
    env_local_scale: "________tunnel [ENVO:00000068]"
    env_medium: "bare soil [ENVO:01001616]"
    geo_loc_name: "USA: Nowhere, Oklahoma"
    growth_facil: greenhouse
    lat_lon: 35.211445 -98.464612
    samp_name: b
    samp_store_temp: "-80 Celsius"
    store_cond: frozen

env_package is defined as a slot in nmdc-schema, but not in the submission schema, and this file doesn't assert one.

It looks like most Biosample records in MongoDB don't have an env_package value. Those that do use the has_raw_value of a TextValue to capture an EnvO CURIe.

I recommend that we delete this file.

turbomam commented 5 months ago

.../SampleData-soil-data-long-decimal-lat_lon.yaml

sediment_data:
  - analysis_type:
      - metagenomics
      - metatranscriptomics
    collection_date: "2022-01-15"
    depth: "0-.1"
    elev: 0
    env_broad_scale: "______urban biome [ENVO:01000249]"
    env_local_scale: "________tunnel [ENVO:00000068]"
    env_medium: "bare soil [ENVO:01001616]"
    geo_loc_name: "USA: Nowhere, Oklahoma"
    lat_lon: 35.2114459 -98.4646126787809678678
    samp_name: a
    samp_store_temp: "-80 Celsius"

I treid to assert a pattern that would limit the number of digits after the decimal point but maybe I got something wrong.

      lat_lon:
        name: lat_lon
        description: The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system
        title: geographic location (latitude and longitude)
        notes:
          - This is currently a required field but it's not clear if this should be required for human hosts
        from_schema: https://w3id.org/nmdc/nmdc
        rank: 5
        is_a: environment field
        string_serialization: '{lat lon}'
        slot_uri: MIXS:0000009
        multivalued: false
        owner: Biosample
        domain_of:
          - FieldResearchSite
          - Biosample
        slot_group: mixs_modified_section
        range: string
        required: true
        pattern: ^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?)\s[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$

regexr agrees that the lat_lon value matches the pattern. Back to the drawing board.

mslarae13 commented 2 months ago

decreasing the number of examples that pass even though they should be invalid

Have less directories in https://github.com/microbiomedata/submission-schema/tree/main/src/data

@turbomam will check out this branch and determine if it can be merged

mslarae13 commented 1 month ago

@turbomam did you get a chance to evaluate this before the freeze?

@bmeluch was going to look at example files. If not, maybe she can help?

ssarrafan commented 2 weeks ago

@bmeluch @turbomam can you please respond to @mslarae13 questions from 2 weeks ago on this issue? Thanks! I'll move this to the current sprint.

ssarrafan commented 1 week ago

Removing this one from the sprint. No updates in the last month.

turbomam commented 3 days ago

I'm doing the cleanup now