Open turbomam opened 5 months ago
SampleData-jgi_mg_data-colon-dna_sample_name.yaml
_in src/data/unexpectedpass/
jgi_mg_data:
- analysis_type:
- metagenomics
- metatranscriptomics
dna_concentration: 1.23
dna_cont_type: plate
dna_cont_well: C3
dna_container_id: xxx
dna_dnase: "no"
dna_isolate_meth: xxx
dna_project_contact: xxx
dna_samp_id: xxx
dna_sample_format: DNAStable
dna_sample_name: DNA:0546789 # contains colon and should not be allowed 'a-z, A-Z, 0-9, - and _ only'
dna_seq_project: xxx
dna_seq_project_name: xxx
dna_seq_project_pi: xxx
dna_volume: 0
proposal_dna: xxx
samp_name: xxx
The slot definition in src/nmdc_submission_schema/schema/nmdc_submission_schema.yaml
mentions the legal characters for dna_sample_name
, but doesn't actually define a pattern
dna_sample_name:
name: dna_sample_name
description: Give the DNA sample a name that is meaningful to you. Sample names must be unique across all JGI projects and contain a-z, A-Z, 0-9, - and _ only.
title: DNA sample name
examples:
- value: JGI_pond_041618
from_schema: https://w3id.org/nmdc/nmdc
rank: 4
string_serialization: '{text}'
owner: Biosample
domain_of:
- Biosample
slot_group: jgi_metagenomics_section
range: string
required: true
recommended: true
multivalued: false
SampleData-jgi_mt_data-illegal-rna_seq_project.yaml
same logic for
jgi_mt_data:
- analysis_type:
- metatranscriptomics
dnase_rna: "no"
proposal_rna: '504000'
rna_concentration: 100.3
rna_cont_type: plate
rna_cont_well: H7
rna_container_id: cat_ht_1999
rna_isolate_meth: phenol/cloroform extraction
rna_project_contact: Leslie Ann Levine
rna_samp_id: '123456'
rna_sample_format: MDA reaction buffer
rna_sample_name: JGI_lagoon_14343
rna_seq_project: An RNA Sequencing Project # Values will be prefilled by NMDC. Appears to accept any string.
rna_seq_project_name: JGI Lagoon metatranscritpomics
rna_seq_project_pi: Patty Smith
rna_volume: 25.1
samp_name: sample name
source_mat_id: MPI:012345
But the schema doesn't actually prevent users from entering data in any of those slots. I believe there may be some separate issue about "locking" some slots/columns, but the schema doesn't know about that.
rna_seq_project:
name: rna_seq_project
title: RNA seq project ID
comments:
- Do not edit these values. A template will be provided by NMDC in which these values have been pre-filled.
examples:
- value: '1191234'
from_schema: https://w3id.org/nmdc/nmdc
rank: 1
string_serialization: '{text}'
owner: Biosample
domain_of:
- Biosample
slot_group: jgi_metatranscriptomics_section
range: string
required: true
recommended: true
multivalued: false
@bmeluch this build on something we were discussion recently
/SampleData-jgi_mt_data-too-small-value-rna_volume.yaml
jgi_mt_data:
analysis_type:
- metatranscriptomics
dnase_rna: "no"
proposal_rna: '504000'
rna_concentration: 100.3
rna_cont_type: plate
rna_cont_well: H7
rna_container_id: cat_ht_1999
rna_isolate_meth: phenol/cloroform extraction
rna_project_contact: Leslie Ann Levine
rna_samp_id: '123456'
rna_sample_format: MDA reaction buffer
rna_sample_name: JGI_lagoon_14343
rna_seq_project: '123456789'
rna_seq_project_name: JGI Lagoon metatranscritpomics
rna_seq_project_pi: Patty Smith
rna_volume: 24 # According to documention: Values <25 only permitted by special permission.
samp_name: sample name
source_mat_id: MPI:012345
the schema says:
rna_volume:
name: rna_volume
title: RNA volume in ul
comments:
- Units must be in uL. Enter the numerical part only. Value must be 0-1000. This form accepts values < 25, but JGI may refuse to process them unless permission has been granted by a project manager
examples:
- value: '25'
from_schema: https://w3id.org/nmdc/nmdc
rank: 6
string_serialization: '{float}'
owner: Biosample
domain_of:
- Biosample
slot_group: jgi_metatranscriptomics_section
range: float
required: true
recommended: true
minimum_value: 0
maximum_value: 1000
24 isn't below the minimum_value
so is valid. This example data file could be moved to the valid directory or deleted.
SampleData-soil-data-illegal-env_package.yaml
soil_data:
- elev: 123
analysis_type:
- metagenomics
- metatranscriptomics
collection_date: "2022-01-15"
depth: "0 - 1"
env_broad_scale: "______urban biome [ENVO:01000249]"
env_local_scale: "________tunnel [ENVO:00000068]"
env_medium: "bare soil [ENVO:01001616]"
geo_loc_name: "USA: Nowhere, Oklahoma"
growth_facil: greenhouse
lat_lon: 35.211445 -98.464612
samp_name: b
samp_store_temp: "-80 Celsius"
store_cond: frozen
env_package
is defined as a slot in nmdc-schema, but not in the submission schema, and this file doesn't assert one.
It looks like most Biosample records in MongoDB don't have an env_package
value. Those that do use the has_raw_value
of a TextValue
to capture an EnvO CURIe.
I recommend that we delete this file.
SampleData-soil-data-long-decimal-lat_lon.yaml
sediment_data:
- analysis_type:
- metagenomics
- metatranscriptomics
collection_date: "2022-01-15"
depth: "0-.1"
elev: 0
env_broad_scale: "______urban biome [ENVO:01000249]"
env_local_scale: "________tunnel [ENVO:00000068]"
env_medium: "bare soil [ENVO:01001616]"
geo_loc_name: "USA: Nowhere, Oklahoma"
lat_lon: 35.2114459 -98.4646126787809678678
samp_name: a
samp_store_temp: "-80 Celsius"
I treid to assert a pattern that would limit the number of digits after the decimal point but maybe I got something wrong.
lat_lon:
name: lat_lon
description: The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system
title: geographic location (latitude and longitude)
notes:
- This is currently a required field but it's not clear if this should be required for human hosts
from_schema: https://w3id.org/nmdc/nmdc
rank: 5
is_a: environment field
string_serialization: '{lat lon}'
slot_uri: MIXS:0000009
multivalued: false
owner: Biosample
domain_of:
- FieldResearchSite
- Biosample
slot_group: mixs_modified_section
range: string
required: true
pattern: ^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?)\s[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$
regexr agrees that the lat_lon
value matches the pattern. Back to the drawing board.
decreasing the number of examples that pass even though they should be invalid
Have less directories in https://github.com/microbiomedata/submission-schema/tree/main/src/data
@turbomam will check out this branch and determine if it can be merged
@turbomam did you get a chance to evaluate this before the freeze?
@bmeluch was going to look at example files. If not, maybe she can help?
@bmeluch @turbomam can you please respond to @mslarae13 questions from 2 weeks ago on this issue? Thanks! I'll move this to the current sprint.
Removing this one from the sprint. No updates in the last month.
I'm doing the cleanup now
src/data/examples
data/unexpected_pass