Open turbomam opened 8 months ago
prepare for rebuild of src/schema/mixs.yaml
with make mixs-yaml-clean
then make all
MAM 2024-04-15: below, I think I changed the source file or URL column values in assets/import_mixs_slots_regardless.tsv to https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml
host_family_relation
as an attribute of MimsSoil ancestors or as a slot definition in the schema
salinity_meth
as an attribute of MimsSoil ancestors or as a slot definition in the schema
samp_collec_device
as an attribute of MimsSoil ancestors or as a slot definition in the schema
samp_collect_device
samp_collec_method
as an attribute of MimsSoil ancestors or as a slot definition in the schema
samp_collect_method
soil_text_measure
as an attribute of MimsSoil ancestors or as a slot definition in the schema
[soil_texture](https://genomicsstandardsconsortium.github.io/mixs/0000335/)
???ValueError: Conflicting URIs (https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/src/schema/mixs.yaml, https://w3id.org/nmdc/core) for item: sequencing
a subset in both mixs.yaml and core.yaml
not in use by nmdc-schema from core.yaml, so commented out
same thing for
comment out from nmdc.yaml. ignore assets/old_python/reconsititute_mixs.py
same with
environment field
investigation field
nucleic acid sequence source field
sequencing field
ValueError: File "nmdc.yaml", line 639, col 9 Class "Biosample" - unknown slot: "has numeric value" ???
has raw value
has unit
new enum names:
oxy_stat_samp:
  range: oxy_stat_samp_enum
ValueError: File "nmdc.yaml", line 1105, col 16 slot: Biosample_oxy_stat_samp - unrecognized range (oxy_stat_samp_enum)
oxy_stat_samp:
  description: Oxygenation status of sample
  title: oxygenation status of sample
  examples:
    - value: aerobic
  from_schema: https://w3id.org/mixs
  keywords:
    - oxygen
    - sample
    - status
  slot_uri: MIXS:0000753
  range: OXY_STAT_SAMP_ENUM
I'm sacrificing the ability to assign "new" MIxS slots to classes in nmdc.yaml without defining them. Before this branch, the build process would detect assigned and undefined MIxS slots (in Biosample and OmicsProcessing) and add them to our mixs.yaml. But it was only able to retrieve slots from a single MIxS class, like MimsSoil.
Moving forward, if an undefined slot is assigned to a class, the build will fail with an error message like ???
In many cases, that can be read as a sign that the slot and one of its MIxS classes should be added to assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv
This has resulted in a much shorter project.Makefile
project.Makefile uses yq to convert flat MIxS slot ranges into structured nmdc-schema ranges like QuantityValue.
In this branch, I have removed all range changes for slots that haven't been used in MongoDB yet. Files in the src/data path whose names contain the substring iosample-exhasutive illustrate what ranges were expected in the removed yq slot range updates
For 2024-04-15 discussion with @sujaypatil96 about current import process
Checked out main, fetched, pulled
poetry update
make squeaky-clean all test
No MIxS cleanup is performed by make squeaky-clean
make mixs-yaml-clean
rm -rf src/schema/mixs.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml
make --dry-run src/schema/mixs.yaml
Depends on
assets/other_mixs_yaml_files/mixs_template.yaml
assets/import_mixs_slots_regardless.tsv
Biosample and OmicsProcessing classes for which MIxS slots they used
sheets_and_friends
#rm -rf local/mixs_regen/mixs_subset_modified.yaml # triggers complete regeneration
rm -rf local/mixs_regen/mixs_subset.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak
mkdir -p local/mixs_regen
touch local/mixs_regen/.gitkeep
poetry run do_shuttle \
  --recipient_model assets/other_mixs_yaml_files/mixs_template.yaml \
  --config_tsv assets/import_mixs_slots_regardless.tsv \
  --yaml_output local/mixs_regen/mixs_subset.yaml
# switching to TextValue may not add any value. the other range changes do improve the structure of the data.
# ironically changing back to strings for the submission-schema, data harmonizer, submission portal etc.
# may switch source of truth to the MIxS 6.2.2 release candidate
sed 's/quantity value/QuantityValue/' local/mixs_regen/mixs_subset.yaml > local/mixs_regen/mixs_subset_modified.yaml
sed -i.bak 's/range: string/range: TextValue/' local/mixs_regen/mixs_subset_modified.yaml
sed -i.bak 's/range: text value/range: TextValue/' local/mixs_regen/mixs_subset_modified.yaml
grep "^'" assets/yq-for-mixs_subset_modified.txt | while IFS= read -r line ; do echo $line ; eval yq -i $line local/mixs_regen/mixs_subset_modified.yaml ; done
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak
# inject re-structured cur_land_use_enum
# using '| cat > ' because yq doesn't seem to like redirecting out to a file
yq eval-all \
  'select(fileIndex==1).enums.cur_land_use_enum = select(fileIndex==0).enums.cur_land_use_enum | select(fileIndex==1)' \
  assets/other_mixs_yaml_files/cur_land_use_enum.yaml local/mixs_regen/mixs_subset_modified.yaml | cat > local/mixs_regen/mixs_subset_modified_inj_land_use.yaml
mv local/mixs_regen/mixs_subset_modified_inj_land_use.yaml src/schema/mixs.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak
Note the use of sed and yq for converting the string-oriented MIxS into something more object-oriented for nmdc-schema
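As a rough illustration of the sed part only, the sketch below applies the same three substitutions from the recipe above to a tiny made-up YAML fragment. The sample slots, the temporary directory, and the variable names are assumptions for illustration; only the sed expressions come from the recipe.

```shell
# Sketch only: the sample slots below are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/mixs_subset.yaml" <<'EOF'
slots:
  depth:
    range: quantity value
  samp_name:
    range: string
  env_medium:
    range: text value
EOF

# Same three substitutions as the recipe, chained in a single sed call.
sed -e 's/quantity value/QuantityValue/' \
    -e 's/range: string/range: TextValue/' \
    -e 's/range: text value/range: TextValue/' \
    "$workdir/mixs_subset.yaml" > "$workdir/mixs_subset_modified.yaml"

# Collect the rewritten range lines (indentation stripped) for inspection.
ranges=$(grep 'range:' "$workdir/mixs_subset_modified.yaml" | sed 's/^ *//')
echo "$ranges"
```

Chaining the expressions with -e avoids the .bak files that sed -i.bak leaves behind, at the cost of differing slightly from the recipe as written.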
Most of the yq instructions come from this text file: assets/yq-for-mixs_subset_modified.txt
That could have been done with modifications_and_validation from sheets_and_friends
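The way the recipe picks instructions out of that text file can be sketched without yq installed. The file contents below are invented placeholders (the first expression is hypothetical), and the loop only prints what it would run instead of calling yq.

```shell
workdir=$(mktemp -d)
# Invented stand-in for the instruction file; expressions are placeholders.
cat > "$workdir/instructions.txt" <<'EOF'
# lines that do not start with a single quote are ignored
'.slots.depth.range = "QuantityValue"'
'del(.slots[].required)'
EOF

# Keep only lines that begin with a single quote, as the recipe does.
grep "^'" "$workdir/instructions.txt" > "$workdir/selected.txt"

selected=0
while IFS= read -r line; do
  # The real recipe runs: eval yq -i $line <file>; here we only print.
  echo "would run: yq -i $line"
  selected=$((selected + 1))
done < "$workdir/selected.txt"
echo "$selected"
```

Reading from a file rather than a pipe keeps the counter in the current shell, which is only needed here because the sketch reports how many lines were selected.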
I would like to change all of the TextValue ranges back to string. In fact, I would like to remove the TextValue class from the schema. That will require notifying all schema users and working with @eecavanna and @brynnz22
As of this comment, the nmdc-schema includes 494 MIxS slots
yq e '.slots | keys' src/schema/mixs.yaml | wc -l
Five of those are grouping slots which aren't associated with any class
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX MIXS: <https://w3id.org/mixs/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
*
where {
{
values ?t {
owl:DatatypeProperty
owl:ObjectProperty
}
graph nmdc:nmdc-no-use-native-uris {
?p rdf:type ?t .
filter(strstarts(str(?p), "https://w3id.org/mixs/"))
}
optional {
?p rdfs:label ?l
}
}
minus {
graph nmdc:nmdc_relation_graph {
?s ?p ?o .
filter(strstarts(str(?p), "https://w3id.org/mixs/"))
}
}
}
row | t | p | l |
---|---|---|---|
1 | owl:DatatypeProperty | MIXS:core_field | "core field" |
2 | owl:DatatypeProperty | MIXS:environment_field | "environment field" |
3 | owl:DatatypeProperty | MIXS:nucleic_acid_sequence_source_field | "nucleic acid sequence source field" |
4 | owl:DatatypeProperty | MIXS:investigation_field | "investigation field" |
5 | owl:DatatypeProperty | MIXS:sequencing_field | "sequencing field" |
Only 73 have been used in MongoDB as of 2024-04-11
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
#?st ?p ?l ?ot (count(?s) as ?count)
?p ?l (count(?s) as ?count)
where {
graph <https://api.microbiomedata.org> {
?s ?p ?o .
optional {
?s a ?st
}
optional {
?o a ?ot
}
minus {
?s a ?o
}
}
optional {
?p rdfs:label ?l
}
filter(strstarts(str(?p), "https://w3id.org/mixs/"))
}
group by ?p ?l
order by ?l
row | p | l | count |
---|---|---|---|
1 | MIXS:0000122 | abs_air_humidity | 192 |
2 | MIXS:0000427 | ammonium | 45 |
3 | MIXS:0000142 | avg_temp | 192 |
4 | MIXS:0000432 | calcium | 109 |
5 | MIXS:0000310 | carb_nitro_ratio | 1190 |
6 | MIXS:0000751 | chem_administration | 90 |
7 | MIXS:0000429 | chloride | 6 |
8 | MIXS:0000177 | chlorophyll | 47 |
9 | MIXS:0000011 | collection_date | 7673 |
10 | MIXS:0000692 | conduc | 180 |
11 | MIXS:0000312 | cur_vegetation | 124 |
12 | MIXS:0000018 | depth | 6011 |
13 | MIXS:0000434 | diss_inorg_carb | 25 |
14 | MIXS:0000698 | diss_inorg_nitro | 45 |
15 | MIXS:0000139 | diss_iron | 6 |
16 | MIXS:0000433 | diss_org_carb | 25 |
17 | MIXS:0000119 | diss_oxygen | 435 |
18 | MIXS:0000093 | elev | 6635 |
19 | MIXS:0000012 | env_broad_scale | 8158 |
20 | MIXS:0000013 | env_local_scale | 8158 |
21 | MIXS:0000014 | env_medium | 8158 |
22 | MIXS:0000008 | experimental_factor | 61 |
23 | MIXS:0000556 | fertilizer_regm | 192 |
24 | MIXS:0000010 | geo_loc_name | 8126 |
25 | MIXS:0000875 | gravidity | 61 |
26 | MIXS:0001043 | growth_facil | 328 |
27 | MIXS:0000255 | host_age | 61 |
28 | MIXS:0000866 | host_body_habitat | 61 |
29 | MIXS:0000888 | host_body_product | 61 |
30 | MIXS:0000867 | host_body_site | 61 |
31 | MIXS:0000248 | host_common_name | 253 |
32 | MIXS:0000869 | host_diet | 61 |
33 | MIXS:0000257 | host_dry_mass | 192 |
34 | MIXS:0000365 | host_genotype | 61 |
35 | MIXS:0000264 | host_height | 192 |
36 | MIXS:0000251 | host_life_stage | 61 |
37 | MIXS:0000811 | host_sex | 61 |
38 | MIXS:0000250 | host_taxid | 401 |
39 | MIXS:0000100 | humidity | 192 |
40 | MIXS:0000009 | lat_lon | 8158 |
41 | MIXS:0000431 | magnesium | 109 |
42 | MIXS:0000339 | micro_biomass_meth | 134 |
43 | MIXS:0000425 | nitrate | 18 |
44 | MIXS:0000504 | nitro | 919 |
45 | MIXS:0000508 | org_carb | 1407 |
46 | MIXS:0000754 | perturbation | 61 |
47 | MIXS:0001001 | ph | 4539 |
48 | MIXS:0000725 | photon_flux | 192 |
49 | MIXS:0000430 | potassium | 109 |
50 | MIXS:0000183 | salinity | 19 |
51 | MIXS:0000002 | samp_collec_device | 4322 |
52 | MIXS:0001225 | samp_collec_method | 130 |
53 | MIXS:0001107 | samp_name | 2386 |
54 | MIXS:0000001 | samp_size | 629 |
55 | MIXS:0000110 | samp_store_temp | 328 |
56 | MIXS:0001320 | samp_taxon_id | 1443 |
57 | MIXS:0000322 | sieving | 133 |
58 | MIXS:0000428 | sodium | 6 |
59 | MIXS:0001082 | soil_horizon | 4682 |
60 | MIXS:0000112 | solar_irradiance | 192 |
61 | MIXS:0000738 | soluble_react_phosp | 45 |
62 | MIXS:0000026 | source_mat_id | 253 |
63 | MIXS:0000327 | store_cond | 136 |
64 | MIXS:0000423 | sulfate | 6 |
65 | MIXS:0000113 | temp | 4900 |
66 | MIXS:0000525 | tot_carb | 192 |
67 | MIXS:0000102 | tot_nitro | 103 |
68 | MIXS:0000530 | tot_nitro_content | 1432 |
69 | MIXS:0000533 | tot_org_carb | 32 |
70 | MIXS:0000117 | tot_phosp | 172 |
71 | MIXS:0000185 | water_content | 4331 |
72 | MIXS:0000757 | wind_direction | 192 |
73 | MIXS:0000118 | wind_speed | 192 |
wget https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml
yq e '.slots | keys' mixs.yaml | sed 's/^- //' | sort > mixs.6.2.slots.txt
Where used-73-mixs-slots.csv is the output of the MIxS slot usage SPARQL above
awk -F',' 'NR>1 {print $2}' used-73-mixs-slots.csv | sort > used-73-mixs-slot-names.txt
comm -23 used-73-mixs-slot-names.txt mixs.6.2.slots.txt
samp_collec_device
samp_collec_method
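A small worked example of the comm -23 comparison, using invented, abbreviated slot lists rather than the real files. comm -23 suppresses lines unique to the second file and lines common to both, so what remains are names that appear only in the first (used) list.

```shell
workdir=$(mktemp -d)
# Hypothetical inputs; both files must be sorted the same way for comm.
printf '%s\n' depth ph samp_collec_device | LC_ALL=C sort > "$workdir/used.txt"
printf '%s\n' depth ph samp_collect_device | LC_ALL=C sort > "$workdir/mixs.txt"

# Names in used.txt that are missing from mixs.txt.
only_in_used=$(LC_ALL=C comm -23 "$workdir/used.txt" "$workdir/mixs.txt")
echo "$only_in_used"
```

In this toy case the only leftover is samp_collec_device, mirroring the spelling difference (collec vs. collect) reported above.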
resuming work in this issue/branch
in preparation for a MIxS environmental triad parent (or grouping?) slot, following @aclum's example for gold_path_field, which is used as a parent slot, i.e. there are other slots that assert is_a: gold_path_field
Just reran make mixs-yaml-clean squeaky-clean all test
ValueError: File "nmdc.yaml", line 483, col 9 Class "Biosample" - unknown slot: "host_disease_stat"
host_disease_stat
from src/schema/nmdc.yaml
and assets/import_mixs_slots_regardless.tsv
I started fixing up some instantiation unit tests, but I think they have been deleted from other branches because they don't test anything that src/data files can't test in combination with the linkml-run-examples step in the examples/output target
so I skipped one with @unittest.skip
removed two of @eecavanna's doc tests in nmdc_schema/nmdc_data.py about the expected size differential between the asserted and materialized schemas
I would hope to understand this better and then put them back in
was adding seq_meth annotations to OmicsProcessing examples even though I don't think we have used them before. After I got past that, I got an error message like this. I don't expect most of those slots to be required for Biosamples. I think I need to reassess whether anything should ever be considered required from MIxS outside of a slot usage.
[ERROR] 'abs_air_humidity' is a required property in /biosample_set/0
[ERROR] 'add_recov_method' is a required property in /biosample_set/0
[ERROR] 'api' is a required property in /biosample_set/0
[ERROR] 'basin' is a required property in /biosample_set/0
[ERROR] 'build_occup_type' is a required property in /biosample_set/0
[ERROR] 'building_setting' is a required property in /biosample_set/0
[ERROR] 'collection_date' is a required property in /biosample_set/0
[ERROR] 'filter_type' is a required property in /biosample_set/0
[ERROR] 'hc_produced' is a required property in /biosample_set/0
[ERROR] 'hcr' is a required property in /biosample_set/0
[ERROR] 'heat_cool_type' is a required property in /biosample_set/0
[ERROR] 'indoor_space' is a required property in /biosample_set/0
[ERROR] 'iwf' is a required property in /biosample_set/0
[ERROR] 'light_type' is a required property in /biosample_set/0
[ERROR] 'occup_density_samp' is a required property in /biosample_set/0
[ERROR] 'occup_samp' is a required property in /biosample_set/0
[ERROR] 'rel_air_humidity' is a required property in /biosample_set/0
[ERROR] 'samp_collect_point' is a required property in /biosample_set/0
[ERROR] 'samp_taxon_id' is a required property in /biosample_set/0
[ERROR] 'samp_type' is a required property in /biosample_set/0
[ERROR] 'seq_meth' is a required property in /biosample_set/0
[ERROR] 'space_typ_state' is a required property in /biosample_set/0
[ERROR] 'typ_occup_density' is a required property in /biosample_set/0
[ERROR] 'water_cut' is a required property in /biosample_set/0
ValueError: Example src/data/valid/Database-nmdc-example.yaml failed validation:
There are 483 slots in src/schema/mixs.yaml now, as measured by searching for slot_uri
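Counting by slot_uri can be sketched with grep alone. The file below is a tiny invented stand-in for src/schema/mixs.yaml, and the approach assumes exactly one slot_uri line per slot definition, so any slot lacking a slot_uri would not be counted.

```shell
workdir=$(mktemp -d)
# Tiny stand-in for src/schema/mixs.yaml; the real file is much larger.
cat > "$workdir/mixs.yaml" <<'EOF'
slots:
  depth:
    slot_uri: MIXS:0000018
  ph:
    slot_uri: MIXS:0001001
EOF

# Count lines that carry a slot_uri key.
slot_count=$(grep -c 'slot_uri:' "$workdir/mixs.yaml")
echo "$slot_count"
```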
There are 489 lines, including a header, in assets/import_mixs_slots_regardless.tsv
Do we really think we're going to use all of those? See
If I was going to remove some of them, would I have to go through something like a deprecation r
30 of them are marked required based on required: true
25 of those appeared in the error report above
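One way to tally the slots named in "required property" errors is to filter the validator output and extract the quoted names. This is a sketch on an invented three-line log excerpt in the same shape as the errors quoted in this thread; the log file name and variable names are assumptions.

```shell
workdir=$(mktemp -d)
# Invented excerpt shaped like the validator output; not the real log.
cat > "$workdir/validation.log" <<'EOF'
[ERROR] 'abs_air_humidity' is a required property in /biosample_set/0
[ERROR] 'api' is a required property in /biosample_set/0
[ERROR] 'M horizon' is not one of ['A horizon'] in /soil_horizon
EOF

# Keep only "required property" errors and pull out the quoted slot name.
required_slots=$(grep 'is a required property' "$workdir/validation.log" \
  | sed "s/^\[ERROR\] '\([^']*\)'.*/\1/" | sort -u)
required_count=$(printf '%s\n' "$required_slots" | wc -l | tr -d ' ')
echo "$required_count required-property errors:"
echo "$required_slots"
```

The sort -u step collapses repeats, which matters if the same slot is reported for several records.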
Added 'del(.slots[].required)' to assets/yq-for-mixs_subset_modified.txt
Making progress
ValueError: Example src/data/valid/Biosample-soil_horizon.yaml failed validation: [ERROR] 'M horizon' is not one of ['A horizon', 'B horizon', 'C horizon', 'E horizon', 'O horizon', 'Permafrost', 'R layer'] in /soil_horizon
Unfortunately, I just noticed that the source file or URL values in assets/import_mixs_slots_regardless.tsv are all still
next step: change them to:
https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml
https://github.com/GenomicsStandardsConsortium/mixs/commit/1da849346a80b717810a02d7c8ed74a22bcd84de is from Apr 3, 2023
https://github.com/GenomicsStandardsConsortium/mixs/releases/tag/v6.2.0 is from Oct 18, 2023
I updated assets/import_mixs_slots_regardless.tsv and the next build still got the same error message. Maybe the two commits aren't that different.
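A bulk update of that column could be sketched with sed. The old URL and the two-column sheet below are hypothetical placeholders, not the sheet's real contents; only the new URL is taken from this thread.

```shell
workdir=$(mktemp -d)
# Placeholder for whatever URL the sheet held before; purely hypothetical.
old_url='https://example.org/old/mixs.yaml'
new_url='https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml'

# Hypothetical two-column sheet; the real import sheet has more columns.
printf 'slot\tsource file or URL\n' > "$workdir/import_sheet.tsv"
printf 'depth\t%s\n' "$old_url" >> "$workdir/import_sheet.tsv"
printf 'ph\t%s\n' "$old_url" >> "$workdir/import_sheet.tsv"

# Using | as the sed delimiter avoids escaping the slashes in the URLs.
sed "s|$old_url|$new_url|" "$workdir/import_sheet.tsv" > "$workdir/import_sheet.updated.tsv"
updated_rows=$(grep -c "$new_url" "$workdir/import_sheet.updated.tsv")
echo "$updated_rows"
```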
See also assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv ... we shouldn't be using both
target local/mixs_regen/mixs_subset.yaml uses assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv as a dependency
assets/import_mixs_slots_regardless.tsv isn't actually used anywhere! I was editing it today to no end. Deleting.
MIxS enums all have upper snake case names now, like SOIL_HORIZON_ENUM. Fixed in assets/yq-for-mixs_subset_modified.txt
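The rename itself is mechanical. As a sketch, and not the actual change made in the yq instruction file (whose contents are not shown here), awk can upper-case only those range values that end in _enum. The YAML fragment is invented for illustration.

```shell
workdir=$(mktemp -d)
# Invented fragment with two lower-case enum ranges and one class range.
cat > "$workdir/nmdc_fragment.yaml" <<'EOF'
oxy_stat_samp:
  range: oxy_stat_samp_enum
soil_horizon:
  range: soil_horizon_enum
depth:
  range: QuantityValue
EOF

# Upper-case the value after "range: " when it ends in _enum; keep indentation.
fixed=$(awk '/range: [a-z_]+_enum$/ {
    n = index($0, "range: ")
    print substr($0, 1, n + 6) toupper(substr($0, n + 7))
    next
  }
  { print }' "$workdir/nmdc_fragment.yaml")
echo "$fixed"
```

Non-enum ranges such as QuantityValue pass through unchanged because they do not match the pattern.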
make mixs-yaml-clean squeaky-clean all test completes
running make make-rdf now. Expecting some validation errors.
mixs.yaml needs to have patterns materialized?
or the settings need to be added from https://github.com/GenomicsStandardsConsortium/mixs/blob/main/src/mixs/schema/mixs.yaml ?
wc -l local/mongo_as_nmdc_database_validation.log.txt
29820 local/mongo_as_nmdc_database_validation.log.txt
still
29820 local/mongo_as_nmdc_database_validation.log
???
check submission schema
currently slightly ahead of MIxS 6.0: https://raw.githubusercontent.com/microbiomedata/mixs/1da849346a80b717810a02d7c8ed74a22bcd84de/model/schema/mixs.yaml
change to: https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml
see also