microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

change nmdc-schema's MIxS import to GSC's 6.2 YAML #1368

Open turbomam opened 8 months ago

turbomam commented 8 months ago

currently slightly ahead of MIxS 6.0: https://raw.githubusercontent.com/microbiomedata/mixs/1da849346a80b717810a02d7c8ed74a22bcd84de/model/schema/mixs.yaml

change to: https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml

see also

turbomam commented 8 months ago

prepare for rebuild of src/schema/mixs.yaml with make mixs-yaml-clean

then make all

turbomam commented 8 months ago

MAM 2024-04-15: below, I think I changed the source file or URL column values in assets/import_mixs_slots_regardless.tsv to https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml

turbomam commented 8 months ago

ValueError: Conflicting URIs (https://raw.githubusercontent.com/microbiomedata/nmdc-schema/main/src/schema/mixs.yaml, https://w3id.org/nmdc/core) for item: sequencing

a subset in both mixs.yaml and core.yaml

not in use by nmdc-schema from core.yaml, so commented out


same thing for

turbomam commented 8 months ago

comment out from nmdc.yaml. ignore assets/old_python/reconsititute_mixs.py

same with

turbomam commented 8 months ago

ValueError: File "nmdc.yaml", line 639, col 9 Class "Biosample" - unknown slot: "has numeric value" ???

turbomam commented 8 months ago

new enum names:

      oxy_stat_samp:
        range: oxy_stat_samp_enum

ValueError: File "nmdc.yaml", line 1105, col 16 slot: Biosample_oxy_stat_samp - unrecognized range (oxy_stat_samp_enum)

  oxy_stat_samp:
    description: Oxygenation status of sample
    title: oxygenation status of sample
    examples:
      - value: aerobic
    from_schema: https://w3id.org/mixs
    keywords:
      - oxygen
      - sample
      - status
    slot_uri: MIXS:0000753
    range: OXY_STAT_SAMP_ENUM
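
A minimal sketch of how this kind of mismatch could be caught before the build, assuming the schema has already been loaded into a plain dict. The function name `unresolved_ranges` and the toy schema are hypothetical, not part of nmdc-schema or LinkML. Built-in types such as string are not in the toy schema, so a real check would need to add them to the known names.

```python
def unresolved_ranges(schema):
    """Return {slot_name: (range, suggestion)} for ranges that name no known element.

    `schema` is a plain dict shaped like a LinkML YAML file after loading.
    `suggestion` is a case-insensitive match among known names, or None.
    """
    known = set()
    for section in ("classes", "enums", "types"):
        known.update(schema.get(section, {}))
    # Map lower-cased names to their real spelling for case-insensitive hints.
    by_lower = {name.lower(): name for name in known}

    problems = {}
    for slot_name, slot in schema.get("slots", {}).items():
        rng = (slot or {}).get("range")
        if rng is None or rng in known:
            continue
        problems[slot_name] = (rng, by_lower.get(rng.lower()))
    return problems


# Toy schema mirroring the mismatch in this comment (not the real mixs.yaml).
toy = {
    "classes": {"QuantityValue": {}},
    "enums": {"OXY_STAT_SAMP_ENUM": {}},
    "slots": {
        "oxy_stat_samp": {"range": "oxy_stat_samp_enum"},
        "depth": {"range": "QuantityValue"},
    },
}

print(unresolved_ranges(toy))
```

The suggestion column points at the enum that differs only by case, which is the situation reported in the error above.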
turbomam commented 8 months ago

I'm sacrificing the ability to assign "new" MIxS slots to classes in nmdc.yaml without defining them. Before this branch, the build process would detect assigned and undefined MIxS slots (in Biosample and OmicsProcessing) and add them to our mixs.yaml. But it was only able to retrieve slots from a single MIxS class, like MimsSoil.

Moving forward, if an undefined slot is assigned to a class, the build will fail with an error message like ???

In many cases, that can be interpreted to mean that the slot and one of its MIxS classes should be added to assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv
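
A rough sketch of the check described here, written against plain dicts rather than the real build. The helper name `undefined_slot_assignments` and the toy data are assumptions for illustration; the actual build reports the problem through the ValueError messages quoted in this thread.

```python
def undefined_slot_assignments(schema, defined_slots):
    """Return {class_name: [slot, ...]} for slots assigned to a class but not defined."""
    missing = {}
    for class_name, cls in schema.get("classes", {}).items():
        absent = [s for s in (cls or {}).get("slots", []) if s not in defined_slots]
        if absent:
            missing[class_name] = absent
    return missing


# Toy stand-ins for nmdc.yaml and mixs.yaml (hypothetical content).
nmdc_like = {
    "classes": {
        "Biosample": {"slots": ["depth", "host_disease_stat"]},
        "OmicsProcessing": {"slots": ["seq_meth"]},
    }
}
mixs_like = {"slots": {"depth": {}, "seq_meth": {}}}

print(undefined_slot_assignments(nmdc_like, set(mixs_like["slots"])))
```

Each slot reported this way would be a candidate for a new row in the import sheet.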

This has resulted in a much shorter project.Makefile

turbomam commented 8 months ago

project.Makefile uses yq to convert flat MIxS slot ranges into structured nmdc-schema ranges like QuantityValue.

In this branch, I have removed all range changes for slots that haven't been used in MongoDB yet. Files in the src/data path whose names contain the substring iosample-exhasutive illustrate what ranges were expected in the removed yq slot range updates.

turbomam commented 3 months ago

For 2024-04-15 discussion with @sujaypatil96 about current import process

Checked out main, fetched, pulled

poetry update
make squeaky-clean all test

No MIxS cleanup is performed by make squeaky-clean

make mixs-yaml-clean

rm -rf src/schema/mixs.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml

make --dry-run src/schema/mixs.yaml

Depends on


#rm -rf local/mixs_regen/mixs_subset_modified.yaml # triggers complete regeneration
rm -rf local/mixs_regen/mixs_subset.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak
mkdir -p local/mixs_regen
touch local/mixs_regen/.gitkeep
poetry run do_shuttle \
        --recipient_model assets/other_mixs_yaml_files/mixs_template.yaml \
        --config_tsv assets/import_mixs_slots_regardless.tsv \
        --yaml_output local/mixs_regen/mixs_subset.yaml
# switching to TextValue may not add any value. the other range changes do improve the structure of the data.
# ironically changing back to strings for the submission-schema, data harmonizer, submission portal etc.
# may switch source of truth to the MIxS 6.2.2 release candidate
sed 's/quantity value/QuantityValue/' local/mixs_regen/mixs_subset.yaml > local/mixs_regen/mixs_subset_modified.yaml
sed -i.bak 's/range: string/range: TextValue/' local/mixs_regen/mixs_subset_modified.yaml
sed -i.bak 's/range: text value/range: TextValue/' local/mixs_regen/mixs_subset_modified.yaml
grep "^'" assets/yq-for-mixs_subset_modified.txt | while IFS= read -r line ; do echo $line ; eval yq -i $line local/mixs_regen/mixs_subset_modified.yaml ; done
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak
# inject re-structured cur_land_use_enum
#   using '| cat > ' because yq doesn't seem to like redirecting out to a file
yq eval-all \
        'select(fileIndex==1).enums.cur_land_use_enum = select(fileIndex==0).enums.cur_land_use_enum | select(fileIndex==1)' \
        assets/other_mixs_yaml_files/cur_land_use_enum.yaml local/mixs_regen/mixs_subset_modified.yaml | cat > local/mixs_regen/mixs_subset_modified_inj_land_use.yaml
mv local/mixs_regen/mixs_subset_modified_inj_land_use.yaml src/schema/mixs.yaml
rm -rf local/mixs_regen/mixs_subset_modified.yaml.bak

Note the use of sed and yq for converting the string-oriented MIxS into something more object oriented for nmdc-schema

Most of the yq instructions come from this text file: assets/yq-for-mixs_subset_modified.txt
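
For readers less familiar with sed, the three substitutions in the recipe can be approximated in Python as below. This is only a sketch of the text rewriting step, assuming the same patterns as the sed commands; it does not cover the yq edits, and `rewrite_ranges` is a hypothetical name.

```python
import re

# (pattern, replacement) pairs copied from the sed commands in the recipe.
SUBSTITUTIONS = [
    (r"quantity value", "QuantityValue"),
    (r"range: string", "range: TextValue"),
    (r"range: text value", "range: TextValue"),
]


def rewrite_ranges(yaml_text):
    """Apply each substitution once per line, in order."""
    out = []
    for line in yaml_text.splitlines():
        for pattern, replacement in SUBSTITUTIONS:
            # count=1 mirrors sed's default of replacing only the first match per line.
            line = re.sub(pattern, replacement, line, count=1)
        out.append(line)
    return "\n".join(out)


# Made-up fragment, not taken from the real mixs_subset.yaml.
sample = (
    "depth:\n"
    "  range: quantity value\n"
    "samp_name:\n"
    "  range: string\n"
    "env_medium:\n"
    "  range: text value"
)
print(rewrite_ranges(sample))
```

Like the sed version, this is purely textual, so it would also rewrite a matching string that happens to appear in a description rather than a range.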

That could have been done with modifications_and_validation from sheets_and_friends

I would like to change all of the TextValue ranges back to string. In fact, I would like to remove the TextValue class from the schema. That will require notifying all schema users and working with @eecavanna and @brynnz22

turbomam commented 3 months ago

As of this comment, the nmdc-schema includes 494 MIxS slots

yq e '.slots | keys' src/schema/mixs.yaml  | wc -l 

Five of those are grouping slots which aren't associated with any class

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX MIXS: <https://w3id.org/mixs/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select 
*
where {
    {
        values ?t {
            owl:DatatypeProperty
            owl:ObjectProperty
        }
        graph nmdc:nmdc-no-use-native-uris {
            ?p rdf:type ?t .
            filter(strstarts(str(?p), "https://w3id.org/mixs/"))
        }
        optional {
            ?p rdfs:label ?l
        }
    }
    minus {
        graph nmdc:nmdc_relation_graph {
            ?s ?p ?o .
            filter(strstarts(str(?p), "https://w3id.org/mixs/"))
        }
    }
}
  t p l
1 owl:DatatypeProperty MIXS:core_field "core field"
2 owl:DatatypeProperty MIXS:environment_field "environment field"
3 owl:DatatypeProperty MIXS:nucleic_acid_sequence_source_field "nucleic acid sequence source field"
4 owl:DatatypeProperty MIXS:investigation_field "investigation field"
5 owl:DatatypeProperty MIXS:sequencing_field "sequencing field"
turbomam commented 3 months ago

Only 73 have been used in MongoDB as of 2024-04-11

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
#?st ?p ?l ?ot (count(?s) as ?count)
?p ?l (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        minus {
            ?s a ?o
        }
    }
    optional {
        ?p rdfs:label ?l
    }
    filter(strstarts(str(?p), "https://w3id.org/mixs/"))
}
group by ?p ?l 
order by ?l
row p l count
1 MIXS:0000122 abs_air_humidity 192
2 MIXS:0000427 ammonium 45
3 MIXS:0000142 avg_temp 192
4 MIXS:0000432 calcium 109
5 MIXS:0000310 carb_nitro_ratio 1190
6 MIXS:0000751 chem_administration 90
7 MIXS:0000429 chloride 6
8 MIXS:0000177 chlorophyll 47
9 MIXS:0000011 collection_date 7673
10 MIXS:0000692 conduc 180
11 MIXS:0000312 cur_vegetation 124
12 MIXS:0000018 depth 6011
13 MIXS:0000434 diss_inorg_carb 25
14 MIXS:0000698 diss_inorg_nitro 45
15 MIXS:0000139 diss_iron 6
16 MIXS:0000433 diss_org_carb 25
17 MIXS:0000119 diss_oxygen 435
18 MIXS:0000093 elev 6635
19 MIXS:0000012 env_broad_scale 8158
20 MIXS:0000013 env_local_scale 8158
21 MIXS:0000014 env_medium 8158
22 MIXS:0000008 experimental_factor 61
23 MIXS:0000556 fertilizer_regm 192
24 MIXS:0000010 geo_loc_name 8126
25 MIXS:0000875 gravidity 61
26 MIXS:0001043 growth_facil 328
27 MIXS:0000255 host_age 61
28 MIXS:0000866 host_body_habitat 61
29 MIXS:0000888 host_body_product 61
30 MIXS:0000867 host_body_site 61
31 MIXS:0000248 host_common_name 253
32 MIXS:0000869 host_diet 61
33 MIXS:0000257 host_dry_mass 192
34 MIXS:0000365 host_genotype 61
35 MIXS:0000264 host_height 192
36 MIXS:0000251 host_life_stage 61
37 MIXS:0000811 host_sex 61
38 MIXS:0000250 host_taxid 401
39 MIXS:0000100 humidity 192
40 MIXS:0000009 lat_lon 8158
41 MIXS:0000431 magnesium 109
42 MIXS:0000339 micro_biomass_meth 134
43 MIXS:0000425 nitrate 18
44 MIXS:0000504 nitro 919
45 MIXS:0000508 org_carb 1407
46 MIXS:0000754 perturbation 61
47 MIXS:0001001 ph 4539
48 MIXS:0000725 photon_flux 192
49 MIXS:0000430 potassium 109
50 MIXS:0000183 salinity 19
51 MIXS:0000002 samp_collec_device 4322
52 MIXS:0001225 samp_collec_method 130
53 MIXS:0001107 samp_name 2386
54 MIXS:0000001 samp_size 629
55 MIXS:0000110 samp_store_temp 328
56 MIXS:0001320 samp_taxon_id 1443
57 MIXS:0000322 sieving 133
58 MIXS:0000428 sodium 6
59 MIXS:0001082 soil_horizon 4682
60 MIXS:0000112 solar_irradiance 192
61 MIXS:0000738 soluble_react_phosp 45
62 MIXS:0000026 source_mat_id 253
63 MIXS:0000327 store_cond 136
64 MIXS:0000423 sulfate 6
65 MIXS:0000113 temp 4900
66 MIXS:0000525 tot_carb 192
67 MIXS:0000102 tot_nitro 103
68 MIXS:0000530 tot_nitro_content 1432
69 MIXS:0000533 tot_org_carb 32
70 MIXS:0000117 tot_phosp 172
71 MIXS:0000185 water_content 4331
72 MIXS:0000757 wind_direction 192
73 MIXS:0000118 wind_speed 192
turbomam commented 3 months ago
wget https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml
yq e '.slots | keys' mixs.yaml | sed 's/^- //' | sort > mixs.6.2.slots.txt
turbomam commented 3 months ago

Where used-73-mixs-slots.csv is the output of the MIxS slot usage SPARQL above

awk -F',' 'NR>1 {print $2}' used-73-mixs-slots.csv | sort > used-73-mixs-slot-names.txt
turbomam commented 3 months ago
comm -23 used-73-mixs-slot-names.txt mixs.6.2.slots.txt 

These are the only slots that we are using that aren't present verbatim in MIxS 6.2!

samp_collec_device
samp_collec_method
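
The comm -23 comparison reports lines that appear only in the first file. A small Python equivalent, as a sketch: the function name is hypothetical and the `mixs_names` list is made up for illustration, not the real MIxS 6.2 slot list.

```python
def only_in_first(first_names, second_names):
    """Names present in the first collection but not the second, sorted."""
    return sorted(set(first_names) - set(second_names))


# Toy inputs: a few names from the usage table, and an invented reference list.
used_names = ["depth", "samp_collec_device", "samp_collec_method", "temp"]
mixs_names = ["depth", "ph", "temp"]

print(only_in_first(used_names, mixs_names))
```

Unlike comm, this does not require sorted input, because the comparison is done on sets.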

turbomam commented 1 week ago

resuming work in this issue/branch

in preparation for a MIxS environmental triad parent (or grouping?) slot, following @aclum's example

for gold_path_field, which is used as a parent slot, i.e. there are other slots that assert is_a: gold_path_field

turbomam commented 1 week ago

Just reran make mixs-yaml-clean squeaky-clean all test

ValueError: File "nmdc.yaml", line 483, col 9 Class "Biosample" - unknown slot: "host_disease_stat"

turbomam commented 1 week ago

I started fixing up some instantiation unit tests, but I think they have been deleted from other branches because they don't test anything that src/data files can't test in combination with the linkml-run-examples step in the examples/output target

so I skipped one with @unittest.skip

turbomam commented 1 week ago

removed two of @eecavanna's doctests in nmdc_schema/nmdc_data.py about the expected size differential between the asserted and materialized schemas

I would hope to understand this better and then put them back in

turbomam commented 1 week ago

was adding seq_meth annotations to OmicsProcessing examples even though I don't think we have used them before. After I got past that, I got an error message like this. I don't expect most of those slots to be required for Biosamples. I think I need to reassess whether anything should ever be considered required from MIxS outside of a slot usage.

[ERROR] 'abs_air_humidity' is a required property in /biosample_set/0
[ERROR] 'add_recov_method' is a required property in /biosample_set/0
[ERROR] 'api' is a required property in /biosample_set/0
[ERROR] 'basin' is a required property in /biosample_set/0
[ERROR] 'build_occup_type' is a required property in /biosample_set/0
[ERROR] 'building_setting' is a required property in /biosample_set/0
[ERROR] 'collection_date' is a required property in /biosample_set/0
[ERROR] 'filter_type' is a required property in /biosample_set/0
[ERROR] 'hc_produced' is a required property in /biosample_set/0
[ERROR] 'hcr' is a required property in /biosample_set/0
[ERROR] 'heat_cool_type' is a required property in /biosample_set/0
[ERROR] 'indoor_space' is a required property in /biosample_set/0
[ERROR] 'iwf' is a required property in /biosample_set/0
[ERROR] 'light_type' is a required property in /biosample_set/0
[ERROR] 'occup_density_samp' is a required property in /biosample_set/0
[ERROR] 'occup_samp' is a required property in /biosample_set/0
[ERROR] 'rel_air_humidity' is a required property in /biosample_set/0
[ERROR] 'samp_collect_point' is a required property in /biosample_set/0
[ERROR] 'samp_taxon_id' is a required property in /biosample_set/0
[ERROR] 'samp_type' is a required property in /biosample_set/0
[ERROR] 'seq_meth' is a required property in /biosample_set/0
[ERROR] 'space_typ_state' is a required property in /biosample_set/0
[ERROR] 'typ_occup_density' is a required property in /biosample_set/0
[ERROR] 'water_cut' is a required property in /biosample_set/0
ValueError: Example src/data/valid/Database-nmdc-example.yaml failed validation:

turbomam commented 1 week ago

There are 483 slots in src/schema/mixs.yaml now, as measured by searching for slot_uri

There are 489 lines, including a header, in assets/import_mixs_slots_regardless.tsv

Do we really think we're going to use all of those? See

If I was going to remove some of them, would I have to go through something like a deprecation r

30 of them are marked required based on required: true

25 of those appeared in the error report above
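
One way to cross-check the required slots against the validation report is to pull the property names out of the error text. The sketch below assumes the message wording shown in the report above; the function name, the sample report, and the `required_in_schema` set are hypothetical.

```python
import re

# Matches the property name in messages of the form shown in the report.
REQUIRED_ERROR = re.compile(r"ERROR\] '([^']+)' is a required property")


def required_properties_in(report):
    """Sorted, de-duplicated property names reported as required."""
    return sorted(set(REQUIRED_ERROR.findall(report)))


# Shortened stand-in for the validation output.
report = (
    "[ERROR] 'abs_air_humidity' is a required property in /biosample_set/0 "
    "[ERROR] 'heat_cool_type' is a required property in /biosample_set/0 "
    "[ERROR] 'seq_meth' is a required property in /biosample_set/0"
)
# Hypothetical set of slots marked required: true in the schema.
required_in_schema = {"abs_air_humidity", "heat_cool_type", "seq_meth", "lat_lon"}

reported = required_properties_in(report)
not_reported = sorted(required_in_schema - set(reported))
print(reported)
print(not_reported)
```

Slots that are required but absent from the report are presumably either populated in the example data or not assigned to the class being validated.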

turbomam commented 1 week ago

Added 'del(.slots[].required)' to assets/yq-for-mixs_subset_modified.txt
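
For reference, the effect of that yq expression is to drop the required key from every slot. A Python sketch of the same transformation on an already-loaded dict, using a hypothetical helper name and a toy schema:

```python
import copy


def drop_required(schema):
    """Return a copy of `schema` with the `required` key removed from every slot."""
    result = copy.deepcopy(schema)
    for slot in result.get("slots", {}).values():
        if isinstance(slot, dict):
            slot.pop("required", None)
    return result


# Toy schema (not the real mixs.yaml).
toy = {
    "slots": {
        "seq_meth": {"required": True, "range": "string"},
        "depth": {"range": "QuantityValue"},
    }
}
relaxed = drop_required(toy)
print(relaxed)
```

Working on a deep copy keeps the original dict intact, so the before and after versions can be compared.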

Making progress

ValueError: Example src/data/valid/Biosample-soil_horizon.yaml failed validation: [ERROR] 'M horizon' is not one of ['A horizon', 'B horizon', 'C horizon', 'E horizon', 'O horizon', 'Permafrost', 'R layer'] in /soil_horizon

turbomam commented 1 week ago

Unfortunately just noticed that the source file or URL values in assets/import_mixs_slots_regardless.tsv are all still

https://raw.githubusercontent.com/microbiomedata/mixs/1da849346a80b717810a02d7c8ed74a22bcd84de/model/schema/mixs.yaml

next step: change them to:

https://raw.githubusercontent.com/GenomicsStandardsConsortium/mixs/v6.2.0/src/mixs/schema/mixs.yaml

turbomam commented 1 week ago

https://github.com/GenomicsStandardsConsortium/mixs/commit/1da849346a80b717810a02d7c8ed74a22bcd84de is from Apr 3, 2023

https://github.com/GenomicsStandardsConsortium/mixs/releases/tag/v6.2.0 is from Oct 18, 2023

I updated assets/import_mixs_slots_regardless.tsv and the next build still got the same error message. Maybe the two commits aren't that different.

See also assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv... we shouldn't be using both

target local/mixs_regen/mixs_subset.yaml uses assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv as a dependency

assets/import_mixs_slots_regardless.tsv isn't actually used anywhere! I was editing it today to no purpose. Deleting.
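
A quick way to spot this kind of orphaned file is to check whether each asset path is mentioned in the Makefile text at all. This is a crude sketch under that assumption; the function name is hypothetical and the Makefile fragment is invented for illustration, not copied from project.Makefile.

```python
def unreferenced(paths, makefile_text):
    """Return the paths that never appear verbatim in the Makefile text."""
    return [p for p in paths if p not in makefile_text]


# Invented Makefile fragment for illustration only.
makefile_text = (
    "local/mixs_regen/mixs_subset.yaml: "
    "assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv\n"
    "\tpoetry run do_shuttle --config_tsv $<\n"
)
candidates = [
    "assets/other_mixs_yaml_files/mixs_slots_import_sheet.tsv",
    "assets/import_mixs_slots_regardless.tsv",
]

print(unreferenced(candidates, makefile_text))
```

A verbatim substring test misses paths built from Makefile variables, so an empty result is more trustworthy than a non-empty one.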

turbomam commented 1 week ago

MIxS enums all have upper snake case names now, like SOIL_HORIZON_ENUM. Fixed in assets/yq-for-mixs_subset_modified.txt

turbomam commented 1 week ago

make mixs-yaml-clean squeaky-clean all test completes

running make make-rdf now. Expecting some validation errors.

turbomam commented 1 week ago

mixs.yaml needs to have patterns materialized ?

or the settings need to be added from https://github.com/GenomicsStandardsConsortium/mixs/blob/main/src/mixs/schema/mixs.yaml?

wc -l local/mongo_as_nmdc_database_validation.log.txt

29820 local/mongo_as_nmdc_database_validation.log.txt

turbomam commented 1 week ago

still

29820 local/mongo_as_nmdc_database_validation.log

???

turbomam commented 1 week ago

check submission schema