microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

`nmdc:bsm-11-1gzgce32` does not validate against berkeley-schema-fy24 following `make-rdf` #1667

Closed: turbomam closed 4 months ago

turbomam commented 10 months ago

In branch PR10-type-slot-required-migration

partial project.Makefile:

local/mongo_as_unvalidated_nmdc_database.yaml:
    date  # 276.50 seconds on 2023-08-30 without functional_annotation_agg or metaproteomics_analysis_set
    time $(RUN) pure-export \
        --client-base-url https://api.microbiomedata.org \
        --endpoint-prefix nmdcschema \
        --env-file local/.env \
        --max-docs-per-coll 200000 \
        --mongo-db-name nmdc \
        --mongo-host localhost \
        --mongo-port 27777 \
        --output-yaml $@ \
        --page-size 200000 \
        --schema-file src/schema/nmdc.yaml \
        --selected-collections biosample_set \
        --skip-collection-check

local/mongo_as_nmdc_database_rdf_safe.yaml: nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml local/mongo_as_unvalidated_nmdc_database.yaml
    date # 449.56 seconds on 2023-08-30 without functional_annotation_agg or metaproteomics_analysis_set
    time $(RUN) migration-recursion \
        --migrator-name Migrator_from_X_to_PR10 \
        --schema-path $(word 1,$^) \
        --input-path $(word 2,$^) \
        --salvage-prefix generic \
        --output-path $@

turbomam commented 10 months ago

This auto-formatted extract does validate with

poetry run linkml-validate --schema src/schema/nmdc.yaml --target-class Database local/extract.yaml

biosample_set:
  - analysis_type:
      - metagenomics
    chem_administration:
      - has_raw_value: prednisone [CHEBI:8382];2019-09-18
        term:
          id: CHEBI:8382
          name: prednisone
          type: nmdc:OntologyClass
        type: nmdc:ControlledTermValue
      - has_raw_value: ondansetron [CHEBI:7773];2019-09-18
        term:
          id: CHEBI:7773
          name: ondansetron
          type: nmdc:OntologyClass
        type: nmdc:ControlledTermValue
    collection_date:
      has_raw_value: '2019-09-18'
      type: nmdc:TimestampValue
    depth:
      has_numeric_value: 0.0
      has_raw_value: '0'
      type: nmdc:QuantityValue
    elev: 23.1648
    env_broad_scale:
      has_raw_value: Animal-associated environment [ENVO:01001002]
      term:
        id: ENVO:01001002
        name: Animal-associated environment
        type: nmdc:OntologyClass
      type: nmdc:ControlledIdentifiedTermValue
    env_local_scale:
      has_raw_value: feces material [ENVO:00002003]
      term:
        id: ENVO:00002003
        name: feces material
        type: nmdc:OntologyClass
      type: nmdc:ControlledIdentifiedTermValue
    env_medium:
      has_raw_value: feces material [ENVO:00002003]
      term:
        id: ENVO:00002003
        name: feces material
        type: nmdc:OntologyClass
      type: nmdc:ControlledIdentifiedTermValue
    env_package:
      has_raw_value: Host-associated
      type: nmdc:TextValue
    experimental_factor:
      has_raw_value: antibiotic treatment
      type: nmdc:ControlledTermValue
    geo_loc_name:
      has_raw_value: 'USA: California, Davis'
      type: nmdc:TextValue
    gravidity:
      has_raw_value: 'no'
      type: nmdc:TextValue
    host_age:
      has_numeric_value: 9.0
      has_raw_value: 9 years
      has_unit: years
      type: nmdc:QuantityValue
    host_body_habitat:
      has_raw_value: gastrointestinal tract
      type: nmdc:TextValue
    host_body_product:
      has_raw_value: Feces [UBERON:0001988]
      term:
        id: UBERON:0001988
        name: Feces
        type: nmdc:OntologyClass
      type: nmdc:ControlledTermValue
    host_body_site:
      has_raw_value: large intestine [UBERON:0000059]
      term:
        id: UBERON:0000059
        name: large intestine
        type: nmdc:OntologyClass
      type: nmdc:ControlledTermValue
    host_common_name:
      has_raw_value: Canine
      type: nmdc:TextValue
    host_diet:
      - has_raw_value: Royal Canin low fat
        type: nmdc:TextValue
    host_genotype:
      has_raw_value: Golden retreiver
      type: nmdc:TextValue
    host_life_stage:
      has_raw_value: adult
      type: nmdc:TextValue
    host_sex: female
    id: nmdc:bsm-11-1gzgce32
    lat_lon:
      has_raw_value: 38.5382 -121.7617
      latitude: 38.5382
      longitude: -121.7617
      type: nmdc:GeolocationValue
    name: Canine with Bacteremia, urinary tract infection - treatment with Prednisone,
      Ondansetron (HS25_11)
    part_of:
      - nmdc:sty-11-hdd4bf83
    perturbation:
      - has_raw_value: Bacteremia, urinary tract infection
        type: nmdc:TextValue
    samp_name: Canine with Bacteremia, urinary tract infection - treatment with Prednisone,
      Ondansetron (HS25_11)
    source_mat_id:
      has_raw_value: UUID:67847769-4444-37f0-a011-f43daf6485aa
      type: nmdc:TextValue
    type: nmdc:Biosample
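
In the extract above, each chem_administration entry carries the same information twice: once packed into has_raw_value and once as a structured term. The following is a minimal Python sketch of how the raw string could be split into those parts. The layout "<name> [<CURIE>];<timestamp>" is an assumption inferred only from the raw values quoted in this thread, not from the schema, and parse_chem_administration is a hypothetical helper, not part of nmdc-schema.

```python
import re

# Assumed layout, inferred only from the raw values quoted in this thread:
#   "<name> [<CURIE>];<timestamp>"
RAW_VALUE_PATTERN = re.compile(
    r"^(?P<name>.+?)\s*\[(?P<id>[^\]\s]+)\];(?P<timestamp>.+)$"
)


def parse_chem_administration(raw_value):
    """Split a chem_administration raw value into name, id and timestamp.

    Returns None when the string does not follow the assumed layout.
    """
    match = RAW_VALUE_PATTERN.match(raw_value)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(parse_chem_administration("prednisone [CHEBI:8382];2019-09-18"))
```

A raw value without the ";<timestamp>" suffix, such as the env_medium value, does not match the assumed layout and yields None.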

turbomam commented 10 months ago

try Michael's get-study-related-records with nmdc:sty-11-hdd4bf83 (instead of the pre-loaded nmdc:sty-11-aygzgv51)

turbomam commented 10 months ago

local/nmdc-sty-11-hdd4bf83.yaml:
    $(RUN) get-study-related-records \
        --api-base-url https://api.microbiomedata.org \
        extract-study \
        --study-id $(subst nmdc-,nmdc:,$(basename $(notdir $@))) \
        --output-file $@

local/nmdc-sty-11-hdd4bf83-validation.log: nmdc_schema/nmdc_schema_accepting_legacy_ids.py local/nmdc-sty-11-hdd4bf83.yaml
    # The leading "-" lets make continue even if this step reports an error. That may or may not be the best choice in this case.
    - $(RUN) linkml-validate --schema $^ > $@

This get-study-related-records run took 27 minutes from my home in Philadelphia. I have 250 Mbps service and an Intel NUC with 48 GB RAM, an i9 with 12 cores that can burst to 4.5 GHz, and an NVMe SSD.
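
The --study-id argument in the first rule is derived from the target filename with nested Make text functions. As a rough illustration of what that expression evaluates to, here is a Python sketch of the same three steps. The function name study_id_from_target is hypothetical; only the notdir, basename, and subst steps come from the rule itself.

```python
import os.path


def study_id_from_target(target_path):
    """Mimic $(subst nmdc-,nmdc:,$(basename $(notdir $@))) for one target."""
    file_name = os.path.basename(target_path)  # like $(notdir $@): drop the directory
    stem, _ext = os.path.splitext(file_name)   # like $(basename ...): drop the suffix
    return stem.replace("nmdc-", "nmdc:")      # like $(subst nmdc-,nmdc:,...)


if __name__ == "__main__":
    print(study_id_from_target("local/nmdc-sty-11-hdd4bf83.yaml"))
```

Because the ID is computed from the filename, the same rule body could serve other studies by changing only the target name, assuming the file is named after the study ID with "nmdc:" written as "nmdc-".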

turbomam commented 10 months ago

napa's nmdc:sty-11-aygzgv51 = production gold:Gs0114663

turbomam commented 10 months ago

There are Biosamples with chem_administrations in valid example files like src/data/valid/Biosample-exhaustive-issue-796-bye-yq-for-7-4-10.yaml, but their ControlledTermValues only have has_raw_values

chem_administration:
  - has_raw_value: agar [CHEBI:2509];2018-05-11T20:00Z

turbomam commented 10 months ago

should add a full-fledged chem_administration to the examples
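
To make "full-fledged" concrete: the entries in the existing example files have only has_raw_value, while the production record also has a term with an id. A small Python sketch for spotting the raw-only entries in an already-loaded Biosample dict might look like this. Both helper names are hypothetical, and treating "has a term with an id" as the definition of full-fledged is my assumption, not something the schema states.

```python
def has_resolved_term(controlled_term_value):
    """True when the value carries a structured term with an id."""
    term = controlled_term_value.get("term")
    return isinstance(term, dict) and bool(term.get("id"))


def raw_only_entries(biosample, slot="chem_administration"):
    """Return the raw values of entries in `slot` that lack a resolved term."""
    return [
        entry.get("has_raw_value")
        for entry in biosample.get(slot, [])
        if not has_resolved_term(entry)
    ]


if __name__ == "__main__":
    example = {
        "chem_administration": [
            {"has_raw_value": "agar [CHEBI:2509];2018-05-11T20:00Z"},
        ]
    }
    print(raw_only_entries(example))
```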

turbomam commented 10 months ago

nmdc:bsm-11-1gzgce32 is present in the production MongoDB, with a full-fledged chem_administration

https://api.microbiomedata.org/nmdcschema/ids/nmdc%3Absm-11-1gzgce32
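
The colon in the ID has to be percent-encoded in that URL. A short Python sketch of building the lookup URL for any ID, assuming the /nmdcschema/ids/ path shown above is the only part that varies (id_lookup_url is a hypothetical helper):

```python
from urllib.parse import quote

API_BASE_URL = "https://api.microbiomedata.org"


def id_lookup_url(nmdc_id, base_url=API_BASE_URL):
    """Build the /nmdcschema/ids/ URL for an ID, percent-encoding the colon."""
    return f"{base_url}/nmdcschema/ids/{quote(nmdc_id, safe='')}"


if __name__ == "__main__":
    print(id_lookup_url("nmdc:bsm-11-1gzgce32"))
```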

turbomam commented 10 months ago

poetry run linkml-validate \
  --schema nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml \
  --target-class Biosample \
  src/data/valid/Biosample-bsm-11-1gzgce32-with-chem_adminstration.yaml 

No issues found

aclum commented 4 months ago

@turbomam can this be closed?

turbomam commented 4 months ago

wget -O bsm-11-1gzgce32.json "https://api.microbiomedata.org/nmdcschema/ids/nmdc%3Absm-11-1gzgce32"
linkml-validate --schema nmdc_schema/nmdc_materialized_patterns.yaml --target-class Biosample bsm-11-1gzgce32.json 

No issues found