microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

any future addition or modification of slots requires `domain` assertions #1476

Closed turbomam closed 1 month ago

turbomam commented 7 months ago

When we say that some Biosample has total_strontium {'has_numerical_value': 15, 'has_unit': 'ppm'}, we are saying that the domain of total_strontium includes, at a minimum, the Biosample class. It may also include additional classes, which would ideally come from the same branch of the class hierarchy. For example, ProcessedSamples could also have a total_strontium value. If that were the case, then a reasonable domain for total_strontium would be MaterialSample, given the current classes in the schema.
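A minimal sketch of that reasoning, in Python: given the classes that actually use a slot, pick the most specific class that covers all of them. The `PARENTS` mapping below is a hand-written toy hierarchy for illustration (it is not read from nmdc-schema), and `narrowest_common_domain` is a hypothetical helper name.

```python
# Toy single-inheritance hierarchy (child -> parent).
# Assumption: written by hand for illustration, not loaded from nmdc-schema.
PARENTS = {
    "Biosample": "MaterialSample",
    "ProcessedSample": "MaterialSample",
    "MaterialSample": "NamedThing",
    "Study": "NamedThing",
    "NamedThing": None,
}


def ancestors(cls):
    """Return cls followed by its ancestors, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENTS[cls]
    return chain


def narrowest_common_domain(classes):
    """Return the most specific class that covers every class in `classes`."""
    classes = list(classes)
    if not classes:
        return None
    shared = set(ancestors(classes[0]))
    for cls in classes[1:]:
        shared &= set(ancestors(cls))
    # Walk upward from the first class; the first shared ancestor is the narrowest.
    for candidate in ancestors(classes[0]):
        if candidate in shared:
            return candidate
    return None
```

With this toy hierarchy, a slot used only by Biosample keeps Biosample as its domain, while a slot used by both Biosample and ProcessedSample is lifted to MaterialSample.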

turbomam commented 7 months ago

Doing this for object properties (slots that relate an instance of one class to an instance of another class) is the top priority. Among other things, it will allow us to make helpful visualizations.

Doing it for data properties (slots that relate an instance of some class to values, such as one string or a list of integers) might be a lower priority.
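As a rough sketch of the distinction, a slot can be sorted into one bucket or the other by checking whether its range names a class. The `CLASS_NAMES` set and the `property_kind` helper below are assumptions made for illustration, and enumerations are lumped in with data properties as a simplification.

```python
# Assumption: a small hand-picked set of class names; a real check would
# read the full list of classes from the schema.
CLASS_NAMES = {"Biosample", "Study", "QuantityValue", "ExternalIdentifier"}


def property_kind(slot_range, class_names=CLASS_NAMES):
    """Classify a slot as an 'object' or 'data' property based on its range.

    A range that names a class means the slot links two class instances;
    anything else (String, Float, enums, ...) is treated as a plain value.
    """
    return "object" if slot_range in class_names else "data"
```

Under these assumptions, a slot with range QuantityValue is handled first, and a slot with range String can wait.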

turbomam commented 7 months ago

This query isn't perfect for this task, but it does highlight several sub-optimal patterns:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX linkml: <https://w3id.org/linkml/>
select ?l ?r
where {
    graph nmdc:nmdc {
        ?s a owl:ObjectProperty .
        minus {
            ?s rdfs:domain ?d
        }
        optional {
            ?s rdfs:range ?r
        }
        minus {
            ?s rdfs:range linkml:String
        }
        minus {
            ?s rdfs:range linkml:Float
        }
        minus {
            ?s rdfs:range linkml:Integer
        }
        minus {
            ?s rdfs:range linkml:Boolean
        }
        minus {
            ?s rdfs:range linkml:Uriorcurie
        }
        optional {
            ?s rdfs:label ?l
        }
        filter(strstarts(str(?s), "https://w3id.org/nmdc/")) # MIXS is the only other namespace 2023-12-08
        filter(!strends(str(?s), "_set")) # in progress
    }
}
order by ?l
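
For readers less familiar with SPARQL, the filtering steps of the query above can be approximated in plain Python. This is only a sketch under stated assumptions: each slot is a hypothetical dict with `uri`, `label`, `domain`, and `range` keys, every record is already known to be an object property, and each slot has at most one range.

```python
NMDC_PREFIX = "https://w3id.org/nmdc/"

# The five LinkML primitive ranges that the query subtracts.
PRIMITIVE_RANGES = {"String", "Float", "Integer", "Boolean", "Uriorcurie"}


def slots_missing_domain(slots):
    """Return (label, range) pairs for slots the query would report."""
    hits = []
    for slot in slots:
        if slot.get("domain") is not None:
            continue  # minus { ?s rdfs:domain ?d }
        if slot.get("range") in PRIMITIVE_RANGES:
            continue  # minus { ?s rdfs:range linkml:String } and friends
        if not slot["uri"].startswith(NMDC_PREFIX):
            continue  # filter(strstarts(...))
        if slot["uri"].endswith("_set"):
            continue  # filter(!strends(..., "_set"))
        hits.append((slot.get("label"), slot.get("range")))
    # order by ?l, with unlabeled slots sorted first
    return sorted(hits, key=lambda pair: pair[0] or "")
```

The function keeps the optional label and range as `None` when they are absent, which mirrors the `optional` clauses in the query.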
turbomam commented 7 months ago
?l used in organizational inc mixin to remove data property not in use
input_volume -> PlannedProcess ?
has_unit !!! Biosample !!! String
ended_at_time Activity
execution_resource Activity String
started_at_time Activity String
version Activity String
was_informed_by Activity
ammonium_nitrogen Biosample
analysis_type Biosample
biosample_categories Biosample
bulk_elect_conductivity Biosample
collection_date_inc Biosample String
collection_time Biosample String
collection_time_inc Biosample String
dna_collect_site Biosample String
dna_cont_type Biosample JgiContTypeEnum
dna_cont_well Biosample String
dna_container_id Biosample String
dna_dnase Biosample YesNoEnum
dna_isolate_meth Biosample String
dna_organisms Biosample String
dna_project_contact Biosample String
dna_samp_id Biosample String
dna_sample_format Biosample DnaSampleFormatEnum
dna_sample_name Biosample String
dna_seq_project Biosample String
dna_seq_project_name Biosample String
dna_seq_project_pi Biosample String
dna_volume Biosample Float
dnase_rna Biosample YesNoEnum
emsl_biosample_identifiers Biosample ExternalIdentifier
env_package Biosample TextValue
experimental_factor_other Biosample String
filter_method Biosample String
igsn_biosample_identifiers Biosample ExternalIdentifier
img_identifiers Biosample ExternalIdentifier
insdc_biosample_identifiers Biosample ExternalIdentifier
isotope_exposure Biosample String
lbc_thirty Biosample
lbceq Biosample
manganese Biosample
micro_biomass_c_meth Biosample String
micro_biomass_n_meth Biosample String
microbial_biomass_c Biosample String
microbial_biomass_n Biosample String
neon_biosample_identifiers Biosample ExternalIdentifier
nitrate_nitrogen Biosample String
nitrite_nitrogen Biosample String
non_microb_biomass Biosample String
non_microb_biomass_method Biosample String
org_nitro_method Biosample String
other_treatment Biosample String
project_id Biosample String
proposal_dna Biosample String
proposal_rna Biosample String
replicate_number Biosample String
sample_shipped Biosample String
sample_type Biosample SampleTypeEnum
start_date_inc Biosample String
start_time_inc Biosample String
subsurface_depth Biosample
technical_reps Biosample String
zinc Biosample
rna_collect_site Biosample String
rna_cont_type Biosample JgiContTypeEnum
rna_cont_well Biosample String
rna_container_id Biosample String
rna_isolate_meth Biosample String
rna_organisms Biosample String
rna_project_contact Biosample String
rna_samp_id Biosample String
rna_sample_format Biosample DnaSampleFormatEnum
rna_sample_name Biosample String
rna_seq_project Biosample String
rna_seq_project_name Biosample String
rna_seq_project_pi Biosample String
rna_volume Biosample Float
dna_concentration Biosample; ProcessedSample Float
rna_concentration Biosample; ProcessedSample Float
ecosystem Biosample; Study
ecosystem_category Biosample; Study
ecosystem_subtype Biosample; Study
ecosystem_type Biosample; Study
specific_ecosystem Biosample; Study String
alternative_identifiers Biosample; Study; NamedThing; MetaboliteQuantification TRUE
functional_annotation_agg Database
data_object_type DataObject
file_size_bytes DataObject Bytes
extractant Extraction
extraction_method Extraction String
extraction_target Extraction ExtractionTargetEnum
input_mass Extraction → PlannedProcess
filter_pore_size FiltrationProcess QuantityValue
separation_method FiltrationProcess SeparationMethodEnum
subject FunctionalAnnotation
has_function FunctionalAnnotation String
metagenome_annotation_id FunctionalAnnotationAggMember
encodes GenomeFeature
end GenomeFeature Integer
feature_type GenomeFeature String
phase GenomeFeature Integer
start GenomeFeature Integer
strand GenomeFeature
display_order ImageValue
library_type LibraryPreparation LibraryTypeEnum
members_id MagBin String
total_bases MagBin String
mags_list MagsAnalysisActivity
gold_analysis_project_identifiers Meta ExternalIdentifier
metabolite_quantified MetaboliteQuantification
has_metabolite_quantifications MetabolomicsAnalysisActivity
gold_biosample_identifiers MetagenomeAnnotationActivity; MetatranscriptomeAnnotationActivity ExternalIdentifier
insdc_assembly_identifiers MetagenomeAssembly; MetatranscriptomeAssembly String
has_peptide_quantifications MetaproteomicsAnalysisActivity
duration MixingProcess → PlannedProcess
was_generated_by MULTIPLE UNRELATED CLASSES
id NamedThing String
gold_sequencing_project_identifiers OmicsProcessing ExternalIdentifier
insdc_experiment_identifiers OmicsProcessing ExternalIdentifier
omics_type OmicsProcessing
all_proteins PeptideQuantification; ProteinQuantification
best_protein PeptideQuantification; ProteinQuantification
instrument_name PlannedProcess String
processing_institution PlannedProcess
protocol_link PlannedProcess
quality_control_report PlannedProcess
volume PlannedProcess
biomaterial_purity ProcessedSample
status QualityControlReport StatusEnum
has_maximum_numeric_value QuantityValue Float
has_minimum_numeric_value QuantityValue Float
direction Reaction
left_participants Reaction
right_participants Reaction
chemical ReactionParticipant
compound SolutionComponent
concentration SolutionComponent
emsl_project_identifiers Study
gnps_task_identifiers Study ExternalIdentifier
gold_study_identifiers Study ExternalIdentifier
jgi_portal_study_identifiers Study ExternalIdentifier
mgnify_project_identifiers Study ExternalIdentifier
neon_study_identifiers Study ExternalIdentifier
notes Study String
related_identifiers Study String
study_category Study StudyCategoryEnum
insdc_bioproject_identifiers Study; OmicsProcessing ExternalIdentifier
principal_investigator Study; OmicsProcessing
websites Study; PersonValue String
contained_in SubSamplingProcess
mass SubSamplingProcess
temperature SubSamplingProcess → PlannedProcess
container_size SubSamplingProcess; FiltrationProcess
language TextValue LanguageCode
analysis_identifiers TRUE
assembly_identifiers TRUE
attribute TRUE TRUE
biosample_identifiers TRUE
emsl_identifiers TRUE
gff_coordinate TRUE
gnps_identifiers TRUE
gold_identifiers TRUE
has_participants TRUE
igsn_identifiers TRUE
insdc_identifiers TRUE
jgi_portal_identifiers TRUE
metagenome_assembly_parameter TRUE
mgnify_identifiers TRUE
neon_identifiers TRUE
omics_processing_identifiers TRUE
read_qc_analysis_statistic TRUE
study_identifiers TRUE
external_database_identifiers TRUE ExternalIdentifier
date_created ?
etl_software_version ?
insdc_analysis_identifiers ExternalIdentifier ?
insdc_secondary_sample_identifiers ExternalIdentifier ?
insdc_sra_ena_study_identifiers ExternalIdentifier ?
mgnify_analysis_identifiers ExternalIdentifier ?
model InstrumentModelEnum ?
sample_collection_month String ?
value ?
vendor InstrumentVendorEnum ?
emsl_store_temp String ?
turbomam commented 1 month ago

I was going rogue/not trusting inferences and asking to assert domains for the sake of diagram drawing.

So let's actually remove the domain assertions.
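
One way the removal could be scripted is sketched below. It assumes the slot definitions have already been loaded into a dict that maps each slot name to its definition (for example from the schema YAML); `strip_domains` is a hypothetical helper, not part of any NMDC or LinkML tooling.

```python
import copy


def strip_domains(slots):
    """Return a copy of `slots` with every `domain` assertion removed.

    `slots` maps slot names to definition dicts; a definition may be None
    when the slot is declared without a body.
    """
    cleaned = copy.deepcopy(slots)
    for definition in cleaned.values():
        if isinstance(definition, dict):
            definition.pop("domain", None)
    return cleaned
```

Working on a deep copy leaves the loaded definitions untouched, so the before and after versions can be compared prior to writing anything back.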