microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Master `Information` class issue #1947

Open turbomam opened 2 months ago

turbomam commented 2 months ago

The nmdc-schema has well-developed subclasses of NamedThing for MaterialEntities and PlannedProcesses. Those two classes (including all of their subclasses) are implicitly disjoint with one another.

The nmdc-schema also has a DataObject class, which is implicitly disjoint from both MaterialEntities and PlannedProcesses, but it is not placed under any intermediate organizing class.

A reasonable grouping class would be Information. Information would be disjoint from MaterialEntities and PlannedProcesses, but like a MaterialEntity, Information could be either the input into a PlannedProcess or the output from one. A DataObject as output from a DataGeneration is explicitly modeled in the berkeley-schema-fy24.

WorkflowExecutions in the berkeley-schema-fy24 also have inputs and outputs, but I can't recall right now whether these relationships are populated with DataObjects. That doesn't appear to be explicitly constrained in berkeley-schema-fy24.

A common but not completely satisfying definition of Information is "anything that decreases uncertainty". For example, one doesn't know how a PlannedProcess was executed unless information is provided. Likewise, one is uncertain of the results of a PlannedProcess until Information is observed, saved, etc.

There are multiple patterns by which information can be associated with a process in a linked data model.

  1. The information values can be bound directly to the process instance
  2. They can be bound into an instance of another class with a direct relationship to the process
  3. They can be mentioned (but not bound, embedded, included etc) by linking to a file or web resource
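
The three patterns above can be sketched roughly as follows. This is a minimal illustration only: `InstrumentSettings`, `Process`, their fields, and the example values are all hypothetical names invented for the sketch, not nmdc-schema classes or slots.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InstrumentSettings:
    """Pattern 2: information held in its own instance, shareable by many processes."""
    id: str
    values: Dict[str, str]


@dataclass
class Process:
    """Hypothetical process record that can carry information in any of the three ways."""
    id: str
    # Pattern 1: values bound directly to the process instance.
    direct_values: Dict[str, str] = field(default_factory=dict)
    # Pattern 2: reference to a separate information instance.
    settings_id: Optional[str] = None
    # Pattern 3: pointer to a file or web resource; its content is not in the graph.
    result_url: Optional[str] = None


def searchable_values(process: Process,
                      settings_by_id: Dict[str, InstrumentSettings]) -> Dict[str, str]:
    """Return the values that can be queried without fetching an external resource."""
    found = dict(process.direct_values)
    if process.settings_id is not None:
        found.update(settings_by_id[process.settings_id].values)
    return found


# Made-up example data.
settings = InstrumentSettings(id="settings-1", values={"ionization_mode": "positive"})
settings_by_id = {settings.id: settings}

p1 = Process(id="proc-1", direct_values={"ionization_mode": "positive"})
p2 = Process(id="proc-2", settings_id="settings-1")
p3 = Process(id="proc-3", result_url="https://example.org/results/proc-3.csv")

for p in (p1, p2, p3):
    print(p.id, searchable_values(p, settings_by_id))
```

In this sketch, patterns 1 and 2 expose the same values to a search, while pattern 3 exposes nothing until the linked resource is fetched and parsed.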

One consideration for selecting between those patterns is whether users need the ability to search through the information, and the degree to which a constrained number of information patterns will be associated with a large number of processes. Direct search over a small, highly repeated set of information patterns is strong justification for making the information patterns first class citizens in their own table, collection, etc.

DataObjects are currently used to capture process results and follow pattern 3. The DataObjects generally (?) link to their external resource with the url slots.

The ideal modeling of Information in the nmdc-schema will take advantage of hierarchical organization and will use a minimal number of relationship patterns.

turbomam commented 2 months ago

I assume that several people will want to have input into the implementation of this issue. I would like one primary contact person. Could that be @kheal ?

turbomam commented 2 months ago

Provide some tools for interrogating OBI (or some subset) with a large-context LLM. This could intrinsically be expensive.

input token limits via API:

Using these models through their APIs requires more coding than using them through their web interfaces, but they offer more traceability and repeatability.

Qualitatively, I feel like Claude gives better results than Gemini 1.5, but it is more expensive and harder to set up.

BBOP staff are provided with funding for ChatGPT

See also https://artificialanalysis.ai/

turbomam commented 2 months ago
wc -w obi.owl

431 137 obi.owl.txt
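
A rough way to relate that word count to an API input token limit is sketched below. It assumes the `wc -w` output reads as 431,137 words. The tokens-per-word ratio and the limits passed in are placeholder parameters, not measured values or any vendor's published limits; OWL/XML markup may tokenize far less efficiently than plain English.

```python
import math

# Assumption: "431 137" in the wc output means 431,137 words.
OBI_WORDS = 431_137


def estimate_tokens(word_count: int, tokens_per_word: float) -> int:
    """Estimate a token count from a word count and an assumed ratio."""
    return math.ceil(word_count * tokens_per_word)


def fits_in_context(word_count: int, tokens_per_word: float, input_token_limit: int) -> bool:
    """True if the estimated token count is within the given input limit."""
    return estimate_tokens(word_count, tokens_per_word) <= input_token_limit


# Hypothetical ratio and limits, for illustration only.
for limit in (200_000, 1_000_000):
    print(limit, fits_in_context(OBI_WORDS, 1.5, limit))
```

An exact count would require running the model vendor's own tokenizer over the file.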

turbomam commented 2 months ago

Would https://curategpt.io/ be helpful?

what OBI classes can be used to model the settings applied to analytical instruments in general? do not include classes that model one specific instrument.

[Two screenshots attached: 2024-04-29 at 10:59:29 PM and 11:01:29 PM]

turbomam commented 2 months ago

see also

turbomam commented 2 months ago

I am especially interested in linking slots, like url

Where is it allowed to be used?

https://microbiomedata.github.io/nmdc-schema/url/

| Name | Description | Modifies Slot |
| --- | --- | --- |
| DataObject | An object that primarily consists of symbols that represent information | no |
| ImageValue | An attribute value representing an image | no |
| Protocol | | no |

Where has it been used in practice?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s nmdc:url ?o .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?ot ?odt

| st | ot | odt | count |
| --- | --- | --- | --- |
| nmdc:DataObject | | xsd:string | 175976 |
| nmdc:ImageValue | | xsd:string | 7 |

turbomam commented 2 months ago

There are also websites and homepage_website slots

There's a UrlValue class in the nmdc-schema but not in the berkeley-schema-fy24

turbomam commented 2 months ago
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?st ?p ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        ?p rdfs:subPropertyOf* nmdc:websites .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?p ?ot ?odt

homepage_website does not appear to be used in the nmdc-graph-2024-04-11 GraphDB repository

| st | p | ot | odt | count |
| --- | --- | --- | --- | --- |
| nmdc:Study | nmdc:websites | | xsd:string | 35 |

turbomam commented 2 months ago

maybe make websites a subproperty of url
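
To illustrate what that change would mean for queries that use `rdfs:subPropertyOf*`, here is a small in-memory sketch of the reflexive, transitive closure. The hierarchy below encodes the proposal (websites under url) plus an assumed homepage_website under websites. The triples are made up and the whole thing is illustrative, not a statement about the current schema.

```python
def super_properties(prop, sub_property_of):
    """Reflexive, transitive closure of subPropertyOf, starting at prop."""
    seen = {prop}
    stack = [prop]
    while stack:
        current = stack.pop()
        for parent in sub_property_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def match(triples, target, sub_property_of):
    """Mimic `?s ?p ?o . ?p rdfs:subPropertyOf* <target>` over in-memory triples."""
    return [t for t in triples if target in super_properties(t[1], sub_property_of)]


# Hypothetical hierarchy: child property -> list of parent properties.
hierarchy = {
    "nmdc:homepage_website": ["nmdc:websites"],  # assumed
    "nmdc:websites": ["nmdc:url"],               # the proposal
}

# Made-up triples.
triples = [
    ("ex:study1", "nmdc:websites", "https://example.org/study1"),
    ("ex:dobj1", "nmdc:url", "https://example.org/file1"),
]

print(len(match(triples, "nmdc:url", hierarchy)))       # both triples match
print(len(match(triples, "nmdc:websites", hierarchy)))  # only the Study triple matches
```

Under the proposal, a single query anchored at url would return Study websites alongside DataObject urls, which may or may not be the desired behavior.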

potential problems:

We should at least assert see_alsos

turbomam commented 2 months ago

To what degree do the DataObjects use the url slot?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?p (count(?do) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject ;
            ?p ?o .
    }
}
group by ?p
order by desc(count(?do))

| p | count |
| --- | --- |
| rdf:type | 179528 |
| nmdc:name | 179528 |
| dcterms:description | 179528 |
| nmdc:url | 175976 |
| nmdc:file_size_bytes | 172963 |
| nmdc:md5_checksum | 169777 |
| nmdc:data_object_type | 165839 |
| nmdc:type | 164546 |
| nmdc:was_generated_by | 4847 |
| nmdc:alternative_identifiers | 146 |

3,552 out of 179,528 DataObjects are missing urls
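
The same subtraction can be applied to every slot in the table. The sketch below uses only the counts reported above. One caveat: the query counts triples, so for any multivalued slot the result is a lower bound on the number of DataObjects missing that slot rather than an exact figure.

```python
TOTAL_DATA_OBJECTS = 179_528  # rdf:type count from the query results

slot_counts = {
    "nmdc:url": 175_976,
    "nmdc:file_size_bytes": 172_963,
    "nmdc:md5_checksum": 169_777,
    "nmdc:data_object_type": 165_839,
    "nmdc:type": 164_546,
    "nmdc:was_generated_by": 4_847,
    "nmdc:alternative_identifiers": 146,
}

# Total minus per-slot triple count: exact for single-valued slots, a lower bound otherwise.
missing = {slot: TOTAL_DATA_OBJECTS - n for slot, n in slot_counts.items()}

for slot, n in missing.items():
    pct = 100 * n / TOTAL_DATA_OBJECTS
    print(f"{slot}: {n} missing ({pct:.2f}%)")
```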

turbomam commented 2 months ago

Slot analysis of the DataObjects that don't assert url in nmdc-graph-2024-04-11

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
select *
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject .
    }
    minus {
        ?do nmdc:url ?url
    }
    optional {
        ?do nmdc:name ?name
    }
    optional {
        ?do dcterms:description ?description .
        bind(replace(?description, " for.*$", "") as ?description_pattern)
    }
    optional {
        ?do nmdc:file_size_bytes ?file_size_bytes
    }
    optional {
        ?do nmdc:md5_checksum ?md5_checksum
    }
    optional {
        ?do nmdc:data_object_type ?data_object_type
    }
    optional {
        ?do nmdc:type ?nmdc_type
    }
    optional {
        ?do nmdc:was_generated_by ?generator
    }
    optional {
        ?do nmdc:alternative_identifiers ?alternative_identifiers
    }
}

| description_pattern | Count |
| --- | --- |
| Assembled AGP file | 44 |
| Assembled contigs fasta | 44 |
| Assembled scaffold fasta | 44 |
| Filtered read data | 1 |
| Filtered read data stats | 1 |
| Full scan GC-MS (but not GC QExactive, which is EI-HMS) | 42 |
| High res MS with high res CID MSn (and possibly some low res MSn) | 14 |
| High res MS with high res HCD MSn | 43 |
| High res MS with high res HCD MSn and low res CID MSn | 175 |
| High res MS with low res CID MSn | 116 |
| High resolution MS spectra only | 2118 |
| Metagenome Alignment BAM file | 44 |
| Metagenome Contig Coverage Stats | 44 |
| Raw sequencer read data | 822 |
| Total Result | 3552 |

none assert any of these either
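
The description_pattern grouping above relies on stripping everything from the first " for" onward. A minimal Python equivalent of that `replace(?description, " for.*$", "")` step, followed by a tally, might look like this. The example descriptions are made up to resemble the patterns in the table and are not real records.

```python
import re
from collections import Counter


def description_pattern(description: str) -> str:
    """Drop everything from the first ' for' to the end of the string."""
    return re.sub(r" for.*$", "", description)


# Hypothetical descriptions, for illustration only.
descriptions = [
    "Raw sequencer read data for sample A",
    "Raw sequencer read data for sample B",
    "Assembled AGP file for sample A",
    "High resolution MS spectra only",
]

pattern_counts = Counter(description_pattern(d) for d in descriptions)

for pattern, count in sorted(pattern_counts.items()):
    print(pattern, count)
```

Note that the expression also truncates at words that merely begin with "for" (for example " format"), so a stricter pattern may be needed if such descriptions exist.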