Open turbomam opened 2 months ago
The nmdc-schema has well-developed subclasses of NamedThing for MaterialEntities and PlannedProcesses. Those two classes (including all of their subclasses) are implicitly disjoint with one another.
The nmdc-schema also has a DataObject class, which is implicitly disjoint from both MaterialEntities and PlannedProcesses, but it is not placed under any intermediate organizing class.
A reasonable grouping class would be `Information`. Information would be disjoint from MaterialEntities and PlannedProcesses, but like MaterialEntities, Information could be either the input into a PlannedProcess or the output from one. A DataObject as output from a DataGeneration is explicitly modeled in the berkeley-schema-fy24.
WorkflowExecutions in the berkeley-schema-fy24 also have inputs and outputs, but I can't recall right now whether these relationships are populated with DataObjects. That doesn't appear to be explicitly constrained in berkeley-schema-fy24.
A common but not completely satisfying definition of Information is "anything that decreases uncertainty". For example, one doesn't know how a PlannedProcess was executed unless information is provided. Likewise, one is uncertain of the results of a PlannedProcess until Information is observed, saved, etc.
There are multiple patterns by which information can be associated with a process in a linked data model.
One consideration for selecting between those patterns is whether users need the ability to search through the information, and the degree to which a constrained number of information patterns will be associated with a large number of processes. Direct search over a small, highly repeated set of information patterns is strong justification for making the information patterns first-class citizens in their own table, collection, etc.
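A minimal sketch of what "first-class citizens in their own collection" could look like, assuming processes point at shared information records by identifier rather than embedding them. The class names, the `processes_using` helper, and all identifiers and descriptions below are hypothetical and are not taken from the nmdc-schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class InformationRecord:
    """A reusable piece of information, stored once in its own collection."""
    id: str
    description: str


@dataclass
class ProcessRecord:
    """A process that references information by id instead of embedding it."""
    id: str
    information_ids: List[str] = field(default_factory=list)


def processes_using(information_id: str, processes: List[ProcessRecord]) -> List[str]:
    """Return the ids of all processes that reference one information record."""
    return [p.id for p in processes if information_id in p.information_ids]


# Hypothetical data: two information records shared across three processes.
information: Dict[str, InformationRecord] = {
    "info:1": InformationRecord("info:1", "instrument settings A"),
    "info:2": InformationRecord("info:2", "instrument settings B"),
}
processes = [
    ProcessRecord("proc:1", ["info:1"]),
    ProcessRecord("proc:2", ["info:1"]),
    ProcessRecord("proc:3", ["info:2"]),
]

print(processes_using("info:1", processes))
```

Because each information record exists once, searching by information is a lookup over a small collection, and the highly repeated content is not duplicated in every process.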
DataObjects are currently used to capture process results and follow pattern 3. The DataObjects generally (?) link to their external resource with the `url` slot.
The ideal modeling of Information in the nmdc-schema will take advantage of hierarchical organization and will use a minimal number of relationship patterns.
I assume that several people will want to have input into the implementation of this issue. I would like one primary contact person. Could that be @kheal ?
Provide some tools for interrogating OBI (or some subset) with a large-context LLM. This could intrinsically be expensive.
input token limits via API:
Using these models through their APIs requires more coding than using them through their web interfaces, but they offer more traceability and repeatability.
Qualitatively, I feel like Claude gives better results than Gemini 1.5, but it is more expensive and harder to set up.
BBOP staff are provided with funding for ChatGPT
See also https://artificialanalysis.ai/
wc -w obi.owl
431 137 obi.owl.txt
Would https://curategpt.io/ be helpful?
what OBI classes can be used to model the settings applied to analytical instruments in general? do not include classes that model one specific instrument.
see also
I am especially interested in linking slots, like url
Where is it allowed to be used?
https://microbiomedata.github.io/nmdc-schema/url/
Name | Description | Modifies Slot |
---|---|---|
DataObject | An object that primarily consists of symbols that represent information | no |
ImageValue | An attribute value representing an image | no |
Protocol | | no |
Where has it been used in practice?
PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
graph <https://api.microbiomedata.org> {
?s nmdc:url ?o .
optional {
?s a ?st
}
optional {
?o a ?ot
}
BIND (IF(isIRI(?o), "IRI",
IF(isLiteral(?o), str(datatype(?o)), "Unknown"))
AS ?odt)
}
}
group by ?st ?ot ?odt
st | ot | odt | count |
---|---|---|---|
nmdc:DataObject | | xsd:string | 175976 |
nmdc:ImageValue | | xsd:string | 7 |
There are also `websites` and `homepage_website` slots
There's a UrlValue class in the nmdc-schema but not in the berkeley-schema-fy24
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?st ?p ?ot ?odt (count(?s) as ?count)
where {
graph <https://api.microbiomedata.org> {
?s ?p ?o .
?p rdfs:subPropertyOf* nmdc:websites .
optional {
?s a ?st
}
optional {
?o a ?ot
}
BIND (IF(isIRI(?o), "IRI",
IF(isLiteral(?o), str(datatype(?o)), "Unknown"))
AS ?odt)
}
}
group by ?st ?p ?ot ?odt
`homepage_website` does not appear to be used in the nmdc-graph-2024-04-11 GraphDB repository
st | p | ot | odt | count |
---|---|---|---|---|
nmdc:Study | nmdc:websites | | xsd:string | 35 |
maybe make `websites` a subproperty of `url`
potential problems:
- `url` is single valued and `websites` is multi-valued
- `websites` has a pattern constraint

We should at least assert `see_also`s
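The two potential problems can be expressed as a simple compatibility check. This is only a sketch: `SlotSpec` and `subproperty_concerns` are hypothetical helpers, not part of LinkML or the nmdc-schema, and the pattern string is a made-up placeholder rather than the real constraint on `websites`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SlotSpec:
    """The slot features that matter for this comparison (hypothetical model)."""
    name: str
    multivalued: bool
    pattern: Optional[str] = None


def subproperty_concerns(child: SlotSpec, parent: SlotSpec) -> List[str]:
    """List the differences to review before making child a subproperty of parent."""
    concerns = []
    if child.multivalued and not parent.multivalued:
        concerns.append(
            f"{child.name} is multi-valued but {parent.name} is single-valued"
        )
    if child.pattern != parent.pattern:
        concerns.append(
            f"{child.name} and {parent.name} have different pattern constraints"
        )
    return concerns


# Cardinalities follow the discussion above; the pattern is a placeholder.
url = SlotSpec("url", multivalued=False)
websites = SlotSpec("websites", multivalued=True, pattern=r"^https?://.*$")

for concern in subproperty_concerns(websites, url):
    print(concern)
```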
To what degree do the `DataObject`s use the `url` slot?
PREFIX nmdc: <https://w3id.org/nmdc/>
select
?p (count(?do) as ?count)
where {
graph <https://api.microbiomedata.org> {
?do a nmdc:DataObject ;
?p ?o .
}
}
group by ?p
order by desc(count(?do))
p | count |
---|---|
rdf:type | 179528 |
nmdc:name | 179528 |
dcterms:description | 179528 |
nmdc:url | 175976 |
nmdc:file_size_bytes | 172963 |
nmdc:md5_checksum | 169777 |
nmdc:data_object_type | 165839 |
nmdc:type | 164546 |
nmdc:was_generated_by | 4847 |
nmdc:alternative_identifiers | 146 |
3,552 out of 179,528 `DataObject`s are missing `url`s
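The 3,552 figure is the difference between the `rdf:type` and `nmdc:url` rows of the table above. A small sketch of that arithmetic follows, under the assumption that a slot is asserted at most once per `DataObject`, so that its triple count equals the number of `DataObject`s carrying it. That should hold for the single-valued `url`, but not necessarily for a slot like `alternative_identifiers`. The `missing_count` helper is hypothetical.

```python
# Triple counts per predicate for nmdc:DataObject subjects, copied from the table above.
predicate_counts = {
    "rdf:type": 179528,
    "nmdc:name": 179528,
    "dcterms:description": 179528,
    "nmdc:url": 175976,
    "nmdc:file_size_bytes": 172963,
    "nmdc:md5_checksum": 169777,
    "nmdc:data_object_type": 165839,
    "nmdc:type": 164546,
    "nmdc:was_generated_by": 4847,
    "nmdc:alternative_identifiers": 146,
}

# Every DataObject has exactly one rdf:type triple in this result, so use it as the total.
total_data_objects = predicate_counts["rdf:type"]


def missing_count(predicate: str) -> int:
    """DataObjects lacking the predicate, assuming at most one value per object."""
    return total_data_objects - predicate_counts[predicate]


print(missing_count("nmdc:url"))
```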
Slot analysis of the `DataObject`s that don't assert `url` in nmdc-graph-2024-04-11
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
select *
where {
graph <https://api.microbiomedata.org> {
?do a nmdc:DataObject .
}
minus {
?do nmdc:url ?url
}
optional {
?do nmdc:name ?name
}
optional {
?do dcterms:description ?description .
bind(replace(?description, " for.*$", "") as ?description_pattern)
}
optional {
?do nmdc:file_size_bytes ?file_size_bytes
}
optional {
?do nmdc:md5_checksum ?md5_checksum
}
optional {
?do nmdc:data_object_type ?data_object_type
}
optional {
?do nmdc:type ?nmdc_type
}
optional {
?do nmdc:was_generated_by ?generator
}
optional {
?do nmdc:alternative_identifiers ?alternative_identifiers
}
}
description_pattern | Count |
---|---|
Assembled AGP file | 44 |
Assembled contigs fasta | 44 |
Assembled scaffold fasta | 44 |
Filtered read data | 1 |
Filtered read data stats | 1 |
Full scan GC-MS (but not GC QExactive, which is EI-HMS) | 42 |
High res MS with high res CID MSn (and possibly some low res MSn) | 14 |
High res MS with high res HCD MSn | 43 |
High res MS with high res HCD MSn and low res CID MSn | 175 |
High res MS with low res CID MSn | 116 |
High resolution MS spectra only | 2118 |
Metagenome Alignment BAM file | 44 |
Metagenome Contig Coverage Stats | 44 |
Raw sequencer read data | 822 |
Total Result | 3552 |
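The `description_pattern` column comes from stripping everything from the first " for" onward, as the `replace(?description, " for.*$", "")` call in the query does. The same step can be sketched in Python, followed by a tally like the one in the table. The sample descriptions and their identifiers are invented for illustration and are not real NMDC records.

```python
import re
from collections import Counter


def description_pattern(description: str) -> str:
    """Drop the first ' for' and everything after it, mirroring the SPARQL replace()."""
    return re.sub(r" for.*$", "", description)


# Hypothetical descriptions; the suffixes after "for" are invented identifiers.
descriptions = [
    "Raw sequencer read data for nmdc:example-1",
    "Raw sequencer read data for nmdc:example-2",
    "Assembled AGP file for nmdc:example-3",
    "High resolution MS spectra only",
]

pattern_counts = Counter(description_pattern(d) for d in descriptions)

for pattern, count in sorted(pattern_counts.items()):
    print(f"{pattern} | {count} |")
```

Note that the regular expression also matches words that merely begin with "for" (such as "format"), so a description containing such a word before its identifier would be cut short.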
none assert any of these either
- `Information` class as a direct subclass of `NamedThing`
- `DataObject` a direct subclass of `Information`.
- `Configuration` class as a direct subclass of `NamedThing`
- `Calibration` class as a subclass (possibly indirect) of `Information`.
- `Calibration` to other classes `has_input` or `has_output`
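The proposed hierarchy can be sketched with plain Python classes. The nmdc-schema itself is not written this way, so this only illustrates the subclass relationships and the implicit disjointness discussed above. Class names follow this issue (in singular form); `DISJOINT_BRANCHES` and `branch_of` are hypothetical helpers.

```python
from typing import Optional, Type


class NamedThing:
    """Root of the hierarchy."""


class MaterialEntity(NamedThing):
    pass


class PlannedProcess(NamedThing):
    pass


class Information(NamedThing):
    """Proposed grouping class, disjoint from MaterialEntity and PlannedProcess."""


class DataObject(Information):
    pass


class Calibration(Information):
    pass


class Configuration(NamedThing):
    """Listed above as a direct subclass of NamedThing."""


# Branches treated as mutually disjoint in this sketch.
DISJOINT_BRANCHES = (MaterialEntity, PlannedProcess, Information)


def branch_of(cls: Type[NamedThing]) -> Optional[Type[NamedThing]]:
    """Return the single disjoint branch a class falls under, or None."""
    matches = [branch for branch in DISJOINT_BRANCHES if issubclass(cls, branch)]
    if len(matches) > 1:
        raise TypeError(f"{cls.__name__} violates disjointness: {matches}")
    return matches[0] if matches else None


print(branch_of(DataObject).__name__)
```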