microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Add `aliases` / `mappings` / `annotations` for GOLD platform / model CV #2174

Closed sujaypatil96 closed 2 days ago

sujaypatil96 commented 2 weeks ago

Objective:

Store mappings recorded in this issue comment as "mappings" on InstrumentVendorEnum and InstrumentModelEnum.

Decide the LinkML construct to be used to store these mappings, i.e., whether to use aliases, mappings or annotations.


Implications:

This is required for the work being done to "migrate" the GOLD translator to make it conformant with the berkeley schema. See https://github.com/microbiomedata/nmdc-runtime/pull/656

aclum commented 2 weeks ago
The MVP here should be Illumina models; more work is needed across the project for PacBio and Oxford Nanopore.

| JGI DW | InstrumentVendorEnum | InstrumentModelEnum |
|---|---|---|
| Illumina HiSeq | illumina | hiseq |
| Illumina HiSeq-HO | illumina | hiseq |
| Illumina HiSeq-Rapid | illumina | hiseq |
| Illumina HiSeq-1TB | illumina | hiseq |
| Illumina HiSeq2500 | illumina | hiseq_2500 |
| Illumina HiSeq 2500-1TB | illumina | hiseq_2500 |
| Illumina MiSeq | illumina | miseq |
| Illumina NextSeq-MO | illumina | nextseq_500 |
| Illumina NextSeq-HO | illumina | nextseq_500 |
| Illumina X10 | illumina | hiseq_x_ten |
| Illumina NovaSeq | illumina | novaseq |
| Illumina NovaSeq SP | illumina | novaseq_6000 |
| Illumina NovaSeq S4 | illumina | novaseq_6000 |
| Illumina NovaSeq S2 | illumina | novaseq_6000 |

- NovaSeqX 10B: need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley.
- NovaSeqX 25B: need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley.
- NovaSeqX 1.5B: need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley.

turbomam commented 1 week ago

Thanks for this info, @aclum. Where did you get the controlled vocabulary (so @sujay and I can revisit it in the future)?

I want to move forward quickly on this, but there are at least two problems:

  1. the MVP list of models is really a mixture of instrument names, instruments + kit/flowcell names and instrument families. We need to be consistent, preferably aligning to an ontology like OBI
  2. we have an impedance mismatch: berkeley-schema-fy24 represents instruments and models. GOLD seems to represent platforms, families, models and more. We need to be intentional about how we align those two spaces with different levels.

I'm sure we can work this out. @sujaypatil96 and I have been consulting the Illumina support page and OBI's representation of sequencing instruments. We'll have more insights soon.

Remember, the schema is frozen, so that may affect turnaround time!

aclum commented 5 days ago

use structured aliases https://linkml.io/linkml-model/latest/docs/structured_aliases/

aclum commented 5 days ago

The CV for GOLD comes from a query of an internal JGI database table. I updated my earlier comment into a table that includes what each value should map to. I hope this will inform whether we should use structured aliases or add a slot to Instrument. My preference is to use the enums, since what is in GOLD is a display value and not how JGI names their individual instruments internally. @turbomam @kheal @sujaypatil96

turbomam commented 4 days ago

Thanks @aclum . Could you share a rawer form of the instrument values from the GOLD database contents? Maybe unique values (with or without counts). And the corresponding query? I know most of us wouldn't be able to run the query, but it would be good to have a record of it.

turbomam commented 4 days ago

Example: https://gold.jgi.doe.gov/project?id=Gp0127656

But the relevant field (Sequencing Technology = "Illumina HiSeq 2500-1TB") isn't included in https://gold-ws.jgi.doe.gov/api/v1/projects?projectGoldId=Gp0127656 ?

turbomam commented 4 days ago

see also

turbomam commented 4 days ago

The instrument vocabulary doesn't seem to be included in GOLD CVs Excel from https://gold.jgi.doe.gov/downloads

And the Sequencing Project tab in Public Studies/Biosamples/SPs/APs/Organisms Excel doesn't seem to have a column for the values we're talking about

I guess that's why you did a database query

turbomam commented 4 days ago

A StructuredAlias solution would look something like this:

name: InstrumentModelEnum
permissible_values:
  hiseq_2500:
    meaning: OBI:0002002
    aliases:
      - Illumina HiSeq 2500
    structured_aliases:
      - literal_form: Illumina HiSeq-1TB
        alias_predicate: RELATED_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/
      - literal_form: Illumina HiSeq2500
        alias_predicate: EXACT_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/

turbomam commented 4 days ago

I recommend providing mappings from the GOLD strings to both the InstrumentModelEnum and the InstrumentVendorEnum

If we used StructuredAliases like that, then the ETL application might have to iterate through all of the structured aliases in both enums to find the appropriate vendor and model values.

Or the ETL could pre-generate a data structure, in memory, in the opposite direction like

Illumina HiSeq-1TB:
  vendor_pv: Illumina
  model_pv: hiseq_2500
Illumina HiSeq2500:
  vendor_pv: Illumina
  model_pv: hiseq_2500

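As a rough sketch of that pre-generation step, the Python below inverts the structured aliases of both enums into one lookup keyed by the GOLD string. Everything here is an assumption for illustration: the enums are plain dicts rather than loaded LinkML objects, `index_aliases` and `build_gold_lookup` are hypothetical helper names, the vendor aliases are invented to mirror the model aliases, and the vendor value is written as `illumina` to match the mapping table earlier in the thread.

```python
from typing import Dict, Optional

# Hypothetical in-memory form of the two enums. Only the hiseq_2500 aliases
# come from the example in this thread; the vendor aliases are assumed.
MODEL_ENUM = {
    "hiseq_2500": {
        "structured_aliases": [
            {"literal_form": "Illumina HiSeq-1TB"},
            {"literal_form": "Illumina HiSeq2500"},
        ]
    }
}
VENDOR_ENUM = {
    "illumina": {
        "structured_aliases": [
            {"literal_form": "Illumina HiSeq-1TB"},
            {"literal_form": "Illumina HiSeq2500"},
        ]
    }
}


def index_aliases(enum: Dict[str, dict]) -> Dict[str, str]:
    """Map each structured-alias literal_form to its permissible value name."""
    index: Dict[str, str] = {}
    for pv_name, pv in enum.items():
        for alias in pv.get("structured_aliases", []):
            literal = alias["literal_form"]
            # The same GOLD string pointing at two values would be ambiguous.
            if index.get(literal, pv_name) != pv_name:
                raise ValueError(f"{literal!r} maps to more than one value")
            index[literal] = pv_name
    return index


def build_gold_lookup(
    vendor_enum: Dict[str, dict], model_enum: Dict[str, dict]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Combine both per-enum indexes into one GOLD-string -> values lookup."""
    vendors = index_aliases(vendor_enum)
    models = index_aliases(model_enum)
    return {
        literal: {
            "vendor_pv": vendors.get(literal),
            "model_pv": models.get(literal),
        }
        for literal in sorted(set(vendors) | set(models))
    }


GOLD_LOOKUP = build_gold_lookup(VENDOR_ENUM, MODEL_ENUM)
```

Building the index once means the ETL does a single dict lookup per GOLD record instead of scanning every structured alias in both enums each time.
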
aclum commented 4 days ago

@turbomam the info is in the response body as seqMethod. The CV table is not from GOLD but from a system called the data warehouse (JGI DW), which is why you aren't seeing it listed as a GOLD CV.

turbomam commented 4 days ago

> info is in the response body as seqMethod

Ha, I was searching for "sequence". Should have searched for the value, "Illumina HiSeq 2500-1TB"
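
For the record, a minimal sketch of pulling that field out of an already-decoded response. The response shape is an assumption: it is treated as a list of project objects in which `seqMethod` holds either a single string or a list of strings, and `projectGoldId` as a field name in the sample record is likewise assumed. `extract_seq_methods` is a hypothetical helper, and no request is made to the API here.

```python
from typing import Any, Dict, Iterable, List


def extract_seq_methods(projects: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect distinct seqMethod strings, preserving first-seen order."""
    seen: List[str] = []
    for project in projects:
        value = project.get("seqMethod")
        if value is None:
            continue  # record has no sequencing method
        # Accept either a bare string or a list of strings (assumed shapes).
        values = [value] if isinstance(value, str) else list(value)
        for method in values:
            if method not in seen:
                seen.append(method)
    return seen


# Sample payload built by hand from the value discussed above.
sample_response = [
    {"projectGoldId": "Gp0127656", "seqMethod": ["Illumina HiSeq 2500-1TB"]},
]
methods = extract_seq_methods(sample_response)
```
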

aclum commented 4 days ago

These are the counts for Metagenome Drafts internally. Note that GOLD sometimes does some small manipulation, and there is also no internal consistency between the CV table and what is in the all-inclusive report.

| count | sdm_actual_seq_model |
|---|---|
| 18 | NextSeq MO |
| 41 | NextSeq HO |
| 100 | HiSeq-2500 Rapid V2 |
| 134 | HiSeq-2500 |
| 153 | HiSeq-2500 Rapid |
| 370 | MiSeq |
| 1203 | HiSeq-2000 |
| 1614 | HiSeq-2000 1TB |
| 2069 | NovaSeq |
| 2205 | NovaSeqX |
| 3316 | HiSeq-2500 1TB |
| 11694 | NovaSeq S4 |

turbomam commented 4 days ago

| JGI DW count | sdm_actual_seq_model |
|---|---|
| 18 | NextSeq MO |
| 41 | NextSeq HO |
| 100 | HiSeq-2500 Rapid V2 |
| 134 | HiSeq-2500 |
| 153 | HiSeq-2500 Rapid |
| 370 | MiSeq |
| 1203 | HiSeq-2000 |
| 1614 | HiSeq-2000 1TB |
| 2069 | NovaSeq |
| 2205 | NovaSeqX |
| 3316 | HiSeq-2500 1TB |
| 11694 | NovaSeq S4 |

aclum commented 4 days ago

Update: GOLD curates a list that is different than JGI DW. Will use the subset of common ones based on the JGI DW counts for structured aliases.

- Illumina
- Illumina GA
- Illumina GAII
- Illumina GAIIe
- Illumina GAIIx
- Illumina HiScanSQ
- Illumina HiSeq
- Illumina HiSeq 1000
- Illumina HiSeq 1500
- Illumina HiSeq 2000
- Illumina HiSeq 2500
- Illumina HiSeq 2500-1TB
- Illumina HiSeq 2500-Rapid
- Illumina HiSeq 3000
- Illumina HiSeq 4000
- Illumina HiSeq X Ten
- Illumina iSeq 100
- Illumina MiniSeq
- Illumina MiSeq
- Illumina NextSeq
- Illumina NextSeq 500
- Illumina NextSeq 550
- Illumina NextSeq-HO
- Illumina NextSeq-MO
- Illumina NovaSeq
- Illumina NovaSeq 6000
- Illumina NovaSeq S2
- Illumina NovaSeq S4
- Illumina NovaSeq SP
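
One way to make "the common ones" concrete is to rank the JGI DW models by count and keep adding them until a target share of records is covered. The sketch below uses the counts posted earlier in this thread; the 90% threshold and the helper name `most_common_models` are assumptions for illustration, not a decision made here, and the JGI DW names would still need to be matched to GOLD strings by hand.

```python
from typing import Dict, List

# Metagenome Draft counts per sdm_actual_seq_model, as posted above.
DW_COUNTS: Dict[str, int] = {
    "NextSeq MO": 18,
    "NextSeq HO": 41,
    "HiSeq-2500 Rapid V2": 100,
    "HiSeq-2500": 134,
    "HiSeq-2500 Rapid": 153,
    "MiSeq": 370,
    "HiSeq-2000": 1203,
    "HiSeq-2000 1TB": 1614,
    "NovaSeq": 2069,
    "NovaSeqX": 2205,
    "HiSeq-2500 1TB": 3316,
    "NovaSeq S4": 11694,
}


def most_common_models(counts: Dict[str, int], coverage: float) -> List[str]:
    """Return models in descending count order until `coverage` is reached."""
    total = sum(counts.values())
    selected: List[str] = []
    running = 0
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break  # target share of records already covered
        selected.append(name)
        running += n
    return selected


TOTAL_RECORDS = sum(DW_COUNTS.values())
PRIORITY_MODELS = most_common_models(DW_COUNTS, coverage=0.9)  # assumed threshold
```
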