Add `aliases` / `mappings` / `annotations` for GOLD platform / model CV

sujaypatil96 commented 2 weeks ago

Objective:

Store mappings recorded in this issue comment as "mappings" on InstrumentVendorEnum and InstrumentModelEnum.

Decide the LinkML construct to be used to store these mappings, i.e., whether to use aliases, mappings or annotations.

Implications:

This is required for the work being done to "migrate" the GOLD translator to make it conformant with the berkeley schema. See https://github.com/microbiomedata/nmdc-runtime/pull/656

aclum commented 2 weeks ago

MVP here should be Illumina models; more work is needed across the project for PacBio and Oxford Nanopore.

JGI DW	InstrumentVendorEnum	InstrumentModelEnum
Illumina HiSeq	illumina	hiseq
Illumina HiSeq-HO	illumina	hiseq
Illumina HiSeq-Rapid	illumina	hiseq
Illumina HiSeq-1TB	illumina	hiseq
Illumina HiSeq2500	illumina	hiseq_2500
Illumina HiSeq 2500-1TB	illumina	hiseq_2500
Illumina MiSeq	illumina	miseq
Illumina NextSeq-MO	illumina	nextseq_500
Illumina NextSeq-HO	illumina	nextseq_500
Illumina X10	illumina	hiseq_x_ten
Illumina NovaSeq	illumina	novaseq
Illumina NovaSeq SP	illumina	novaseq_6000
Illumina NovaSeq S4	illumina	novaseq_6000
Illumina NovaSeq S2	illumina	novaseq_6000

NovaSeqX 10B	need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 25B	need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley
NovaSeqX 1.5B	need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley

turbomam commented 1 week ago

Thanks for this info, @aclum. Where did you get the controlled vocabulary (so @sujay and I can revisit it in the future)?

I want to move forward quickly on this, but there are at least two problems:

- The MVP list of models is really a mixture of instrument names, instruments + kit/flowcell names, and instrument families. We need to be consistent, preferably aligning to an ontology like OBI.
- We have an impedance mismatch: berkeley-schema-fy24 represents instruments and models. GOLD seems to represent platforms, families, models and more. We need to be intentional about how we align those two spaces with different levels.

I'm sure we can work this out. @sujaypatil96 and I have been consulting the Illumina support page and OBI's representation of sequencing instruments. We'll have more insights soon.

Remember, the schema is frozen, so that may affect turnaround time!

aclum commented 5 days ago

use structured aliases https://linkml.io/linkml-model/latest/docs/structured_aliases/

aclum commented 5 days ago

The CV for GOLD comes from a query of an internal JGI database table. I updated the list above into a table that includes what each value should map to. I hope this will inform whether we should use structured aliases or add a slot to Instrument. My preference is to use the enums, since what is in GOLD is a display value and not how JGI names their individual instruments internally. @turbomam @kheal @sujaypatil96

turbomam commented 4 days ago

Thanks, @aclum. Could you share a rawer form of the instrument values from the GOLD database contents? Maybe unique values (with or without counts), and the corresponding query? I know most of us wouldn't be able to run the query, but it would be good to have a record of it.

turbomam commented 4 days ago

Example: https://gold.jgi.doe.gov/project?id=Gp0127656

But the relevant field (Sequencing Technology = "Illumina HiSeq 2500-1TB") isn't included in https://gold-ws.jgi.doe.gov/api/v1/projects?projectGoldId=Gp0127656 ?

turbomam commented 4 days ago

see also

https://github.com/microbiomedata/berkeley-schema-fy24/pull/247

turbomam commented 4 days ago

The instrument vocabulary doesn't seem to be included in GOLD CVs Excel from https://gold.jgi.doe.gov/downloads

And the Sequencing Project tab in Public Studies/Biosamples/SPs/APs/Organisms Excel doesn't seem to have a column for the values we're talking about

I guess that's why you did a database query

turbomam commented 4 days ago

A StructuredAlias solution would look something like this:

name: InstrumentModelEnum
permissible_values:
  hiseq_2500:
    meaning: OBI:0002002
    aliases:
      - Illumina HiSeq 2500
    structured_aliases:
      - literal_form: Illumina HiSeq-1TB
        alias_predicate: RELATED_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/
      - literal_form: Illumina HiSeq2500
        alias_predicate: EXACT_SYNONYM
        alias_contexts:
          - https://gold.jgi.doe.gov/

turbomam commented 4 days ago

I recommend providing mappings from the GOLD strings to both the InstrumentModelEnum and the InstrumentVendorEnum

If we used StructuredAliases like that, then the ETL application might have to iterate through all of the structured aliases in both enums to find the appropriate vendor and model values.

Or the ETL could pre-generate a data structure, in memory, in the opposite direction like

Illumina HiSeq-1TB:
  vendor_pv: illumina
  model_pv: hiseq_2500
Illumina HiSeq2500:
  vendor_pv: illumina
  model_pv: hiseq_2500
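
A minimal Python sketch of that pre-generated lookup, assuming the two enums have already been loaded into plain dicts and that both carry structured aliases in the GOLD context. The dict literals, the function names, and the lowercase `illumina` vendor value are illustrative assumptions, not the actual schema contents or the translator's code.

```python
from typing import Dict, Optional

GOLD_CONTEXT = "https://gold.jgi.doe.gov/"

# Hypothetical, hand-written stand-ins for the parsed enum definitions.
GOLD_ALIASES = [
    {"literal_form": "Illumina HiSeq-1TB", "alias_contexts": [GOLD_CONTEXT]},
    {"literal_form": "Illumina HiSeq2500", "alias_contexts": [GOLD_CONTEXT]},
]
MODEL_ENUM = {
    "name": "InstrumentModelEnum",
    "permissible_values": {"hiseq_2500": {"structured_aliases": GOLD_ALIASES}},
}
VENDOR_ENUM = {
    "name": "InstrumentVendorEnum",
    "permissible_values": {"illumina": {"structured_aliases": GOLD_ALIASES}},
}


def aliases_by_literal(enum_def: dict, context: str) -> Dict[str, str]:
    """Map each alias literal tagged with `context` to its permissible value."""
    lookup: Dict[str, str] = {}
    for pv_name, pv in enum_def["permissible_values"].items():
        for alias in (pv or {}).get("structured_aliases", []):
            if context in alias.get("alias_contexts", []):
                lookup[alias["literal_form"]] = pv_name
    return lookup


def build_reverse_index(
    vendor_enum: dict, model_enum: dict, context: str = GOLD_CONTEXT
) -> Dict[str, Dict[str, Optional[str]]]:
    """Invert both enums once so each GOLD string resolves in one lookup."""
    vendors = aliases_by_literal(vendor_enum, context)
    models = aliases_by_literal(model_enum, context)
    return {
        literal: {
            "vendor_pv": vendors.get(literal),  # None if no vendor alias exists
            "model_pv": models.get(literal),    # None if no model alias exists
        }
        for literal in sorted(set(vendors) | set(models))
    }


REVERSE_INDEX = build_reverse_index(VENDOR_ENUM, MODEL_ENUM)
```

Building the index once at startup means the ETL does a single dict lookup per GOLD string instead of rescanning both enums for every record. A GOLD string that is aliased in only one enum shows up with `None` on the other side, which makes gaps easy to spot.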

aclum commented 4 days ago

@turbomam the info is in the response body as seqMethod. The CV table is not from GOLD but from a system called data warehouse, which is why you aren't seeing it listed as a GOLD CV.

turbomam commented 4 days ago

> info is in the response body as seqMethod

Ha, I was searching for "sequence". Should have searched for the value, "Illumina HiSeq 2500-1TB"
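
For the ETL side, a small sketch of pulling that value out of a project record. It assumes the record has already been fetched and parsed from JSON, and that `seqMethod` may arrive as either a single string or a list of strings; the exact shape is not confirmed in this thread, and `SAMPLE_PROJECT` is a made-up record rather than a real API response.

```python
from typing import Any, Dict, List


def seq_methods(project: Dict[str, Any]) -> List[str]:
    """Return the seqMethod value(s) of a project record as a list of strings."""
    value = project.get("seqMethod")
    if value is None:
        return []          # field absent or null
    if isinstance(value, str):
        return [value]     # tolerate a bare string
    return [str(v) for v in value]


# Hypothetical record shaped like one element of the projects response.
SAMPLE_PROJECT = {
    "projectGoldId": "Gp0127656",
    "seqMethod": ["Illumina HiSeq 2500-1TB"],
}

METHODS = seq_methods(SAMPLE_PROJECT)
```

Each returned string is what would then be matched against the GOLD aliases on the enums.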

aclum commented 4 days ago

These are the counts for Metagenome Drafts internally. Note that GOLD sometimes does some small manipulation, and the CV table and the all-inclusive report are also not consistent with each other.

count	sdm_actual_seq_model
18	NextSeq MO
41	NextSeq HO
100	HiSeq-2500 Rapid V2
134	HiSeq-2500
153	HiSeq-2500 Rapid
370	MiSeq
1203	HiSeq-2000
1614	HiSeq-2000 1TB
2069	NovaSeq
2205	NovaSeqX
3316	HiSeq-2500 1TB
11694	NovaSeq S4

turbomam commented 4 days ago

JGI DW count	sdm_actual_seq_model
18	NextSeq MO
41	NextSeq HO
100	HiSeq-2500 Rapid V2
134	HiSeq-2500
153	HiSeq-2500 Rapid
370	MiSeq
1203	HiSeq-2000
1614	HiSeq-2000 1TB
2069	NovaSeq
2205	NovaSeqX
3316	HiSeq-2500 1TB
11694	NovaSeq S4

aclum commented 4 days ago

Update: GOLD curates a list that is different from the one in JGI DW. Will use the subset of common ones based on the JGI DW counts for structured aliases.

Illumina
Illumina GA
Illumina GAII
Illumina GAIIe
Illumina GAIIx
Illumina HiScanSQ
Illumina HiSeq
Illumina HiSeq 1000
Illumina HiSeq 1500
Illumina HiSeq 2000
Illumina HiSeq 2500
Illumina HiSeq 2500-1TB
Illumina HiSeq 2500-Rapid
Illumina HiSeq 3000
Illumina HiSeq 4000
Illumina HiSeq X Ten
Illumina iSeq 100
Illumina MiniSeq
Illumina MiSeq
Illumina NextSeq
Illumina NextSeq 500
Illumina NextSeq 550
Illumina NextSeq-HO
Illumina NextSeq-MO
Illumina NovaSeq
Illumina NovaSeq 6000
Illumina NovaSeq S2
Illumina NovaSeq S4
Illumina NovaSeq SP

microbiomedata / nmdc-schema

Add `aliases` / `mappings` / `annotations` for GOLD platform / model CV #2174