Closed sujaypatil96 closed 2 days ago
MVP here should be Illumina models; more work is needed across the project for PacBio and Oxford Nanopore.

JGI DW | InstrumentVendorEnum | InstrumentModelEnum | Notes |
---|---|---|---|
Illumina HiSeq | illumina | hiseq | |
Illumina HiSeq-HO | illumina | hiseq | |
Illumina HiSeq-Rapid | illumina | hiseq | |
Illumina HiSeq-1TB | illumina | hiseq | |
Illumina HiSeq2500 | illumina | hiseq_2500 | |
Illumina HiSeq 2500-1TB | illumina | hiseq_2500 | |
Illumina MiSeq | illumina | miseq | |
Illumina NextSeq-MO | illumina | nextseq_500 | |
Illumina NextSeq-HO | illumina | nextseq_500 | |
Illumina X10 | illumina | hiseq_x_ten | |
Illumina NovaSeq | illumina | novaseq | |
Illumina NovaSeq SP | illumina | novaseq_6000 | |
Illumina NovaSeq S4 | illumina | novaseq_6000 | |
Illumina NovaSeq S2 | illumina | novaseq_6000 | |
NovaSeqX 10B | | | need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley |
NovaSeqX 25B | | | need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley |
NovaSeqX 1.5B | | | need to update enum for novaseqX, not in OBI yet. Can be done post-berkeley |
Thanks for this info, @aclum. Where did you get the controlled vocabulary (so @sujay and I can revisit it in the future)?
I want to move forward quickly on this, but there are at least two problems:
I'm sure we can work this out. @sujaypatil96 and I have been consulting the Illumina support page and OBI's representation of sequencing instruments. We'll have more insights soon.
Remember, the schema is frozen, so that may affect turnaround time!
use structured aliases https://linkml.io/linkml-model/latest/docs/structured_aliases/
The CV for GOLD comes from a query of an internal JGI database table. I reformatted this as a table that includes what each value should map to. I hope this will inform whether we should use structured aliases or add a slot to Instrument. My preference is to use the enums, since what is in GOLD is a display value and not how JGI names their individual instruments internally. @turbomam @kheal @sujaypatil96
Thanks @aclum . Could you share a rawer form of the instrument values from the GOLD database contents? Maybe unique values (with or without counts). And the corresponding query? I know most of us wouldn't be able to run the query, but it would be good to have a record of it.
Example: https://gold.jgi.doe.gov/project?id=Gp0127656
But the relevant field (Sequencing Technology = "Illumina HiSeq 2500-1TB") isn't included in https://gold-ws.jgi.doe.gov/api/v1/projects?projectGoldId=Gp0127656 ?
The instrument vocabulary doesn't seem to be included in GOLD CVs Excel from https://gold.jgi.doe.gov/downloads
And the Sequencing Project tab in Public Studies/Biosamples/SPs/APs/Organisms Excel doesn't seem to have a column for the values we're talking about
I guess that's why you did a database query
A `StructuredAlias` solution would look something like this:

    name: InstrumentModelEnum
    permissible_values:
      hiseq_2500:
        meaning: OBI:0002002
        aliases:
          - Illumina HiSeq 2500
        structured_aliases:
          - literal_form: Illumina HiSeq-1TB
            alias_contexts:
              - https://gold.jgi.doe.gov/
            alias_predicate: RELATED_SYNONYM
          - literal_form: Illumina HiSeq2500
            alias_predicate: EXACT_SYNONYM
            alias_contexts:
              - https://gold.jgi.doe.gov/

I recommend providing mappings from the GOLD strings to both the InstrumentModelEnum and the InstrumentVendorEnum
If we used `StructuredAlias`es like that, then the ETL application might have to iterate through all of the structured aliases in both enums to find the appropriate vendor and model values.
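To make the cost of that iteration concrete, here is a minimal sketch of the linear scan. It assumes the enum has already been loaded into a plain Python dict that mirrors the YAML sketch above; the constant `INSTRUMENT_MODEL_ENUM`, the helper `find_permissible_value` and the choice to match on the GOLD context URL are all hypothetical, not anything in nmdc-runtime or LinkML.

```python
from typing import Optional

GOLD_CONTEXT = "https://gold.jgi.doe.gov/"

# Hypothetical in-memory copy of the enum sketched in YAML above.
# How the ETL would actually load the schema is out of scope here.
INSTRUMENT_MODEL_ENUM = {
    "name": "InstrumentModelEnum",
    "permissible_values": {
        "hiseq_2500": {
            "meaning": "OBI:0002002",
            "aliases": ["Illumina HiSeq 2500"],
            "structured_aliases": [
                {
                    "literal_form": "Illumina HiSeq-1TB",
                    "alias_predicate": "RELATED_SYNONYM",
                    "alias_contexts": [GOLD_CONTEXT],
                },
                {
                    "literal_form": "Illumina HiSeq2500",
                    "alias_predicate": "EXACT_SYNONYM",
                    "alias_contexts": [GOLD_CONTEXT],
                },
            ],
        },
    },
}


def find_permissible_value(
    enum_def: dict, gold_string: str, context: str = GOLD_CONTEXT
) -> Optional[str]:
    """Scan every structured alias of every permissible value for a match."""
    for pv_name, pv in enum_def.get("permissible_values", {}).items():
        for alias in (pv or {}).get("structured_aliases", []):
            same_text = alias.get("literal_form") == gold_string
            same_context = context in alias.get("alias_contexts", [])
            if same_text and same_context:
                return pv_name
    return None  # no alias in this enum matches the GOLD string


model = find_permissible_value(INSTRUMENT_MODEL_ENUM, "Illumina HiSeq2500")
```

The same call would have to be repeated against the vendor enum for every GOLD record, which is the overhead this comment is pointing at.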
Or the ETL could pre-generate a data structure, in memory, in the opposite direction, like:

    Illumina HiSeq-1TB:
      vendor_pv: Illumina
      model_pv: hiseq_2500
    Illumina HiSeq2500:
      vendor_pv: Illumina
      model_pv: hiseq_2500

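One way the pre-generation step could look is sketched below. The two enum dicts are hypothetical stand-ins (in particular, the structured aliases on the vendor enum are an assumption, and the lowercase `illumina` follows the mapping table at the top of this issue), and `build_reverse_index` is an illustrative name rather than an existing function.

```python
from typing import Dict

# Hypothetical in-memory enums; only the fields needed for the index are shown.
VENDOR_ENUM = {
    "name": "InstrumentVendorEnum",
    "permissible_values": {
        "illumina": {
            "structured_aliases": [
                {"literal_form": "Illumina HiSeq-1TB"},
                {"literal_form": "Illumina HiSeq2500"},
            ],
        },
    },
}

MODEL_ENUM = {
    "name": "InstrumentModelEnum",
    "permissible_values": {
        "hiseq_2500": {
            "structured_aliases": [
                {"literal_form": "Illumina HiSeq-1TB"},
                {"literal_form": "Illumina HiSeq2500"},
            ],
        },
    },
}


def build_reverse_index(vendor_enum: dict, model_enum: dict) -> Dict[str, Dict[str, str]]:
    """Invert both enums once: GOLD string -> {"vendor_pv": ..., "model_pv": ...}."""
    index: Dict[str, Dict[str, str]] = {}
    for field, enum_def in (("vendor_pv", vendor_enum), ("model_pv", model_enum)):
        for pv_name, pv in enum_def["permissible_values"].items():
            for alias in pv.get("structured_aliases", []):
                index.setdefault(alias["literal_form"], {})[field] = pv_name
    return index


# Built once at ETL start-up; each record then needs a single dict lookup.
GOLD_TO_PVS = build_reverse_index(VENDOR_ENUM, MODEL_ENUM)
pvs = GOLD_TO_PVS.get("Illumina HiSeq2500")
```

A GOLD string that is missing from the index comes back as `None` from `.get`, so the ETL can decide explicitly whether to skip, log or fail on unmapped instruments.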
@turbomam the info is in the response body as seqMethod. The CV table is not from GOLD but from a system called Data Warehouse, which is why you aren't seeing it listed as a GOLD CV.
> info is in the response body as seqMethod
Ha, I was searching for "sequence". Should have searched for the value, "Illumina HiSeq 2500-1TB"
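For the record, a small sketch of how the unique `seqMethod` values (with counts) could be pulled out of API responses once they are in hand. The payload shape is an assumption: the sample below is made up to look like a list of project objects, and the code accepts `seqMethod` as either a string or a list because the thread does not say which it is. Only the field name `seqMethod`, the project ID and the value "Illumina HiSeq 2500-1TB" come from this thread.

```python
import json
from collections import Counter
from typing import Iterable

# Made-up sample body; the real response may be shaped differently.
SAMPLE_BODY = """
[
  {"projectGoldId": "Gp0127656", "seqMethod": ["Illumina HiSeq 2500-1TB"]},
  {"projectGoldId": "Gp-hypothetical-1", "seqMethod": "Illumina HiSeq 2500-1TB"},
  {"projectGoldId": "Gp-hypothetical-2"}
]
"""


def count_seq_methods(projects: Iterable[dict]) -> Counter:
    """Count seqMethod strings across projects, tolerating str, list or missing."""
    counts: Counter = Counter()
    for project in projects:
        value = project.get("seqMethod")
        if value is None:
            continue  # project has no sequencing method recorded
        values = [value] if isinstance(value, str) else list(value)
        counts.update(values)
    return counts


seq_method_counts = count_seq_methods(json.loads(SAMPLE_BODY))
```

Running something like this over all projects in a study would give the "unique values with counts" view requested earlier, at least for what the API exposes.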
These are the counts for Metagenome Drafts internally. Note that GOLD sometimes does some small manipulation of the values, and the CV table and the all-inclusive report are also not internally consistent with each other.
JGI DW

count | sdm_actual_seq_model |
---|---|
18 | NextSeq MO |
41 | NextSeq HO |
100 | HiSeq-2500 Rapid V2 |
134 | HiSeq-2500 |
153 | HiSeq-2500 Rapid |
370 | MiSeq |
1203 | HiSeq-2000 |
1614 | HiSeq-2000 1TB |
2069 | NovaSeq |
2205 | NovaSeqX |
3316 | HiSeq-2500 1TB |
11694 | NovaSeq S4 |
Update: GOLD curates a list that is different than JGI DW. Will use the subset of common ones based on the JGI DW counts for structured aliases.

- Illumina
- Illumina GA
- Illumina GAII
- Illumina GAIIe
- Illumina GAIIx
- Illumina HiScanSQ
- Illumina HiSeq
- Illumina HiSeq 1000
- Illumina HiSeq 1500
- Illumina HiSeq 2000
- Illumina HiSeq 2500
- Illumina HiSeq 2500-1TB
- Illumina HiSeq 2500-Rapid
- Illumina HiSeq 3000
- Illumina HiSeq 4000
- Illumina HiSeq X Ten
- Illumina iSeq 100
- Illumina MiniSeq
- Illumina MiSeq
- Illumina NextSeq
- Illumina NextSeq 500
- Illumina NextSeq 550
- Illumina NextSeq-HO
- Illumina NextSeq-MO
- Illumina NovaSeq
- Illumina NovaSeq 6000
- Illumina NovaSeq S2
- Illumina NovaSeq S4
- Illumina NovaSeq SP
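As a rough illustration of picking a subset from the JGI DW counts, the sketch below keeps models whose share of the total meets a cutoff. It reads "common" as "frequently occurring", which is only one interpretation of the update above, and the 5% cutoff and the name `models_above_share` are arbitrary assumptions. The counts are copied from the JGI DW table earlier in this issue.

```python
from typing import Dict, List

# Counts for Metagenome Drafts, copied from the JGI DW table in this issue.
JGI_DW_COUNTS: Dict[str, int] = {
    "NextSeq MO": 18,
    "NextSeq HO": 41,
    "HiSeq-2500 Rapid V2": 100,
    "HiSeq-2500": 134,
    "HiSeq-2500 Rapid": 153,
    "MiSeq": 370,
    "HiSeq-2000": 1203,
    "HiSeq-2000 1TB": 1614,
    "NovaSeq": 2069,
    "NovaSeqX": 2205,
    "HiSeq-2500 1TB": 3316,
    "NovaSeq S4": 11694,
}


def models_above_share(counts: Dict[str, int], min_share: float) -> List[str]:
    """Return models whose share of all records is at least min_share, most frequent first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [model for model, n in ranked if n / total >= min_share]


# 0.05 is an arbitrary illustrative cutoff, not a project decision.
frequent_models = models_above_share(JGI_DW_COUNTS, min_share=0.05)
```

Whatever cutoff is chosen, the JGI DW names still have to be matched to their GOLD display strings before they can be recorded as structured aliases.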
Objective:
Store mappings recorded in this issue comment as "mappings" on InstrumentVendorEnum and InstrumentModelEnum.
Decide the LinkML construct to be used to store these mappings, i.e., whether to use `aliases`, `mappings` or `annotations`.
Implications:
This is required for the work being done to "migrate" the GOLD translator to make it conformant with the berkeley schema. See https://github.com/microbiomedata/nmdc-runtime/pull/656