joverlee521 opened 4 months ago
I'm not sure there's a generalized way we can add this to the ingest workflow in the pathogen-repo-guide. We might just add documentation + links to the example lassa ingest.
> I'm not sure there's a generalized way we can add this to the ingest workflow in the pathogen-repo-guide. We might just add documentation + links to the example lassa ingest.
I wonder if we could pull the lassa stuff back into the guide as a segregate or subdivide rule-set in the ingest workflow that's not included in the main Snakefile. I bet by the time both lassa and oropouche have some effort invested in them, a generic version of this step could be extracted.
> I wonder if we could pull the lassa stuff back into the guide as a segregate or subdivide rule-set in the ingest workflow that's not included in the main Snakefile. I bet by the time both lassa and oropouche have some effort invested in them, a generic version of this step could be extracted.
True, it might be an alternative route to the current nextclade rules? Haven't thought through how those two might interact.
I would find this useful for Jamestown Canyon virus, which seems to be a segmented virus.
Just learning about NCBI Datasets' Genome packages, which can include the genome sequence report to link multiple GenBank records to a single Assembly record.
Would be worth exploring these Genome packages as another method for linking segmented viruses.
The benefit of using the Genome packages is they come with an assembly accession that can be the central unique identifier instead of strain names!
(Although this doesn't look promising for Lassa, the Genome package only has 2 records)
Additional conversations in Slack (1, 2) highlight the need for linking different segments of a genome to track reassortment.
Historically, we've done this by linking segments via the strain name, but strain names from NCBI are not guaranteed to be standardized and we can run into duplicates. So ingest needs to do some error checking when linking segments via strain name (paraphrasing @jameshadfield and @trvrb comments in Slack):
Here's my best attempt at a generalised approach that will work for lassa, oropouche, avian-flu. It's largely a repeat of my comments from Slack but using real data, and there are a few discussion points. (The formatting used here is just an example, feel free to change it.)
We start with the curated data¹:
{"strain": "A/Texas/37/2024", "date":"2024-03-28", "segment":"ns","length":"865", ,"sequence":"GTGA...", "genbank_accession":"PP577940"}
{"strain": "A/Texas/37/2024", "date":"2024-03-28", "segment":"pb1","length":"2316", ,"sequence":"CAAA...", "genbank_accession":"PP577941"}
{"accession": "HM470140", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"}
{"accession": "PP357048", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"}
{"accession": "PP357049", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
{"accession": "PP357050", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"}
If we're lucky we'll get the segment for free via NCBI; if we're not, we have to add it. Currently oropouche uses Nextclade for this, which lets us add the "segment" field:
{"accession": "HM470140", "strain": "BeH505764", "segment": "S", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"}
{"accession": "PP357048", "strain": "BeH505764", "segment": "L", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"}
{"accession": "PP357049", "strain": "BeH505764", "segment": "M", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
{"accession": "PP357050", "strain": "BeH505764", "segment": "S", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"}
If we don't want to do any strain name grouping, or the data's too inconsistent, we can stop at this point and export a single `results/metadata.tsv` and n `results/<seg>/sequences.fasta` files. The oropouche example will use "accession" as the canonical ID, which means we'll (probably?) have a duplicate "S" isolate in the above example. This step is essentially where avian-flu also stops, but it uses "strain" as the canonical ID and this seems to work well enough².
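For concreteness, a rough sketch of that stop-at-this-point export. The field names ("accession", "segment", "sequence") just mirror the examples above and the output layout mirrors `results/metadata.tsv` + `results/<segment>/sequences.fasta`; this isn't an existing ingest script.

```python
# Sketch only: one metadata TSV keyed by accession, plus one FASTA per segment.
import csv
import json
from pathlib import Path

def export(ndjson_path: Path, output_dir: Path) -> None:
    records = [json.loads(line) for line in ndjson_path.open() if line.strip()]
    output_dir.mkdir(parents=True, exist_ok=True)

    # Single metadata TSV; the sequence itself stays out of the metadata.
    fields = sorted({key for record in records for key in record} - {"sequence"})
    with open(output_dir / "metadata.tsv", "w", newline="") as tsv:
        writer = csv.DictWriter(tsv, fieldnames=fields, delimiter="\t", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

    # One FASTA per segment, with sequences named by accession.
    by_segment: dict[str, list[dict]] = {}
    for record in records:
        by_segment.setdefault(record["segment"], []).append(record)
    for segment, segment_records in by_segment.items():
        segment_dir = output_dir / segment
        segment_dir.mkdir(parents=True, exist_ok=True)
        with open(segment_dir / "sequences.fasta", "w") as fasta:
            for record in segment_records:
                fasta.write(f">{record['accession']}\n{record['sequence']}\n")
```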
To group the above by segment I'd propose a script which groups the rows by strain name, resolving any duplicate {strain x segment} entries by `date_updated` or `length` as desired³. For the above examples the initial grouping would be something like:
data_by_strain = {"strain": "A/Texas/37/2024", "data": [
{"segment":"ns","length":"865","bioprojects":"PRJNA1095491","sequence":"GTGA...","genbank_accession":"PP577940"},
{"segment":"pb1","length":"2316","bioprojects":"PRJNA1095491","sequence":"CAAA...","genbank_accession":"PP577941"},
# 6 more segments not shown here
]}
data_by_strain = {"strain": "BeH505764", "data": [
{"segment": "S", "accession": "HM470140", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"},
{"segment": "S", "accession": "PP357050", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"},
{"segment": "L", "accession": "PP357048", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"},
{"segment": "M", "accession": "PP357049", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
]}
~which would resolve to:~ UPDATED: skip the rest of this and see the subsequent message!
{"strain": "A/Texas/37/2024", "genbank_accession":"PP577940", "segment":"ns","length":"865","bioprojects":"PRJNA1095491", "n_segments": 8, "sequence":"GTGA..."}
{"strain": "A/Texas/37/2024", "genbank_accession":"PP577941", "segment":"pb1","length":"2316","bioprojects":"PRJNA1095491", "n_segments": 8, "sequence":"CAAA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "PP357050", "n_segments": 3, "length": "946", "segment": "S", "sequence": "AGTA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "L:PP357048", "n_segments": 3, "length": "6852", "segment": "L", "sequence": "AGTA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "PP357049", "n_segments": 3, "length": "4385", "segment": "M", "sequence": "AGTA..."}
Which can easily be expressed as 1 TSV + n FASTA files. It's essentially identical in structure to the NDJSON before we grouped by strain-name, just deduped. I think it's crucial that each segment is its own row as it has data which is segment-specific (accession, length, perhaps others such as submission date). There are a few ways we can then use this in phylo workflows as far as I can see:

- `set_final_strain_name.py` in the workflow.
- `augur` tools to select the appropriate metadata row. (Seems a non-starter to me.)
- Segment-specific column names (e.g. `accession_pb1`). This is what I initially proposed on slack, but I'm not so sold on it today. It'd require a step in the phylo workflow to select/rename the appropriate field. But maybe this isn't so bad?

¹ In avian-flu this is exactly the output from the `curate` rule. In oropouche it's the output of `curate` plus the `spike_in_strain_names` rule and expressed as NDJSON rather than TSV.
² We partition the NDJSON by segment. For any duplicated {strain x segment} entries we take the first. We then convert to TSV + FASTA and use the HA metadata TSV for all downstream analyses.
³ Notice that there's a duplicated S segment in oropouche and maximising length would choose HM470140 while selecting the most recent submission would choose PP357050. For oropouche I think there are only 14 strain names with duplicated segments, so I'd probably opt for manual resolution.
Jover and I discussed the final bit of the above thread today and converged on the following plan:
The output of the collapse/dedup script will encode segment-specific information as `{name}_{segment}`, exactly as we currently do for seasonal flu and as per my original thoughts in slack, which allows a single row per strain. For the above examples this'd be:
{"strain": "A/Texas/37/2024", "n_segments": 8, "genbank_accession_ns":"PP577940", "genbank_accession_pb1":"PP577941", "sequence_ns": "GTGA...", "sequence_ns": "CAAA..."}
{"strain": "BeH505764", "n_segments": 3, "date": "1991-XX-XX", "date_released": "2024-03-02", "accession_S": "PP357050", "accession_L": "PP357048", "accession_M": "PP357049", "sequence_S": "AGTA...", ...}
although it'd be preferable to output this as TSV + n FASTAs at this point. Samples with missing strain names can use accession (by necessity they'll be one segment only). Within the phylo workflow we can either add a rule which extracts the metadata specific to this segment (e.g. "accession_M" to a node-data JSON of "accession"), or leave it as-is which will result in the situation we currently have for seasonal-flu where each segment shows the accessions (EPI_ISLs) for all of the segments analysed.
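To make the plan concrete, here's a rough sketch of what that collapse/dedup script could look like. Field names follow the earlier examples, the set of segment-specific fields is an assumption to adjust per pathogen, and the dedup-by-most-recent-`date_updated` choice is just one of the options discussed above; this isn't an existing script.

```python
# Sketch only: one wide row per strain, segment-specific fields suffixed with
# the segment, duplicates within {strain x segment} resolved by "date_updated".
import json
from pathlib import Path

SEGMENT_SPECIFIC_FIELDS = {
    "accession", "genbank_accession", "segment", "length",
    "sequence", "date_released", "date_updated",
}

def collapse(ndjson_path: Path) -> list[dict]:
    by_strain: dict[str, dict[str, dict]] = {}
    for line in ndjson_path.open():
        record = json.loads(line)
        # Records without a strain name fall back to their accession
        # (by necessity they'll be a single segment).
        strain = record.get("strain") or record["accession"]
        segments = by_strain.setdefault(strain, {})
        segment = record["segment"]
        previous = segments.get(segment)
        if previous is None or record.get("date_updated", "") > previous.get("date_updated", ""):
            segments[segment] = record

    rows = []
    for strain, segments in by_strain.items():
        row = {"strain": strain, "n_segments": len(segments)}
        for segment, record in segments.items():
            for field, value in record.items():
                if field == "segment":
                    continue
                if field in SEGMENT_SPECIFIC_FIELDS:
                    row[f"{field}_{segment}"] = value
                else:
                    row.setdefault(field, value)  # shared fields: first segment's value wins
        rows.append(row)
    return rows
```

Writing `rows` out as a TSV plus per-segment FASTAs would then match the preferred output described above.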
I was poking around in Loculus because I was curious how they handled the segmented genome for CCHF. Their group_segments script does something very similar to what @jameshadfield proposed above:
The major distinction is they use a constructed join key instead of strain name so there's no need for deduplication.
Interesting - thanks for cross-referencing @joverlee521! If I'm reading the code correctly here's the gist of how they group sequences into potential genomes:
# CCHF config YAML
insdc_segment_specific_fields:
- ncbiUpdateDate
- insdcAccessionBase
- insdcVersion
- insdcAccessionFull
- sraRunAccession
from typing import Final

SPECIAL_FIELDS: Final = {"segment", "submissionId"}
# Everything that isn't segment-specific or special counts as "shared" metadata
shared_fields = set(all_fields) - set(insdc_segment_specific_fields) - SPECIAL_FIELDS
group_key = tuple((field, values[field]) for field in shared_fields if values[field])
so this grouping key (a tuple of tuples) is seemingly all non-segment-specific metadata, including strain. This would leave the door open to multiple genomes with the same strain name as long as some metadata field differed, and I'd think it's better for us to group on strain name alone and then validate each group / dedup within each group. That would also avoid the need for the constructed join key.
> I'd think it's better for us to group on strain name alone and then validate each group / dedup within each group.
Hmm, how would we handle records that have the same strain name but slightly different metadata? E.g. same strain name but different dates and/or geolocation data?
> That would also avoid the need for the constructed join key.
Yeah, the stable join key is necessary for a database but not really necessary for our ingest workflow that runs through all of the data.
> Hmm, how would we handle records that have the same strain name but slightly different metadata? E.g. same strain name but different dates and/or geolocation data?
With difficulty! Depending on the frequency, perhaps by ignoring / erroring and requiring a manually curated YAML. I don't think there's any magic way to do this but all the alternatives are worse in my opinion. In such an example Loculus would treat them as separate genomes (each without a full complement of segments); phylo analyses could therefore rely on the grouping key for tanglegram-like analyses but not the strain name. For our purposes we want to rely on the strain name and thus need to deal with any tricky cases in ingest. If the number of such occurrences in the data is too overwhelming we could do something like strainName = {strainName}_${segmentAccession}.
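A rough sketch of what that validation could look like. The `annotations.yaml` overrides file and its structure ({strain: {field: value}}) are hypothetical, as is the choice of segment-specific fields to ignore when comparing.

```python
# Sketch only: error on shared-metadata conflicts within a strain group unless
# a manually curated override exists for that strain and field.
import yaml

SEGMENT_SPECIFIC_FIELDS = {"accession", "segment", "length", "sequence", "date_released", "date_updated"}

def validate_group(strain: str, records: list[dict], overrides: dict[str, dict]) -> None:
    shared_fields = {field for record in records for field in record} - SEGMENT_SPECIFIC_FIELDS
    for field in shared_fields:
        values = {record[field] for record in records if record.get(field)}
        if len(values) <= 1:
            continue
        if field in overrides.get(strain, {}):
            continue  # the curated value wins downstream, so the conflict is acceptable
        raise ValueError(
            f"{strain}: segments disagree on {field!r} ({sorted(values)}); "
            "add an entry to annotations.yaml or fix the strain name"
        )

# overrides = yaml.safe_load(open("annotations.yaml"))  # e.g. {"BeH505764": {"date": "1991-XX-XX"}}
```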
Related to https://github.com/nextstrain/private/issues/102 and https://github.com/nextstrain/pathogen-repo-guide/issues/50
Writing out current methods for ingesting segmented viruses. These are slightly different from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.
Avian flu example
The `segment` and `strain` fields are pretty standardized for the recent H5N1 outbreak, so we are able to directly use that metadata to match segments of the same metadata record. An `n_segments` column tracks how many segments are linked to the metadata record. There is no "merging" of segment metadata, we just use the metadata for the HA segment.

This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique `strain`.
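A minimal sketch of that strain-based linking, where the field names follow the curated examples earlier in this thread and the keep-the-HA-row behaviour is a simplification rather than the actual avian-flu rule:

```python
# Sketch only: one metadata row per strain (the HA record) plus an n_segments count.
def link_by_strain(records: list[dict]) -> list[dict]:
    segments_by_strain: dict[str, dict[str, dict]] = {}
    for record in records:
        # First record wins for any duplicated {strain x segment} pair.
        segments_by_strain.setdefault(record["strain"], {}).setdefault(record["segment"], record)

    metadata = []
    for strain, segments in segments_by_strain.items():
        ha_record = segments.get("ha")
        if ha_record is None:
            continue  # strains with no HA record get no metadata row in this sketch
        row = {key: value for key, value in ha_record.items() if key != "sequence"}
        row["n_segments"] = len(segments)
        metadata.append(row)
    return metadata
```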
Lassa example
I had originally thought we could replicate the avian flu ingest in lassa, but the lack of standardized segment and strain fields makes this difficult. @j23414 has implemented an alternative method in https://github.com/nextstrain/lassa/pull/12.
1. All records go through a single `datasets download` and curation pipeline.
2. `nextclade run` aligns records to L and S reference sequences to separate the segment sequences.
3. `augur filter` subsets the metadata by segment sequences using `accession` as the metadata id column.

This results in 3 metadata TSVs + 3 FASTAs. The metadata/sequences with all records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique `accession`.

I toyed with the idea of creating strain names for lassa, but the lack of data makes it difficult to follow our usual pattern of `<location>/<sample_id>/<year>`. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments.

Other viruses?
I think we got lucky with the H5N1 data. It is highly likely that other segmented viruses have less standardized data, like lassa. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example.