nextstrain / pathogen-repo-guide


ingest: How to handle segmented viruses #59

Open joverlee521 opened 1 month ago

joverlee521 commented 1 month ago

Related to https://github.com/nextstrain/private/issues/102 and https://github.com/nextstrain/pathogen-repo-guide/issues/50

Writing out current methods for ingesting segmented viruses. These differ slightly from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.

Avian flu example

The segment and strain fields are pretty standardized for the recent H5N1 outbreak, so we can use that metadata directly to match segments belonging to the same isolate.

  1. Pull segment/strain name from NCBI Virus.
  2. All data is processed through the usual curation pipeline.
  3. Then split into segment metadata + sequences.
  4. Finally, loop through the segment metadata to add an n_segments column tracking how many segments are linked to the metadata record. There is no "merging" of segment metadata; we just use the metadata for the HA segment.

This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique strain.
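A minimal sketch of the n_segments step, assuming the curated data is available as per-segment records (Python dicts); the strain/segment values are illustrative:

```python
from collections import defaultdict

def add_n_segments(records):
    """Count how many distinct segments share each strain name and
    annotate every record with that count (no merging of metadata)."""
    segments_by_strain = defaultdict(set)
    for rec in records:
        segments_by_strain[rec["strain"]].add(rec["segment"])
    return [
        {**rec, "n_segments": len(segments_by_strain[rec["strain"]])}
        for rec in records
    ]

records = [
    {"strain": "A/Texas/37/2024", "segment": "ha"},
    {"strain": "A/Texas/37/2024", "segment": "na"},
    {"strain": "A/Oregon/1/2024", "segment": "ha"},
]
annotated = add_n_segments(records)
```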

Lassa example

I had originally thought we could replicate the avian flu ingest in lassa, but the lack of standardized segment and strain fields makes this difficult. @j23414 has implemented an alternative method in https://github.com/nextstrain/lassa/pull/12.

  1. Workflow uses the usual datasets download and curation pipeline.
  2. Use nextclade run to align records to L and S reference sequences to separate the segment sequences.
  3. Use augur filter to subset the metadata by segment sequences, using accession as the metadata id column.
  4. The original metadata + sequences files that contain all records are kept as a record of samples that failed to align to either L or S.

This results in 3 metadata TSVs + 3 FASTAs. The metadata/sequences with all records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique accession.
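For illustration, the per-segment subsetting is roughly equivalent to the following plain-Python sketch of what augur filter does here (not the actual implementation; function and variable names are made up):

```python
import csv
import io

def subset_metadata_by_fasta(metadata_tsv, fasta_text, id_column="accession"):
    """Keep only metadata rows whose id appears in a segment FASTA,
    mirroring subsetting metadata by a per-segment sequence file."""
    # Collect sequence ids from FASTA headers (">" up to first whitespace).
    ids = {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    }
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [row for row in reader if row[id_column] in ids]

metadata = "accession\tstrain\nHM470140\tBeH505764\nPP357048\tBeH505764\n"
l_fasta = ">PP357048\nAGTA\n"
l_rows = subset_metadata_by_fasta(metadata, l_fasta)
```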

I toyed with the idea of creating strain names for lassa, but the lack of data makes it difficult to follow our usual pattern of <location>/<sample_id>/<year>. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments.

Other viruses?

I think we got lucky with the H5N1 data. Other segmented viruses are highly likely to have less standardized data, like lassa. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example.

joverlee521 commented 1 month ago

I'm not sure there's a generalized way we can add this to the ingest workflow in the pathogen-repo-guide. We might just add documentation + links to the example lassa ingest.

genehack commented 1 month ago

I'm not sure there's a generalized way we can add this to the ingest workflow in the pathogen-repo-guide. We might just add documentation + links to the example lassa ingest.

I wonder if we could pull the lassa stuff back into the guide as a segregate or subdivide rule-set in the ingest workflow, that's not included in the main Snakefile. I bet by the time both lassa and oropouche have some effort invested in them, a generic version of this step could be extracted.

joverlee521 commented 1 month ago

I wonder if we could pull the lassa stuff back into the guide as a segregate or subdivide rule-set in the ingest workflow, that's not included in the main Snakefile. I bet by the time both lassa and oropouche have some effort invested in them, a generic version of this step could be extracted.

True, it might be an alternative route to the current nextclade rules? Haven't thought through how those two might interact.

BryanTegomoh commented 1 month ago

I would find this useful for Jamestown Canyon virus, which seems to be a segmented virus.

joverlee521 commented 1 month ago

Just learning about NCBI Datasets' Genome packages, which can include a genome sequence report linking multiple GenBank records to a single Assembly record.

Would be worth exploring these Genome packages as another method for linking segmented viruses.

The benefit of using the Genome packages is they come with an assembly accession that can be the central unique identifier instead of strain names!

(Although this doesn't look promising for Lassa: the Genome package only has 2 records.)
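If the sequence report turns out to be usable, the linking might look roughly like this. This is only a sketch: the report format (JSON Lines) and the field names are assumptions for illustration, not a verified NCBI Datasets schema:

```python
import json
from collections import defaultdict

def link_by_assembly(report_lines):
    """Group GenBank accessions by their assembly accession, using a
    genome sequence report. Field names here are hypothetical."""
    by_assembly = defaultdict(list)
    for line in report_lines:
        rec = json.loads(line)
        by_assembly[rec["assemblyAccession"]].append(rec["genbankAccession"])
    return dict(by_assembly)

report = [
    '{"assemblyAccession": "GCA_000000000.1", "genbankAccession": "AB000001.1"}',
    '{"assemblyAccession": "GCA_000000000.1", "genbankAccession": "AB000002.1"}',
]
linked = link_by_assembly(report)
```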

joverlee521 commented 1 month ago

Additional conversations in Slack (1, 2) highlight the need for linking different segments of a genome to track reassortment.

Historically, we've done this by linking segments via the strain name, but strain names from NCBI are not guaranteed to be standardized and we can run into duplicates. So ingest needs to do some error checking when linking segments via strain name (paraphrasing @jameshadfield's and @trvrb's comments in Slack).

jameshadfield commented 1 month ago

Here's my best attempt at a generalised approach that will work for lassa, oropouche, avian-flu. It's largely a repeat of my comments from Slack but using real data, and there are a few discussion points. (The formatting used here is just an example, feel free to change it.)

We start with the curated data¹:

{"strain": "A/Texas/37/2024", "date":"2024-03-28", "segment":"ns", "length":"865", "sequence":"GTGA...", "genbank_accession":"PP577940"}
{"strain": "A/Texas/37/2024", "date":"2024-03-28", "segment":"pb1", "length":"2316", "sequence":"CAAA...", "genbank_accession":"PP577941"}
{"accession": "HM470140", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"}
{"accession": "PP357048", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"}
{"accession": "PP357049", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
{"accession": "PP357050", "strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"}

If we're lucky we'll get the segment for free via NCBI; if we're not, we have to add it. Currently oropouche uses Nextclade for this, which lets us add the "segment" field:

{"accession": "HM470140", "strain": "BeH505764", "segment": "S", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"}
{"accession": "PP357048", "strain": "BeH505764", "segment": "L", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"}
{"accession": "PP357049", "strain": "BeH505764", "segment": "M", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
{"accession": "PP357050", "strain": "BeH505764", "segment": "S", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"}

If we don't want to do any strain name grouping, or the data's too inconsistent, we can stop at this point and export a single results/metadata.tsv and n results/<seg>/sequences.fasta files. The oropouche example will use "accession" as the canonical ID, which means we'll (probably?) have a duplicate "S" isolate in the above example. This step is essentially where avian-flu also stops, but it uses "strain" as the canonical ID and this seems to work well enough².

To group the above I'd propose a script which groups the rows by strain name and then resolves each group into per-segment rows.

For the above examples the initial grouping would be something like:

data_by_strain = {"strain": "A/Texas/37/2024", "data": [
    {"segment":"ns","length":"865","bioprojects":"PRJNA1095491","sequence":"GTGA...","genbank_accession":"PP577940"},
    {"segment":"pb1","length":"2316","bioprojects":"PRJNA1095491","sequence":"CAAA...","genbank_accession":"PP577941"},
    # 6 more segments not shown here
]}
data_by_strain = {"strain": "BeH505764", "data": [
    {"segment": "S", "accession": "HM470140", "date": "1991-XX-XX", "date_released": "2011-06-30", "date_updated": "2016-07-25", "length": "693"},
    {"segment": "S", "accession": "PP357050", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "946"},
    {"segment": "L", "accession": "PP357048", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "6852"},
    {"segment": "M", "accession": "PP357049", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "length": "4385"}
]}

~which would resolve to:~ UPDATED: skip the rest of this and see the subsequent message!


{"strain": "A/Texas/37/2024", "genbank_accession":"PP577940", "segment":"ns","length":"865","bioprojects":"PRJNA1095491", "n_segments": 8, "sequence":"GTGA..."}
{"strain": "A/Texas/37/2024", "genbank_accession":"PP577941", "segment":"pb1","length":"2316","bioprojects":"PRJNA1095491", "n_segments": 8, "sequence":"CAAA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "PP357050", "n_segments": 3, "length": "946", "segment": "S", "sequence": "AGTA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "PP357048", "n_segments": 3, "length": "6852", "segment": "L", "sequence": "AGTA..."}
{"strain": "BeH505764", "date": "1991-XX-XX", "date_released": "2024-03-02", "date_updated": "2024-03-02", "accession": "PP357049", "n_segments": 3, "length": "4385", "segment": "M", "sequence": "AGTA..."}

Which can easily be expressed as 1 TSV + n FASTA files. It's essentially identical in structure to the NDJSON before we grouped by strain name, just deduped. I think it's crucial that each segment is its own row as it has data which is segment-specific (accession, length, perhaps others such as submission date). There are a few ways we can then use this in phylo workflows as far as I can see:

  1. Use one metadata TSV per segment (my least preferred, but perhaps the simplest)
  2. Use 'accession' in the FASTA header. This'll work but requires something akin to set_final_strain_name.py in the workflow.
  3. Use 'strain' in the FASTA header and have a phylo rule which splits the metadata file into one TSV per segment. Like (1) but only needs one file on S3. My preferred choice.
  4. Use 'strain' in the FASTA header and expose the ability for augur tools to select the appropriate metadata row. (Seems a non starter to me.)
  5. Use one TSV row per strain, and encode segment-dependent data in some way (e.g. accession_pb1). This is what I initially proposed on slack, but I'm not so sold on it today. It'd require a step in the phylo workflow to select/rename the appropriate field. But maybe this isn't so bad?
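Option 3's split rule could be sketched like this, assuming a long-format metadata TSV with a segment column (function name and layout are illustrative):

```python
import csv
import io
from collections import defaultdict

def split_metadata_by_segment(metadata_tsv):
    """Split one long-format metadata TSV (one row per segment) into
    one TSV string per segment value."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    fieldnames = reader.fieldnames
    rows_by_segment = defaultdict(list)
    for row in reader:
        rows_by_segment[row["segment"]].append(row)

    outputs = {}
    for segment, rows in rows_by_segment.items():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
        outputs[segment] = buf.getvalue()
    return outputs

metadata = (
    "strain\tsegment\taccession\n"
    "BeH505764\tL\tPP357048\n"
    "BeH505764\tM\tPP357049\n"
)
per_segment = split_metadata_by_segment(metadata)
```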

¹ In avian-flu this is exactly the output from the curate rule. In oropouche it's the output of curate plus the spike_in_strain_names rule and expressed as NDJSON rather than TSV.

² We partition the NDJSON by segment. For any duplicated {strain x segment} entries we take the first. We then convert to TSV + FASTA and use the HA metadata TSV for all downstream analyses.

³ Notice that there's a duplicated S segment in oropouche and maximising length would choose HM470140 while selecting the most recent submission would choose PP357050. For oropouche I think there are only 14 strain names with duplicated segments, so I'd probably opt for manual resolution.
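One of the dedup policies from the footnotes (keeping the most recently released record per {strain x segment}; the alternative would be maximising length) could be sketched as follows, with field names following the examples above:

```python
def dedup_by_release_date(records):
    """For duplicated {strain x segment} entries, keep the record with
    the latest date_released (ISO dates compare correctly as strings)."""
    best = {}
    for rec in records:
        key = (rec["strain"], rec["segment"])
        if key not in best or rec["date_released"] > best[key]["date_released"]:
            best[key] = rec
    return list(best.values())

records = [
    {"strain": "BeH505764", "segment": "S", "accession": "HM470140", "date_released": "2011-06-30"},
    {"strain": "BeH505764", "segment": "S", "accession": "PP357050", "date_released": "2024-03-02"},
    {"strain": "BeH505764", "segment": "L", "accession": "PP357048", "date_released": "2024-03-02"},
]
deduped = dedup_by_release_date(records)
```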

jameshadfield commented 4 weeks ago

Jover and I discussed the final bit of the above thread today and converged on the following plan:

The output of the collapse/dedup script will encode segment-specific information as {name}_{segment}, exactly as we currently do for seasonal flu and as per my original thoughts in slack, which allows a single row per strain. For the above examples this'd be:

{"strain": "A/Texas/37/2024", "n_segments": 8, "genbank_accession_ns":"PP577940", "genbank_accession_pb1":"PP577941", "sequence_ns": "GTGA...", "sequence_pb1": "CAAA..."}
{"strain": "BeH505764", "n_segments": 3, "date": "1991-XX-XX", "date_released": "2024-03-02", "accession_S": "PP357050", "accession_L": "PP357048", "accession_M": "PP357049", "sequence_S": "AGTA...", ...}

although it'd be preferable to output this as TSV + n FASTAs at this point. Samples with missing strain names can use accession (by necessity they'll be one segment only). Within the phylo workflow we can either add a rule which extracts the metadata specific to this segment (e.g. "accession_M" to a node-data JSON of "accession"), or leave it as-is which will result in the situation we currently have for seasonal-flu where each segment shows the accessions (EPI_ISLs) for all of the segments analysed.
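The collapse/dedup step could be sketched as follows. Which fields count as segment-specific is an assumption here, and the function name is illustrative:

```python
from collections import defaultdict

# Assumed set of segment-specific fields; in a real workflow this would
# come from per-pathogen config.
SEGMENT_SPECIFIC = {"accession", "length"}

def collapse(records):
    """Collapse per-segment records into one row per strain, renaming
    segment-specific fields to {field}_{segment}."""
    by_strain = defaultdict(list)
    for rec in records:
        by_strain[rec["strain"]].append(rec)

    rows = []
    for strain, recs in by_strain.items():
        row = {"strain": strain, "n_segments": len(recs)}
        for rec in recs:
            seg = rec["segment"]
            for field, value in rec.items():
                if field in SEGMENT_SPECIFIC:
                    row[f"{field}_{seg}"] = value
                elif field not in ("strain", "segment"):
                    row.setdefault(field, value)  # shared fields: keep first
        rows.append(row)
    return rows

records = [
    {"strain": "BeH505764", "segment": "L", "accession": "PP357048", "date": "1991-XX-XX"},
    {"strain": "BeH505764", "segment": "M", "accession": "PP357049", "date": "1991-XX-XX"},
]
collapsed = collapse(records)
```

This assumes duplicated {strain x segment} entries were already resolved upstream, so n_segments is simply the number of records per strain.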

joverlee521 commented 2 weeks ago

I was poking around in Loculus because I was curious how they handled the segmented genome for CCHF. Their group_segments script does something very similar to what @jameshadfield proposed above:

  1. Group segments by metadata
  2. Validate each group has unique segments.

The major distinction is they use a constructed join key instead of strain name so there's no need for deduplication.
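The validation in step 2 might look like the following sketch (illustrative, not Loculus's actual code):

```python
from collections import Counter

def duplicated_segments(group):
    """Return the segments that appear more than once within one
    grouped set of records, so the group can be flagged as invalid."""
    counts = Counter(rec["segment"] for rec in group)
    return sorted(seg for seg, n in counts.items() if n > 1)

group = [
    {"accession": "HM470140", "segment": "S"},
    {"accession": "PP357050", "segment": "S"},
    {"accession": "PP357048", "segment": "L"},
]
dupes = duplicated_segments(group)
```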

jameshadfield commented 2 weeks ago

Interesting - thanks for cross-referencing @joverlee521! If I'm reading the code correctly here's the gist of how they group sequences into potential genomes:

# CCHF config YAML
insdc_segment_specific_fields:
  - ncbiUpdateDate
  - insdcAccessionBase
  - insdcVersion
  - insdcAccessionFull
  - sraRunAccession

# grouping logic (Python, paraphrased)
SPECIAL_FIELDS: Final = {"segment", "submissionId"}
shared_fields = set(all_fields) - set(insdc_segment_specific_fields) - SPECIAL_FIELDS
group_key = tuple((field, values[field]) for field in shared_fields if values[field])

so this grouping key (a tuple of tuples) is seemingly all non-segment-specific metadata, including strain. This would leave the door open to multiple genomes sharing a strain name as long as some metadata field differed; I'd think it's better for us to group on strain name alone and then validate / dedup within each group. That would also avoid the need for the constructed join key.

joverlee521 commented 2 weeks ago

I'd think it's better for us to group on strain name alone and then validate each group / dedup within each group.

Hmm, how would we handle records that have the same strain name but slightly different metadata? E.g. same strain name but different dates and/or geolocation data?

That would also avoid the need for the constructed join key.

Yeah, the stable join key is necessary for a database but not really necessary for our ingest workflow that runs through all of the data.

jameshadfield commented 2 weeks ago

Hmm, how would we handle records that have the same strain name but slightly different metadata? E.g. same strain name but different dates and/or geolocation data?

With difficulty! Depending on the frequency, perhaps by ignoring / erroring and requiring a manually curated YAML. I don't think there's any magic way to do this, but all the alternatives are worse in my opinion. In such an example Loculus would treat them as separate genomes (each without a full complement of segments); phylo analyses could therefore rely on the grouping key for tanglegram-like analyses but not the strain name. For our purposes we want to rely on the strain name and thus need to deal with any tricky cases in ingest. If the number of such occurrences in the data is too overwhelming we could do something like strainName = {strainName}_{segmentAccession}.
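That last-resort fallback could be sketched as follows (a minimal sketch; the function name is illustrative, and it assumes the conflicting records have already been identified):

```python
def disambiguate(conflicting_records):
    """Apply strainName = {strainName}_{segmentAccession} to records
    that could not be resolved into a single genome."""
    return [
        {**rec, "strain": f"{rec['strain']}_{rec['accession']}"}
        for rec in conflicting_records
    ]

resolved = disambiguate([
    {"strain": "BeH505764", "accession": "PP357050"},
    {"strain": "BeH505764", "accession": "HM470140"},
])
```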