theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaCoV] Update kraken2 viral database to fix weird RSV-B nomenclature #520

Open cimendes opened 3 months ago

cimendes commented 3 months ago

:bug:

:pencil: Describe the Issue

Copied from #436

> set 2 new default Strings for `kraken_target_organism`: `String rsv_a_kraken_target_organism = "Respiratory syncytial virus"` and `String rsv_b_kraken_target_organism = "Human orthopneumovirus"`
>
> ⚠️ NOTE: this was done due to the old kraken2 database used in this container `us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv` that is used in the `kraken2_theiacov` WDL task. We can revisit updating this task in the future. It's pretty central to a lot of workflows so it would be a big change to update the database with up-to-date NCBI taxonomy IDs embedded within the kraken2 database.
>
> NCBI taxonomy has been updated so that Human orthopneumovirus is equivalent to human respiratory syncytial virus: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=11250&lvl=3&lin=f&keep=1&srchmode=1&unlock

:repeat: How to Reproduce

:fishing_pole_and_fish: Expected Behavior

:floppy_disk: Version Information

:information_source: Additional Information

jrotieno commented 2 months ago

I think we should also consider renaming `String? kraken_target_organism`, since it holds the proportion for the target organism rather than the organism name, e.g. to `String? kraken_target_organism_prop`.
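To illustrate why the value is a proportion rather than a name, here is a minimal Python sketch of pulling the percentage for a target organism out of a Kraken2 report. It assumes the standard six-column, tab-delimited report layout (percentage, clade reads, taxon reads, rank code, taxid, indented name). The function name `target_organism_percent` and the toy report lines are made up for illustration and are not taken from the PHB task, which may compute this differently.

```python
from typing import Optional


def target_organism_percent(report_text: str, target_name: str) -> Optional[float]:
    """Return the percentage of reads in the clade named target_name.

    Assumes a standard six-column Kraken2 report. Returns None when the
    name is absent, e.g. because the database labels the taxon differently.
    """
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Column 1 is the clade percentage; column 6 is the indented name.
        if fields[5].strip() == target_name:
            return float(fields[0])
    return None


# Toy report with invented read counts (illustration only).
REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t50\tR\t1\troot",
    " 85.00\t850\t850\tS\t11250\t  Human orthopneumovirus",
])

print(target_organism_percent(REPORT, "Human orthopneumovirus"))
print(target_organism_percent(REPORT, "Respiratory syncytial virus"))
```

The second lookup returns `None` because matching is by exact name, which is the kind of mismatch that made the database-specific default strings necessary.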

jrotieno commented 2 months ago

I ran the standalone Kraken2_PE_PHB workflow here: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/74a5a714-c180-4baa-b8d0-c0b266c8dccc

I used the Kraken2 database `gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_08gb_20240112.tar.gz`.

Observing some interesting results: RSV-A reads are preferentially classified under the species Orthopneumovirus hominis and the sub-species (S1) Human orthopneumovirus, whereas RSV-B reads are preferentially classified under the species Orthopneumovirus bovis and the sub-species (S1) Bovine orthopneumovirus and (S2) Respiratory syncytial virus.
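For anyone wanting to check this on their own reports, a rough Python sketch that lists the species-level rows (rank codes `S`, `S1`, `S2`, ...) of a Kraken2 report, largest clade first. It assumes the standard six-column, tab-delimited report. The helper name `species_level_rows`, the read counts, and the taxids in the toy report are all invented placeholders, not values from the run above.

```python
from typing import List, NamedTuple


class TaxonRow(NamedTuple):
    percent: float
    clade_reads: int
    rank: str
    name: str


def species_level_rows(report_text: str) -> List[TaxonRow]:
    """Collect rows whose rank code starts with 'S', sorted by clade reads."""
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rank = fields[3].strip()
        if rank.startswith("S"):  # S = species, S1/S2 = ranks below species
            rows.append(
                TaxonRow(float(fields[0]), int(fields[1]), rank, fields[5].strip())
            )
    return sorted(rows, key=lambda row: row.clade_reads, reverse=True)


# Toy report: counts and taxids (1-5) are placeholders for illustration.
TOY_REPORT = "\n".join([
    " 60.00\t600\t100\tS\t1\tOrthopneumovirus bovis",
    " 50.00\t500\t100\tS1\t2\t  Bovine orthopneumovirus",
    " 40.00\t400\t400\tS2\t3\t    Respiratory syncytial virus",
    " 30.00\t300\t0\tS\t4\tOrthopneumovirus hominis",
    " 30.00\t300\t300\tS1\t5\t  Human orthopneumovirus",
])

for row in species_level_rows(TOY_REPORT):
    print(f"{row.rank:<3} {row.percent:6.2f}% {row.clade_reads:6d}  {row.name}")
```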

Some literature, from https://www.microbiologyresearch.org/content/journal/jgv/10.1099/0022-1317-79-12-2939: based on the G gene, the most diverse gene in RSV, a comparison between HRSV and BRSV revealed similarities of 38–41% at the nucleotide level and 27–32% at the amino acid level. Based on the F gene, which is also diverse but more conserved than the G gene, the similarities between HRSV and BRSV were approximately 67–71%. The other genes are likely to be even more similar than the F gene.

jrotieno commented 2 months ago

Pending discussions on whether to update the Docker image that includes an embedded database, or to keep the Docker image separate from the database.
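To make the two options concrete, a small Python sketch of how a task could support both: use a database directory baked into the image by default, or unpack a separately supplied tarball at runtime. This is only an illustration of the trade-off, not the PHB implementation. `EMBEDDED_DB_DIR`, `resolve_db_dir`, and `kraken2_command` are hypothetical names, and the embedded path is an assumption. The only `kraken2` flags used are `--db`, `--paired`, and `--report`.

```python
import tarfile
from pathlib import Path
from typing import List, Optional

# Assumed location of a database baked into the image (hypothetical).
EMBEDDED_DB_DIR = Path("/kraken2-database")


def resolve_db_dir(
    external_tarball: Optional[Path],
    workdir: Path,
    embedded_dir: Path = EMBEDDED_DB_DIR,
) -> Path:
    """Pick the database directory for a run.

    With no tarball, fall back to the directory embedded in the image.
    Otherwise extract the .tar.gz into workdir and use that, so the
    database can be updated without rebuilding the image.
    """
    if external_tarball is None:
        return embedded_dir
    db_dir = workdir / "kraken2_db"
    db_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(external_tarball, "r:gz") as archive:
        archive.extractall(db_dir)
    return db_dir


def kraken2_command(db_dir: Path, read1: Path, read2: Path, report: Path) -> List[str]:
    """Build a paired-end kraken2 invocation as an argument list."""
    return [
        "kraken2",
        "--db", str(db_dir),
        "--paired",
        "--report", str(report),
        str(read1),
        str(read2),
    ]
```

Keeping the database outside the image means a taxonomy update only changes an input file, at the cost of localizing and extracting the tarball on every run.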