niemasd / ViralMSA

Reference-guided multiple sequence alignment of viral genomes
GNU General Public License v3.0
60 stars 9 forks source link

Reference to use for Viruses with multiple Serotypes #38

Closed Rohit-Satyam closed 1 year ago

Rohit-Satyam commented 1 year ago

What reference should one use if there are multiple serotypes of a Virus. I understand that sequences belonging to each serotypes must be aligned separately but what if there is no RefSeq available for each serotype? Kindly guide.

niemasd commented 1 year ago

Thanks for reaching out! Any high-confidence (i.e., high coverage with no/minimal N) genome with high sequence identity with your sequences of interest should work. Do you have specific sequences in mind that you would be able to share? BLASTing a few of the sequences might bubble up potential options

Rohit-Satyam commented 1 year ago

Actually I am dealing with serotypes of FNDV viruses. There are currently 7 serotypes: A, O, C, Asia 1, SAT 1, SAT 2, and SAT 3. But Refseq is only present for Serotype O only. Though there are complete assemblies for other serotypes, there is no refseq as such.

niemasd commented 1 year ago

Any complete assembly for the other serotypes should work fine, assuming it was assembled with high coverage and is high-confidence! I just default to RefSeq for the default options in ViralMSA because I can be confident in their quality, but ViralMSA supports any arbitrary reference sequences the user wishes to provide. I would suggest trying out the complete assemblies for the other serotypes if they seem high quality

Rohit-Satyam commented 1 year ago

Thanks @niemasd. I just realized that genomes of the FNDV within a single serotype diverge drastically and therefore, I will not divide them by serotype and then align. Do you have recommendations to remove such outlier genomes?

I wish to add one last point before I close this issue. I observe the following warning no matter which aligner I use for alignment

[2023-05-12 19:19:42] WARNING: Some sequences from the input are missing from the output. Perhaps try a different aligner or reference genome?

When I check the .aln files using grep '>' file.aln | wc -l I observe all the sequences to be present. Isn't this warning misleading? Just thinking.

niemasd commented 1 year ago

Thanks for following up!

I just realized that genomes of the FNDV within a single serotype diverge drastically and therefore, I will not divide them by serotype and then align. Do you have recommendations to remove such outlier genomes?

Hmm, are the outliers "real" or are they erroneous in some way? If the drastic divergence is "real", I would caution against removing them, as I imagine they would provide important information. Though I don't know details of the specific context you're working in

Given that your sequences are so divergent, perhaps it would be better to use a tool like MAFFT to perform Multiple Sequence Alignment: reference-guided approaches like ViralMSA are intended for very closely-related genomes that are very similar to the reference genome, and this reference-guided approach may not be appropriate for this specific context if the viral genomes in a single serotype are so drastically diverged. MAFFT will likely be (much) slower than ViralMSA, but hopefully it will be fast enough for your purposes!

I wish to add one last point before I close this issue. I observe the following warning no matter which aligner I use for alignment [2023-05-12 19:19:42] WARNING: Some sequences from the input are missing from the output. Perhaps try a different aligner or reference genome? When I check the .aln files using grep '>' file.aln | wc -l I observe all the sequences to be present. Isn't this warning misleading? Just thinking.

Hmm, would you mind sharing your dataset so I can double check? This warning should only be printed if the number of IDs in the final output FASTA file (excluding the reference genome, if you're having ViralMSA include it in the output) is smaller than the number of IDs in the input FASTA file:

https://github.com/niemasd/ViralMSA/blob/4c0c01cf7e9844f2bd76696c513041245befb186/ViralMSA.py#L1091-L1092

The number of input IDs (num_input_IDs) is calculated here (it just counts the number of lines starting with > in the input file, equivalent to your grep piped to wc command):

https://github.com/niemasd/ViralMSA/blob/4c0c01cf7e9844f2bd76696c513041245befb186/ViralMSA.py#L1052

https://github.com/niemasd/ViralMSA/blob/4c0c01cf7e9844f2bd76696c513041245befb186/ViralMSA.py#L158-L160

The number of output IDs (num_output_IDs) is calculated while the SAM output of the aligner is converted into a FASTA MSA (each sequence that's written to the output file increments the count by 1):

https://github.com/niemasd/ViralMSA/blob/4c0c01cf7e9844f2bd76696c513041245befb186/ViralMSA.py#L1089

https://github.com/niemasd/ViralMSA/blob/4c0c01cf7e9844f2bd76696c513041245befb186/ViralMSA.py#L1007-L1042

So it should be correct, but if you have an example file that's triggering this, I can double check

Rohit-Satyam commented 1 year ago

Yes I can share the dataset since I got this from NCBI Virus database all these sequence are publicly available. I asked my lab members about FNDV genome being similar within a serotype. Here is what they said:

Yes, genomes from the same serotype might differ. Virus are categorized via serotyping according to their surface antigens. However, genetic recombination or mutation in the viral genome can cause it to evolve quickly, resulting in significant genomic variety even within a single serotype. Even when two viral genomes of the same serotype are very similar, they may differ in their host tropism or pathogenic potential. As a result, while serotyping can be a valuable tool for determining a virus' identity, it is not always an accurate indicator of genetic diversity or potential for virulence.

When I ran pyANI, many FNDV genomes failed in terms of coverage and similarity. I also saw a similar pattern when I ran MeshClust v3 to see if I can cluster the genomes with similarity cutoff of 90%. I saw that many genomes were making their own separate cluster without any member. I checked the assembly completeness and the N content is quite low (< 1-2%) so I can't blame the assembly quality. Currently idk why those sequences are divergent but my goal here is to identify conserved regions for primer designing so maybe I will remove them for the time being.

Maybe if ViralMSA fails to align these sequences and can dump these too divergent genomes as a separate fasta file, that would be quite helpful. However for the current dataset (see attachment below) I can see that all 899 of them aligned when viewing the .sam.log file

[M::main::9.705*1.00] loaded/built the index for 1 target sequence(s)
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1
[M::mm_idx_stat::9.705*1.00] distinct minimizers: 1381 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.938; total length: 8201
[M::worker_pipeline::11.693*6.14] mapped 899 sequences
[M::main] Version: 0.1-r41
[M::main] CMD: unimap -t 112 --score-N=0 --secondary=no --sam-hit-only -a --cs -o /home/subudhak/Desktop/fmdv/results/02_refMSA_unimap/drep.fasta.sam /home/subudhak/.viralmsa/NC_039210.1.fasta_HASH_00206f45143b40458c9a417813d04c28/NC_039210.1.fasta.umi /home/subudhak/Desktop/fmdv/results/drep/drep.fasta
[M::main] Real time: 12.976 sec; CPU: 73.066 sec; Peak RSS: 7.111 GB

Command used was

./bin/ViralMSA/ViralMSA.py -e rohitsatyam102@gmail.com -s results/drep/drep.fasta -o results/02_refMSA_unimap -r raw/NC_039210.1.fasta  --aligner unimap

drep.fasta.zip

niemasd commented 1 year ago

Thanks for following up! I was able to run your dataset, but I noticed you ran it without --omit_ref, which means that the reference genome is indeed included in your output. When I run it with --omit_ref, I do indeed get 1 fewer sequences in the output (898) vs. in the input (899):

$ ViralMSA.py -e niemamoshiri@gmail.com -s drep.fasta -o tmp -r NC_039210.1.fasta --omit_ref
$ cat tmp/drep.fasta.aln | grep "^>" | wc -l
898
$ cat drep.fasta | grep "^>" | wc -l
899

To find which sequence was missing, I was able to use diff:

$ diff <(cat tmp/drep.fasta.aln | grep "^>" | sort) <(cat drep.fasta | grep "^>" | sort)
1a2
> >AF136371.1

So it seems like AF136371.1 was the problematic sequence in the bunch. When I pulled it up in the input FASTA file, I see that it's super short:

>AF136371.1
CGGGGTGATCTAGGGCCTCCTGCGCTGAGAATCGCTGCACAACTTCCTGCCTCCTTTAACTTTGGTGCAATCCAAGGCAC
GACCATCCACGAGCTCCTCGTGCGCATGAAGCGTGCCGAGCTCTACTGCCCCAGGCCACTGCTGGCAGTAGAGGTGTCTT
CTCAAGACAGACACAAACAGAAGATCATTGCACCTGCCAAACAACTCCTG

So looks like things are working as expected! If you really want to force that one to be included, MAFFT was able to run on the dataset in less than a minute:

mafft --auto --thread 4 drep.fasta > drep.fasta.mafft.aln.txt

Here's the output file: drep.fasta.mafft.aln.txt

I looked up the one problematic sequence (AF136371.1) in the resulting MAFFT alignment, and as expected, it's aligning to the rest of the sequences pretty badly: bits and pieces of it align all over the place

Regarding the read mapper's log file, the statement "mapped 899 sequences" is unfortunately a bit misleading: it means that it ran its mapping algorithm (regardless of success or fail) on 899 input sequences. It doesn't necessarily mean that all 899 successfully mapped to the reference genome: just that it ran its mapping algorithm on all of them. If you open the SAM file, you'll notice that AF136371.1 isn't in there

Regarding outputting a list of sequences that failed to map, I'll consider potentially adding that in a future update! Thanks for the suggestion!

In this case, given that the viruses are so diverged (judging by the MAFFT alignment), and given that it was a small enough dataset to run using MAFFT in reasonable time (less than 1 minute using 4 threads), I would suggest avoiding reference-guided aligners for MSA of these sequences. ViralMSA (and reference-guided alignment in general) is really meant for cases where you have a lot of sequences (i.e., infeasible for MAFFT and similar) that are near-identical to the reference genome

I hope that helps!

Rohit-Satyam commented 1 year ago

Thanks for taking out time to test my data. yes this escaped my eye when I calculated assembly stats using seqkit. I really appreciate your input on this. Having my doubts cleared, I will close this issue now. Once again Very many thanks!!

niemasd commented 1 year ago

No problem at all 😄