roblanf / sarscov2alignments

0 stars 0 forks source link

Shameless plug for my new tool, ViralMSA #1

Open niemasd opened 3 years ago

niemasd commented 3 years ago

Saw that you created this repo, so I wanted to throw in a shameless plug for my new tool, ViralMSA 😄 It performs reference-guided MSA by wrapping around Minimap2. You can find the manuscript here:

https://doi.org/10.1093/bioinformatics/btaa743

Once you have minimap2 in your PATH, you can run ViralMSA as follows:

ViralMSA.py -e email@address.com -s sequences.fas -o output -r SARS-CoV-2

Last time I tested it, it aligned ~100K full genomes in ~1 minute. Note that it omits insertions wrt the reference genome (but you can choose a reference that has insertions to combat this)

roblanf commented 3 years ago

Thanks for pointing it out. Using read mappers is a neat idea. Do you have any examples where the alignments you get for SARS-CoV-2 are superior to what I do here (which is basically using MAFFT as one would use a read mapper, to do lots of pairwise alignments to a reference).

Speed isn't an issue with what I'm doing - I can already align 100K genomes in minutes. So while ViralMSA would be quicker, alignment is so far from being the rate determining step that I don't need to worry.

But if there are reasons to prefer ViralMSA on grounds of accuracy with SARS-CoV-2 data, I'm certainly interested. I regularly look at the alignments by eye, and I have to say I haven't noticed a single alignment issue with MAFFT. That's not to say that they don't exist, just that they are relatively rare.

niemasd commented 3 years ago

In the simulations I ran with the curated alignments from LANL for Ebola, HCV, and HIV-1, the alignments I obtained with ViralMSA had slightly less accuracy with respect to estimated pairwise distances (computed using the tn93 tool of HIV-TRACE), but trees I inferred from the ViralMSA alignments had far lower unweighted RF distance from the trees I inferred from the true curated alignments than did the trees I inferred from the MAFFT alignments, suggesting that ViralMSA was potentially the better choice for downstream phylogenetic analysis. You can find the specific measurements of accuracy in the paper

That being said, the experiment was very limited (just 3 curated alignments from LANL), and it was using viral datasets containing sequences that were far more divergent than the SARS-CoV-2 datasets I've seen on GISAID. The only thing I would consider might be a benefit of using ViralMSA wrapping around Minimap2 over MAFFT (though this is 100% speculation) is that it may handle sequencing errors a bit better given that Minimap2 is designed to work with error-prone long read technologies

roblanf commented 3 years ago

Got it. OK. Will give it some thought. Given the way I use MAFFT (which is just pairwise to a reference, which works great for SARS-CoV-2) I suspect the results will be almost identical.

Another question - I know you ignore insertions (I do too), but do you have a built-in way of keeping track of them in ViralMSA? E.g. a VCF as additional output that lists all insertions relative to the reference.

Even better (just getting into wish-list territory here) would be a something where you could iteratively update the reference sequence with insertions and/or variants that pass certain thresholds. E.g. to put ambiguities into the reference for common variants, and to put insertions into the reference for common insertions. When we start to get into that sort of territory, I think there's a lot of power that you could get very quickly...

On Tue, 1 Dec 2020 at 17:00, Niema Moshiri notifications@github.com wrote:

In the simulations I ran with the curated alignments from LANL for Ebola https://github.com/niemasd/ViralMSA-Paper/tree/master/data/EBOLA, HCV https://github.com/niemasd/ViralMSA-Paper/tree/master/data/HCV, and HIV-1 https://github.com/niemasd/ViralMSA-Paper/tree/master/data/HIV1, the alignments I obtained with ViralMSA had slightly less accuracy with respect to estimated pairwise distances (computed using the tn93 https://github.com/veg/tn93 tool of HIV-TRACE https://github.com/veg/hivtrace), but trees I inferred from the ViralMSA alignments had far lower unweighted RF distance from the trees I inferred from the true curated alignments than did the trees I inferred from the MAFFT alignments, suggesting that ViralMSA was potentially the better chose for downstream phylogenetic analysis

That being said, the experiment was very limited (just 3 curated alignments from LANL), and it was using viral datasets containing sequences that were far more divergent than the SARS-CoV-2 datasets I've seen on GISAID. The only thing I would consider might be a benefit of using ViralMSA wrapping around Minimap2 over MAFFT (though this is 100% speculation) is that it may handle sequencing errors given that Minimap2 is designed to work with error-prone long read technologies

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/roblanf/sarscov2alignments/issues/1#issuecomment-736241991, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG2SE5ZGWWWTK6PGTBNTZDSSSA6VANCNFSM4UIP3KFQ .

-- Rob Lanfear Division of Ecology and Evolution, Research School of Biology, The Australian National University, Canberra

www.robertlanfear.com

niemasd commented 3 years ago

do you have a built-in way of keeping track of them in ViralMSA? E.g. a VCF as additional output that lists all insertions relative to the reference.

Not yet, but that's on my to-do list 😄 One idea I had was the following: for each potential insertion site, create a mini-sequence dataset containing just the insertion sequences from each sample, do regular MSA on just those (e.g. with MAFFT), and stick the MSA of the insertions back into the insertion site

Even better (just getting into wish-list territory here) would be a something where you could iteratively update the reference sequence with insertions and/or variants that pass certain thresholds. E.g. to put ambiguities into the reference for common variants, and to put insertions into the reference for common insertions. When we start to get into that sort of territory, I think there's a lot of power that you could get very quickly...

I agree that this would be super interesting and generally useful for reference-guided MSA. I haven't looked into it too much myself, and if I go the route of developing something nice, it would probably be as its own tool, e.g. "Given multiple potential reference genomes, somehow make the 'best' supergenome that includes as many insertions as possible" or something. I think it would require its own nontrivial work --> tool because it may be good to generalize for things like recombination