Alignment method selection - Annelida branch

sadamowi commented 7 years ago

Hi Matt,

I created this issue for us to discuss the alignment algorithm choice, as I remembered I didn't want to bury discussions in email. So, I will try to use the issue tracker more judiciously! I also got a lot of gaps in Polychaeta alignment, as you mentioned. I am reading up on the different alignment options and will try to see if there is an option that would give us a better alignment. ClustalOmega seems to be only for amino acid sequences. Ideally, we will want to consider the amino acid translation when aligning the nucleotides.

If you have any thoughts to share based upon any other alignment options you have tried, please let me know.

Cheers, Sally

m-orton commented 7 years ago

Just reposting my sources from the email: -original paper on clustalomega comparing to other alogrithms: http://msb.embopress.org/content/7/1/539 -current documentation for latest version of clustalomega: http://www.clustal.org/omega/README Another useful link as well -msa R package description (details on commands): https://bioconductor.org/packages/release/bioc/manuals/msa/man/msa.pdf

m-orton commented 7 years ago

I'm wondering if I should email Ulrich Bodenhofer? He's the main author of the msa R package. Maybe I can check if he has any suggestions on alignment parameters for us.

sadamowi commented 7 years ago

Hi Matt,

Thanks for posting this information. The original paper deals with protein sequences. Although the Clustal link indicates that it now accepts DNA sequences, that doesn't seem to be implements in the R package "msa". The msa manual indicates that for ClustalOmega, only amino acid matrices are available: "BLOSUM30", "BLOSUM40", "BLOSUM50", "BLOSUM65", "BLOSUM80", and "Gonnet". It also states: "Since ClustalOmega only allows for using built-in amino acid substitution matrices, it is hardly useful for multiple alignments of nucleotide sequences."

So, I think our options would be:

go outside R to use ClustalOmega with options suitable for nucleotides
explore using Muscle instead on the DNA sequences, using msa within R (the msa documentation indicates that there are nucleotide settings for Muscle). This could be the most feasible option for us to just switch to Muscle for getting the pipeline running again using reasonable settings for nucleotide sequences.

I am also currently working with the Polychaeta alignment in MEGA. There are still major gaps, and I am checking into the identities of several quite divergent sequences that are causing the large gaps. Possibly, one more quality filter could be needed, but I am still working on this to try to ascertain if the divergent sequences are likely legitimate biological sequences from polychaetes (e.g. as opposed to pseudogenes or major contaminants).

Actually, I have another idea for dealing with these highly divergent sequences.... is it possible to run a preliminary alignment and then a distance matrix. Then, kick out any sequences that don't have any pairwise distances less than 0.15 and then re-run the alignment after kicking out such sequences? We already decided to limit the pairs to more closely related sequences. So, it would be reasonable to kick out highly divergent sequences that are creating the alignment problems.

Do you have thoughts on any of these suggestions above?

Another option that could make sense would be for me to run a few more phyla and see what issues crops up before we make any final decision.

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

I don't think we need to contact the author of the package. I think the documentation is sufficient for us to understand what the package currently can do and not do. I'd appreciate your thoughts on the feasibility of the above suggestions. Thanks very much.

Cheers, Sally

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml

From: Matthew Orton notifications@github.com Sent: Monday, November 28, 2016 5:06:55 PM To: m-orton/R-Scripts Cc: Sarah Adamowicz; Author Subject: Re: [m-orton/R-Scripts] Alignment method selection - Annelida branch (#10)

I'm wondering if I should email Ulrich Bodenhofer? He's the main author of the msa R package. Maybe I can check if he has any suggestions on alignment parameters for us.

- You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/m-orton/R-Scripts/issues/10#issuecomment-263409844, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AV89Ov2NtfBF586VMPFYRE3Jn96vlLlvks5rC1B_gaJpZM4K-LWm.

m-orton commented 7 years ago

Hi Sally,

I think we should try using Muscle instead since it supports nucleotide alignments. That way we can contain the alignment within the script and it should just be a matter of changing a few commands. I can also look into what parameters to use.

I can edit the script to do a preliminary alignment and remove sequences with distances greater than 0.15 before doing a final alignment. I think that is a good way of eliminating the more divergent sequences. I will try and get this done today.

Also, thank you for pointing out the issues with ClustalOmega, that was an oversight on my part for not reading through the documentation closely enough. I think this is probably the source of a lot of the alignment problems I was having with the script.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

OK that sounds great to try Muscle. I think, though, that the algorithm was not the only issue with the Polychaeta. I have tried other alignment approaches in MEGA and still got those very gappy alignments. Some gaps might be expected, as entire amino acid indels can occur in COI, but I"ve never seen a legitimate COI alignment so gappy. Those "divergent" sequences are interesting, but I think those are beyond the scope of the present project. That could actually make an interesting project for a future student... to seek and explore the nature of highly divergent barcodes. (Are these simply poorly sampled components of biodiversity up until now? Are these contaminants? pseudogenes? Is there evidence of highly divergent protein sequence and function in unusual environments?) For us, I think we don't want those sequences, as we are focusing on latitude shifts, not diversification deep in the fossil record. We also want to avoid severe saturation issues for our study (i.e. many substitutions at third codon positions would mask rate differences if we use highly divergent sequences), and hence we had maximum divergence criteria for our sister pairs.

I would be more than happy to hear any other ideas for dealing with this. If you think the two-alignment approach, with an intermediate step for ejecting highly divergent sequences, can be readily implemented, then I am on board for trying that approach. We don't need those sequences anyhow for our pairs, and kicking them out may improve the alignment. With such an approach, it may be more feasible to allow the analysis at class level for the marine phyla, allowing unidentified sequences to be included in pairs if they have genetic neighbours. Thank you very much for giving this your attention.

Cheers, Sally

sadamowi commented 7 years ago

PS. I just completed a realignment of the Polychaeta alignment2a in MEGA using Muscle after deleting 3 highly divergent sequences. I got a much better alignment. There were still some gaps, but this was at about the expected level for an taxon such as Polychaeta. I think this alignment would be fine for further analysis, especially using pairwise deletion for downstream distance analysis. So, I think that our approach discussed above is likely to be successful in R as well. I am very optimistic that we are nearly there with a robust pipeline that will work across multiple taxa. :-)

m-orton commented 7 years ago

Thats good to hear that it seems to be working better with Muscle! Another contributing factor too is that in order to grab the centroid sequences I was using ClustalOmega so the centroid sequences being chosen may not have been accurate either. I am currently working on the script to make the changes you suggested.

sadamowi commented 7 years ago

OK that's great! Good thinking to fix the centroid finder too, so everything is set with reasonable choices.

sadamowi commented 7 years ago

Dec. 5th version of the code. Matt has switched to a different package to conduct the sequence alignment. Thanks for finding that. I will inspect the settings chosen and the outputted alignments, and then this issue should be close to resolved.

m-orton commented 7 years ago

Hi Sally, just to add to the email I sent about the script changes and the new Muscle package. The package has a few parameters that can be adjusted.

Documentation of muscle algorithm with all parameters can be found here: http://www.drive5.com/muscle/muscle_userguide3.8.html If you go to section 5 it will give a list of all parameters with all defaults.

I think there are a few that may become important to our analysis: diags = TRUE - parameter will enhance the speed of the algorithm with a potential loss to accuracy, this may become important for the larger insect taxa possibly since the muscle algorithm takes a while to process. I guess the question is are we willing to sacrifice accuracy for decreased computation time? Or is this irrelevant since we may be able to use the compute canada resources?

gapopen - parameter will determine what level of stringency to assign to gaps in the alignment. We may want to consider changing this depending on the taxa being run. For instance really gappy alignments (like Polychaeta for example) we may want to consider increasing the gap open penalty so it penalizes gaps more heavily.

maxiters - parameter will determine how many iterations the muscle algorithm goes through. Apparently the default is 16 but they suggest in the documentation of Muscle that for large alignments, it is better to restrict the number of iterations to 2. They say for large alignments:

"If you have a large number of sequences (a few thousand), or they are very long, then the default settings of may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. This is done by the option –maxiters 2, as in the following example."

m-orton commented 7 years ago

forgot to add the example they use for maxiters is for a command line not for the muscle package. In the muscle package it would be muscle(dnastringset, maxiters = 2).

sadamowi commented 7 years ago

For completeness of the records for this issue, relevant sections of email from Matt (sent Dec. 5 2016):

Script: https://github.com/m-orton/R-Scripts/blob/m-orton-Annelida-1/EvolutionaryComparisonPipelineAutomated.R

The script now uses a different package called Muscle for the alignment steps. I'll go into more detail on the issue tracker but the package can perform multiple sequence alignments on DNA using the default settings of the package with the Muscle algorithm. I have emailed the author to confirm which substitution matrix is being used but apparently the alignment command can detect DNA is being used and align accordingly. In the package vignette you'll notice the example alignment also uses DNA as well. Details can be found here on the package:

http://www.bioconductor.org/packages/release/bioc/vignettes/muscle/inst/doc/muscle-vignette.pdf https://www.bioconductor.org/packages/release/bioc/manuals/muscle/man/muscle.pdf

I tried using muscle with the msa package but R kept crashing on me when I tried to run the command so I searched for an alternative package. I have attached the results of the alignment and the final trimmed alignment of Polychaeta looks much better. There are still a few gaps I noticed but overall much improved. The new package also outputs a nice progress indicator in the console while the algorithm is running as well which I attached a screenshot of.

The alignment step in the centroid finder has been changed to using Muscle instead so the centroid finder should be reliably finding the correct centroid sequence per bin.
Two alignments of classes are now performed, a preliminary alignment and distance matrix as you had suggested that will remove sequences that are too divergent from each class (greater than 0.15). Then a final alignment with the reference sequence once these sequences have been removed.

sadamowi commented 7 years ago

Hi Matt,

I am really happy with these solutions. Thank you.

I will submit a few comments in the response to the topics you raised. This comment will deal with the gap penalty setting.

I would not suggest that we change the default gap penalty setting for now. For Polychaeta, the 2-step alignment process (omitting the sequences with no close relatives) largely solved the alignment issues. yay! While a few gaps remain in the final alignment, I think we want these gaps. For example, I notice in one sequence that there are five Ts near the end that probably should have been 4 Ts. Such base-calling errors are more frequent at the ends of the sequences. If this happens in the middle, likely this would cause a stop codon at some point in the sequence, causing it not to get a BIN. However, when this happens near one of the ends, then even if a few amino acids are changed due to the shift in reading frame, there may not be enough sequence left for a stop codon to crop in. So, what we want is a gap in such a position. By using pairwise deletion in downstream analysis, we would get the correct distances.

This issue would be more of a problem if we wanted to analyze the translated sequences as well in the future. If we wanted to do so someday, then we would have to reconsider this issue. We also might be able to scrape the amino acid sequences straight from BOLD. I'm not sure about that, but I know that the amino acid sequences are stored in BOLD. So, presumably there would be a solution. That is beyond the scope presently.

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

This thread has gotten really long. I thing that we have solved the original issue we had relating to the alignment. hooray. So, I will close this issue and start new issues for the additional specific decisions that need to be made or tasks that need to be performed. I think it will be easier if we confine each "issue" to a narrow issue rather than a complex series of issues. Great work resolving this and implementing the check for highly divergent sequences.

Cheers, Sally

m-orton / Evolutionary-Rates-Analysis-Pipeline

Alignment method selection - Annelida branch #10