m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
7 stars 1 forks source link

Discussion on running the phylum Mollusca #24

Closed sadamowi closed 7 years ago

sadamowi commented 7 years ago

Dear Matt,

Through email (Dec. 8th), I just sent you an updated Excel file that contains reference sequences for Mollusca. As we previously discussed for other phyla, I only selected references for those classes that had candidate pairs according to your preliminary analysis.

Note that for three of the classes, the sequences are 658 bp, as typical for many animals. For two classes, the sequences are 661 bp. These sequences are amplified by primers that bind in the typical "Folmer" binding region of COI. The Folmer primers and many slight variants of those primers bind there. Some molluscs truly have amino acid indels (insertions or deletions) compared to the most common length for this gene region. Note the difference is a multiple of 3 nucleotides, reflecting one amino acid difference. I think the 661 bp sequences are real biological sequences for the barcode region. I suggest to keep the end trimming to the same number of nucleotides as the 658 bp sequences. This will keep the same gene region for comparison across taxonomic groups and will also keep the same amount of trimming in terms of number of nucleotides. The expectation would that there would be a slight length different in the final alignments among the classes. But I think that is OK. Evolutionarily, that is the same gene region being compared.

Depending upon how much time you have, you could either try running this in the near future, and possibly playing with the alignment settings (e.g. maxiters 2), to see how that influences the mollusc alignments. Or, you could wait for me to get back to you about my findings for the other phyla regarding alignment settings. I have a hunch (to be confirmed!) that we may be able to get away with those less stringent settings. COI is protein-coding and is a relatively conserved marker. Alignment settings are much more important and affect the result much more when non-coding regions are being aligned or when very evolutionarily distant taxa or ancient gene duplications are being investigated.

Another note ... most of the classes have far fewer than our original sample size cut-off of BINs for analyzing in one alignment (fewer than 2000 BINs). The only exception within Mollusca is the class Gastropoda, with ~11.5K BINs. So, you could consider analyzing all the other classes first, just subtracting Gastropoda from the dataset, if you don't want the analysis to get stuck on preliminary trial runs.

Let me know how it goes! Thank you very much for running this phylum.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks for your input on Mollusca and for getting the reference sequences ready for it.

I think what i'll do is a runthrough of Mollusca with the alignment on default settings. If the computation time is too lengthy then I'll consider reducing iterations.

I plan to do a runthrough shortly so ill keep you posted!

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally,

Just sent the results I have for Mollusca. Ive had the opportunity to runthrough Mollusca twice now, once with maxiters set to 2 and once with maxiters set to 3. With maxiters set to 2, the alignments run very quickly within a few minutes but with an increase of maxiters to 3, the computation increases to several hours.

I generated 130 unique pairings which is slightly more than the original estimate I had for Mollusca at 123 pairings. I should also mention this is after I updated to very secure dishes.

I found on both occasions the alignments for Bivalvia and Gastropoda to be very gappy even with the removal of divergent sequences. However I did notice an improvement in the alignment changing from maxiters=2 to maxiters=3. The other classes of Cephalopoda, Polyplacophora and Caudofoveata seem to align quite nicely. I was wondering if I could get your feedback on how to improve the alignments for these taxa.

I also need your feedback on how to average the pseudoreplicate results being generated. I have attached the pseudoreplicate result in the excel file with the pairing results. You will notice that for there are instances where multiple ingrouppairings are associated with another ingrouppairing.

For instance 80 is closer to 84, 81 is closer to 84 and then 84 is closer to 98. So in that case, would 80, 81 and 98 all be averaged with 84 together or do they get averaged separately?

In another instance, 100 is closer to 81 and 96 is closer to 100. So does 100 get averaged with both of these pairings together or are they averaged separately. Im a little confused on how to average in these instances and was wondering how I should modify the code for averaging?

Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

Thank you very much for your progress on this phylum and for sharing your files. For the three smaller classes (Cephalopoda, Caudofoveata, and Polyplacophora), I agree that the alignments look great. I looked at these in MEGA and also translated them. All good.

That is good news that the run time is just a few hours (not days) even when there was a large class in this dataset. I think that will be a positive thing for our Mollusca results.

I look at the alignment for Gastropoda, and indeed -- as you indicated -- it is highly problematic. I looked at the first sequence in the alignment that was causing problems and found out it is a 28S ribosomal gene sequence, not COI. BOLD does house multiple markers. Therefore, I suggest to ensure that the API download was done correctly, only using the marker COI-5P. (It is also possible that someone misspecified the marker and that BOLD doesn't catch that; but I suggest to check your API first -- hopefully this is an easy fix!)

At first, I was a bit surprised that such highly divergent sequences weren't caught by the step for ejecting highly divergent sequences. However, there do appear to be multiple ribosomal sequences in the dataset, and so these wouldn't have been evicted by our approach to the divergent sequences. Those sequences found a match to each other according to our criteria, even though they were highly divergent from the result of the dataset.

If this can be fixed by correcting the API download then great! If that doesn't do it (e.g. BOLD has the marker stored wrong), then what we could do is impose a distance criterion against the reference. This would be quite liberal, and the aim would be to kick out sequences that are the wrong gene. If you think the API phrasing was fine, then perhaps we can consider this option, and I would do some distance calculations to figure out what our distance criterion would be. For example, we might say that to be kept, sequences must be less than 0.50 divergent from the reference. Again, that is a guess, and I'd have to have a look at what makes sense.

Also, if we find that BOLD has some markers stored incorrectly, then we should alert BOLD. I can contact them once we sort this out.

I found the same problem in the Bivalvia alignment.

I will reply separately about the other issues you raised.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Thanks for your response and your input on the alignment. I was unaware BOLD also used ribosomal DNA for some entries so Im glad you caught this. The good news is I was able to remove the ribosomal sequences by filtering according to the marker column in the initial tsv by selecting only those entries with COI-5P. I wanted to filter directly in the url by inserting "&marker=COI-5P" but this command doesnt seem to be working with the BOLD API for some reason.

I redid the alignment with maxiters=2 to quickly go through each alignment and I think the result is greatly improved for both Bivalvia and Gastropoda. The alignments of the other classes seem to remain untouched. Im going to send you the updated alignments in an email.

Just let me know if you think I should increase maxiters to 3 for the final results or if you think maxiters=2 is good enough for the alignments.

Best Regards, Matt

sadamowi commented 7 years ago

From Matt:

Hi Sally, Thanks for your response and your input on the alignment. I was unaware BOLD also used ribosomal DNA for some entries so Im glad you caught this. The good news is I was able to remove the ribosomal sequences by filtering according to the marker column in the initial tsv by selecting only those entries with COI-5P. I wanted to filter directly in the url by inserting "&marker=COI-5P" but this command doesnt seem to be working with the BOLD API for some reason. I redid the alignment with maxiters=2 to quickly go through each alignment and I think the result is greatly improved for both Bivalvia and Gastropoda. The alignments of the other classes seem to remain untouched. Im going to send you the updated alignments in an email. Just let me know if you think I should increase maxiters to 3 for the final results or if you think maxiters=2 is good enough for the alignments. Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for addressing this issue of the markers. I'd like to suggest to incorporate this edit into the Annelida branch as well, as this has been our key working branch for solving these issues.

(As another note, this is interesting to know that it's possible to download everything from BOLD and then filter by marker. Later on, possibly after review, we might wish to run alternative markers. Reviewers may ask to see COI compared with other markers. The dataset will be much smaller for other genes, but that could be interesting. I suggest we leave this topic aside for now.)

The alignments for Gastropoda and Bivalvia are greatly improved. Some of the residual gaps are relating to a small number of sequences with an editing error, but a small number of gaps seem to be suboptimally placed. If it isn't too onerous, is it possible to run Mollusca with maxiters3 as well? We could compare the alignments and also, importantly, compare the results. i.e.: Does this choice make a difference?

For Annelida, I have found that imposing no constraint vs. maxiters2 made a very slight but nearly negligible difference in the results. For that trial, I left maxiters to 2 for the centroid finder portion of the code (as in the current Github version) and then went with no constraint vs. maxiters2 for the second and third alignments.

I would expect Mollusca to be the trickiest group in terms of alignments, relating to the deep evolutionary history as well as the indels. If maxiters=2 (vs. a higher number) makes no to very minimal difference for this group, then we will be likely safe to go with maxiters2 for heading into the Chordata and Arthropoda. However, given the issues with Mollusca, I think we would be justified in going with more iterations, if needed, for that phylum and then could still justify using a different setting for other groups. I will also explore Cnidaria and Echinodermata.

Best wishes, Sally

sadamowi commented 7 years ago

Does the confusing pseudoreplicate issue remain after removal of the ribosomal sequences?

m-orton commented 7 years ago

Hi Sally,

No problem, i'll do a run of Mollusca on maxiters=3 and I'll keep the results from maxiters=2 so we have both result sets to compare. I'll also update the Annelida branch with filtering according to COI-5P

Looking at the results from Mollusca, I only get 14 pseudoreplicate pairs now, each pair having unique ingrouppairings so it seems this is no longer an issue for now.

Best Regards, Matt

m-orton commented 7 years ago

Also, one thing I should mention is the new code for filtering according to COI-5P requires another download of the tsv. This is because the markercode column (the column identifying which genetic marker is being used) has to now be included in the initial tsv dataframe download.

Best, Matt

m-orton commented 7 years ago

Hi Sally, just letting you know I made a google docs for the results of the pipeline which I shared with your email address.

I was able to run Mollusca on maxiters 3 and the result appears to be pretty much the same in terms of the alignment and pairing results. I've posted my result sets for both max=2 and max=3 in the google doc within a folder called Mollusca. This also includes the alignment fas documents.

Best, Matt

sadamowi commented 7 years ago

OK - thank you Matt for trying maxiters = 3.

I will post results there too over the coming days.

I am finding that maxiters = 3 and diags = TRUE (I'm using that for all three alignments) is working well for Annelida.

I am now working on the issue of the best length for trimming and will post to that issue soon too. I will then confirm results from the alignment and trimming parameters decisions with Cnidaria and Echnidermata.

Then, hopefully we'll be good to go for saving and/or generating the final results!

Best wishes,

Sally

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml


From: Matthew Orton notifications@github.com Sent: Wednesday, December 14, 2016 11:12:49 PM To: m-orton/R-Scripts Cc: Sarah Adamowicz; Author Subject: Re: [m-orton/R-Scripts] Discussion on running the phylum Mollusca (#24)

Hi Sally, just letting you know I made a google docs for the results of the pipeline which I shared with your email address.

I was able to run Mollusca on maxiters 3 and the result appears to be pretty much the same in terms of the alignment and pairing results. I've posted my result sets for both max=2 and max=3 in the google doc within a folder called Mollusca. This also includes the alignment fas documents.

Best, Matt

- You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/m-orton/R-Scripts/issues/24#issuecomment-267234327, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AV89Oo4O4nAQwqW3zFvz1To1O3LiZB0yks5rIL5BgaJpZM4LIfZD.

sadamowi commented 7 years ago

Hi Matt,

I believe that the Mollusca-specific issues are resolved, such as uncovering the need to account for multiple markers being present on BOLD. Thank you very much for addressing that issue and for running Mollusca using various settings. I believe this issue can be closed. We can consider the Mollusca results, together with other results, for finalizing the alignment settings discussion.

Best wishes, Sally