m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large-scale phylogenetic comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
New Results Thread #42

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Making a thread to post updates on results.

Chilopoda results are now up on dropbox. Alignment looked great, no gaps or issues. Chilopods only had one pairing but it was a really small dataset.

m-orton commented 7 years ago

Just went through and updated Mollusca, Cnidaria, Branchiopoda and Annelida server runs on dropbox with relative distance dataframes.

m-orton commented 7 years ago

Collembola results now up, alignment looks really good. No issues with this one.

sadamowi commented 7 years ago

OK great. Thank you Matt. I am starting to go through all of these results and will continue throughout this week. I am also collating the results into a results table for the manuscript. I will add "SJA" to folder names for taxa for which I have checked the final trimmed alignments and the results. Please do direct my attention towards the results any time you would like my input on any particular issue. Otherwise, I will just systematically go through the results.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

I posted the alignment results I'm getting for Echinoderms and wanted to get your thoughts on them. It appears as though Holothuroidea and Echinoidea both have certain sequences that have been amplified further downstream of the reference start position. In Holothuroidea there are some instances of sequences starting 149-150 bp downstream of the reference sequence start position, and in other cases 169-170 bp downstream. In Echinoidea there are some instances of sequences being amplified 10, 20 and 40 bp downstream.

I think this might also be a potential explanation for why some sequences end up being shorter after trimming than reference seq length.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for posting these results. Likely, various sets of primers have been used for these taxa.

At this point, I'd like to suggest that we implement our solutions/rules consistently when encountering similar issues across taxa. Here, I suggest to implement the solution that I believe you applied to Mollusca, involving applying a final length filter (after aligning and trimming against the reference sequence).

Do you agree here?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, agreed, I will use the same length filter I used with Mollusca and should be able to post the results shortly. Also, I'm close to finishing the results for Malacostraca as well.

Best Regards, Matt
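
A rough sketch of what a final length filter of this kind could look like, applied after aligning and trimming against the reference. This is only an illustration in Python (the pipeline itself is written in R), and the function names, the `min_length` cutoff and the example sequences are all hypothetical; the thread does not state the cutoff actually used for Mollusca.

```python
def ungapped_length(aligned_seq: str) -> int:
    """Count nucleotides in an aligned sequence, ignoring '-' gap characters."""
    return sum(1 for base in aligned_seq if base != "-")


def apply_length_filter(trimmed: dict, min_length: int) -> dict:
    """Keep only sequences with at least min_length ungapped bases.

    trimmed maps a sequence ID to its trimmed, aligned sequence.
    min_length is a hypothetical cutoff, not the value used in the pipeline.
    """
    return {
        seq_id: seq
        for seq_id, seq in trimmed.items()
        if ungapped_length(seq) >= min_length
    }


# Hypothetical example: seq_c starts far downstream of the reference start,
# so it is mostly gaps after trimming and gets dropped by the filter.
trimmed = {
    "seq_a": "ACGTACGTAC",
    "seq_b": "----ACGTAC",
    "seq_c": "-------TAC",
}
kept = apply_length_filter(trimmed, min_length=6)
```

Sequences that were amplified downstream of the reference start end up short after trimming, which is why a filter at this late stage catches them.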

m-orton commented 7 years ago

Echinoderm full results are now up, alignments look good. Actually getting significant pvalues when you count members of all classes!

sadamowi commented 7 years ago

Thank you for re-running this, Matt. The new echinoderm alignments look very good.

m-orton commented 7 years ago

Maxillopoda results now up. I had to remove two bin sequences: ACM2241 and ACM2242 that were causing a 10 bp gap in the alignment. There were also a few cases of sequences amplified further downstream so I implemented the trimming after the trimmed alignment as with Mollusca. Alignment looks very good now.

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for preparing and posting these results. We discussed today that you will verify that the unique pairings key is working correctly. Thank you very much for checking on that. Please do let me know after you complete this check, and then I will resume my final checking of the alignments and also transfer the results to the table in the manuscript file.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, just to give a progress update on this,

I discovered the method I'm using for creating a pairing key does not create a truly unique key in large datasets. What ends up happening is in rare cases, lineages can be incorrectly paired together by the pipeline if they possess the same key and that key is shared by multiple pairings.

The good news is that Echinoderms, Maxillopoda, Mesostigmata, Branchiopoda, Chilopoda, Collembola, Annelida and Cnidaria are all free of this pairing key duplication issue. I went and checked through the pairing keys for each pairing to make sure there were no duplicates. So the results for these groups should be good to go to the results table.

You were right about Mollusca, it had two pairings that had lineages swapped due to this problem, causing high relative dist values. Malacostraca also had two pairings with this problem. Leps had a few each in NA, SA and EUR/AFR. So I'm figuring out a better solution for these groups and the rest of the groups yet to be done.

The good thing is that alignments should not be affected by this so alignments that are posted can still be confirmed. I only have to rerun from section 10 onwards since that is where the problem originates from. Currently working on a solution.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for checking into this issue in more detail and also letting me know which groups are good to go. Is it possible to generate unique keys by assigning sequential numbers?

That is very good news that this doesn't impact the alignments, as I know it was a challenge to get through those huge lep alignments.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, good news I think I came up with a way of creating a truly unique pairing key. I'll do my best to explain and I found this really interesting so this will be a long post.

The problem was that when I first filter the distance matrix according to the 0.15 criterion and create the initial pairing results dataframe, I get a mess of a dataframe that looks like this: [image: initialpairingresults]

This is why I can't assign a sequential number to each pairing to start off with, because the lineages of each pairing are not actually grouped together at that point - I need a way of ordering the dataframe such that I can group lineages together.
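
To illustrate why the two lineages of a pairing start out scattered, here is a small Python sketch (the pipeline itself is in R). It assumes the filter keeps off-diagonal entries of a symmetric distance matrix that fall below 0.15; the function name, the direction of the comparison and the toy matrix are assumptions made for the example, not taken from the script.

```python
def initial_pairing_rows(dist, threshold=0.15):
    """Scan a symmetric distance matrix row by row and collect every
    off-diagonal entry below the threshold, keeping its row and col index."""
    rows = []
    for i, matrix_row in enumerate(dist):
        for j, d in enumerate(matrix_row):
            if i != j and d < threshold:
                rows.append({"row": i, "col": j, "dist": d})
    return rows


# Hypothetical 4 x 4 matrix: lineages 0 and 3 form one pairing (0.05),
# lineages 1 and 2 form another (0.10); everything else is too distant.
dist = [
    [0.00, 0.50, 0.50, 0.05],
    [0.50, 0.00, 0.10, 0.50],
    [0.50, 0.10, 0.00, 0.50],
    [0.05, 0.50, 0.50, 0.00],
]
rows = initial_pairing_rows(dist)
order = [(r["row"], r["col"]) for r in rows]
```

In this toy case `order` comes out as (0, 3), (1, 2), (2, 1), (3, 0): each pairing appears twice with row and col swapped, and the two records for the 0-3 pairing sit at opposite ends of the table.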

The issue is then to figure out how to group lineages of a pairing together such that the row and column numbers (from the distance matrix) match for each lineage. Now if I try to simply order by ingroup distance alone (which is what I was doing initially) I get duplicates and lineages are incorrectly matched together: [image: ingroupdistpair]

To combat this problem I came up with a simple formula to create a pairing key that I could order by and thus group lineages together. I used ingroup distance * (row num + col num). This seemed to work most of the time, but the problem, as I discovered recently, is that it was still possible to have duplicates of this key, so that didn't work either.
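
A quick Python illustration of how that earlier key can collide (the pipeline is in R; the function name and the numbers here are made up for the example). Any two pairings with the same ingroup distance and the same row + col sum get the same key:

```python
def naive_key(ingroup_dist: float, row: int, col: int) -> float:
    """The earlier key described above: ingroup distance * (row + col)."""
    return ingroup_dist * (row + col)


# Two different hypothetical pairings with the same ingroup distance.
key_first = naive_key(0.04, 2, 9)   # row + col = 11
key_second = naive_key(0.04, 4, 7)  # row + col = 11 as well
collision = key_first == key_second
```

In a large dataset with many pairings, identical distances and matching index sums become more likely, which fits with the duplicates only showing up in the bigger groups.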

So my solution now is to use a math formula called the Cantor pairing function: http://stackoverflow.com/questions/919612/mapping-two-integers-to-one-in-a-unique-and-deterministic-way

Basically it allows you to combine two non-negative integers A and B into a unique integer C in a deterministic way, such that no other ordered pair of integers would ever give C besides (A, B). It is described as: C = (1/2)(A + B)(A + B + 1) + B

So I use this pairing function to combine the row and column number of each lineage into a key that should be unique to each pairing, so that when the dataframe is ordered by key, the lineages are always correctly grouped together! [image: pairresultscantor]

Note that the pairing function is sensitive to order: it will only output the same value for both lineages of a pairing if the row and col numbers are given in the same order (you can see they are swapped for each lineage). So to get the same ordering I use min and max, as you can see above.
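
For readers who want to try the idea, here is a minimal Python sketch of the Cantor pairing key with the min/max ordering described above. The actual pipeline code is in R; the function names and the example records below are hypothetical.

```python
def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing function for non-negative integers a and b."""
    if a < 0 or b < 0:
        raise ValueError("cantor_pair expects non-negative integers")
    # (a + b) * (a + b + 1) is always even, so integer division is exact.
    return (a + b) * (a + b + 1) // 2 + b


def pairing_key(row: int, col: int) -> int:
    """Order-independent key: both lineages of a pairing get the same value."""
    return cantor_pair(min(row, col), max(row, col))


# Hypothetical records: two pairings, each lineage listed with swapped row/col.
records = [
    {"lineage": "A1", "row": 3, "col": 7},
    {"lineage": "B1", "row": 2, "col": 9},
    {"lineage": "A2", "row": 7, "col": 3},
    {"lineage": "B2", "row": 9, "col": 2},
]
for rec in records:
    rec["key"] = pairing_key(rec["row"], rec["col"])

# Sorting by key places the two lineages of each pairing next to each other.
grouped = sorted(records, key=lambda rec: rec["key"])
```

Without min and max, `cantor_pair(3, 7)` gives 62 and `cantor_pair(7, 3)` gives 58, so the two lineages of one pairing would get different keys. With min and max, both get 62, and no other row/col combination maps to that value.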

Hopefully that all makes sense lol and I should be able to get back on track with running through the remaining taxa.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for your work on solving this. This is indeed very interesting! I think that you've come up with a creative solution for this issue.

What you explained makes sense to me! I suggest to verify at the end (particularly for the largest datasets) that all pairing keys are indeed unique, to ensure everything is working as it is supposed to work. As well, I suggest that we check a few pairs in the final output to make sure nothing is getting shuffled during these steps.
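
One way such an end-of-run check could be sketched, in Python for illustration (the pipeline is in R). It assumes each pairing contributes exactly two rows, one per lineage, so a healthy key appears exactly twice; the function name and the example keys are hypothetical.

```python
from collections import Counter


def suspicious_keys(keys, expected_count=2):
    """Return every key that does not occur exactly expected_count times.

    An empty result means each key links exactly two lineages.
    """
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n != expected_count}


# Hypothetical key columns from two pairing results tables.
clean = suspicious_keys([62, 62, 75, 75])
duplicated = suspicious_keys([62, 62, 62, 62, 75, 75])
```

Here `clean` is empty, while `duplicated` flags key 62 as shared by four lineages, which is the situation that caused lineages to be swapped between pairings.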

That's great that we should be back on track for soon generating the results for remaining groups. I have checked all of the available alignments (finaltrim alignments), and I agree that all are either very good or acceptable for analysis. I have transferred the results to the results table in the manuscript draft file, for those taxa you indicated were good to go.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

Would you please let Jacqueline know when the revised section of code mentioned above for creating the pairing key is posted to Github? I suggest it would be a good idea for Jacqueline also to check over the revised code. Thank you Matt and Jacqueline.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally and Jacqueline,

The revised section of code for pairing keys is now up on both versions of the script. Lines 964-992 in the small taxa version of the script and lines 1041-1069 on the large taxa version.

Sally - I've gone through and run through both Malacostraca and Mollusca again with the revised code to ensure that pairings are being generated correctly and retained the pairing key in the pairing results so that it can be double checked. You will now notice there are no high relative distances in any of these groups also. Additionally, I now order pairing results by pairing key to ensure that ingroup lineages are correctly grouped together.

This should not have any effect on alignments (since the alignments are performed at an earlier step in the code) however it will have an effect on the final results for each group considering that lineages were not being correctly grouped together in these groups previously.

Anyways I think I've managed to fix the problem and am now working on finishing up the other taxa to be done. I'm thinking that what I will do is try and get all the smaller taxa done first including all of Chordata and then once I'm finished those I will work on revising Leps and finishing the other insect orders. Let me know if you think this is a good way of doing things.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

That sounds great. Thank you again for finding this solution. And, that would be a good idea if Jacqueline would check this new section of code as well.

I think that is a good plan to work first on completing the smaller groups. Once Chordata is finished, I can work on certain sections of the draft that relate to some groups within Chordata, and Jacqueline can also run them with her phylo pipeline for comparison. While we are doing those things, you could be completing the remaining insect orders.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Results are now up for Pycnogonida, Ostracoda, Perciformes and Cypriniformes! Alignments were good for all groups. No significant values but Perciformes did have a near significant wilcoxon pvalue which I thought was interesting. Pairing results for all of these groups also have the pairing key in the results summary to ensure pairings are being generated correctly. Rest of Chordata to follow tomorrow!

Best Regards, Matt

m-orton commented 7 years ago

Alright so Chordata is now up in its entirety. I was able to do the remaining classes in one runthrough: Reptilia, Aves, Elasmobranchii, Mammalia and Amphibia. Fortunately, the alignments all look really good! Reptilia, sorry to say, didn't have any pairings, Amphibia had one, Mammalia and Elasmobranchii had a dozen or so each, but Aves had the vast majority. No significant values to report.

As a side note: the only reason I couldn't include fishes in a single Chordata runthrough was because I had to separate Perciformes and Cypriniformes due to their separate reference sequences.

Also, from now on, I think what I will do is compress the section20 files when I upload to dropbox to keep upload times and file sizes minimal.

Best Regards, Matt

m-orton commented 7 years ago

Spider results are now up as well!

With Araneae, I had to eliminate two bins: ACS2027 which had an 8 bp insertion and ACL2573 which had a 1 bp insertion. I also had to incorporate the Mollusca trimming code after the trimmed alignment since there were a few sequences shorter than reference length - most likely due to primers amplifying a different region. But the alignment looks good now after making these changes. Very good number of pairings for Araneae also!

I also checked the pairing key and found no duplicates which is a good confirmation that the new code is working since Araneae is such a large group.

Also, I made a few small edits to the code today on Github - edited a small part of the pseudoreplicate code, made a few changes to comments and fixed a minor error I encountered with the largetaxa version of the script. The details can be found in the commit comments.

Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

Thank you for these posts and progress. I look forward to going through the alignments and further filling in the Results table within the next 1-2 days. I should have mentioned earlier that I had designated additional reference sequences for a couple of the larger orders of fish and arachnids just in case you wished to run them separately due to their size. That's great we are on the homestretch with the V1 full results.

Best wishes, Sally

m-orton commented 7 years ago

Coleoptera results are now up.

Coleoptera had 4 bins which I manually removed for the alignment. ACZ1158 had a large 22 bp insertion; ACZ1156, ACO5660 and ACM2866 all had small 1 bp insertions. Over 900 pairings though! As with Araneae, I implemented the Mollusca sequence trimming code after the trimmed alignment to remove shorter sequences, since there were a couple in the trimmed alignment.

I think the pvalues are also pretty definitive since we are not getting significance even with such a large group. Also the pairingkey solution I implemented seems to be working well.

Best Regards, Matt

m-orton commented 7 years ago

Posting my timeline for the remaining groups:

Fixing Lepidoptera with pairingkey solution: 1-2 days
Diptera (dividing by region): 2-3 days
Hymenoptera (dividing by region): 2-3 days

That's it, and then we are finished!

sadamowi commented 7 years ago

Hi Matt,

That's great we are on the home stretch!

Just to be sure there was no misunderstanding about the extra REF seq I designated for Arachnida (because Araneae was a large order), did you run the remaining Arachnida, aside from Mesostigmata and Araneae?

Thank you very much.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally, sorry, I haven't run the rest of Arachnida. Could you resend me the ref seq for Arachnida? I couldn't find it on dropbox. I will then get the rest of Arachnida done today.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

I think the Araneae REF sequence could be used. I gave two ref seqs for that class in case you wanted to split up the class, as there were a couple of larger groups. But perhaps it would have run all together. Thank you!

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

I did a runthrough for the rest of Arachnida and the results are now up. I think it was good to separate Araneae since it did end up being a really large group and we already have the full results for Mesostigmata. The Arachnida runthrough went well overall. There were three bins I had to remove: AC05111, AAN9613 and ACK5180. All of these had 1 bp insertions. As with Araneae, I implemented the Mollusca sequence trimming code after the trimmed alignment to remove shorter sequences.

Best Regards, Matt

sadamowi commented 7 years ago

Excellent. Thank you Matt.

m-orton commented 7 years ago

The new Lep AUS results are up now with no duplicates of pairing key. I also decided to fix the 1 bp insertions in the alignments by removing 8 bins: AAD6260, AAE6214, AAE6215, AAD2088, AAF4403, AAE9708, AAI1270 and AAI1271. Alignment is looking great now.

sadamowi commented 7 years ago

Super! I will be going through these new results today and tomorrow.

PS. Just to clarify ... Will there be new lep results for the other geographic regions as well at some point? Or, did the pairing issue only influence AUS?

sadamowi commented 7 years ago

Hi again Matt,

I was wondering if you have run the remaining orders in class Actinopterygii? I see results for the orders Perciformes and Cypriniformes. In this case as well, I had provided two reference sequences for this class because a couple of the orders are large. However, I am happy to have orders be run together that work well. Perhaps all other orders together except those two large ones?

Thank you for checking.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

I have looked through all the new results and added them to the results table in the manuscript draft file. Everything looks good to me!

Best wishes, Sally

m-orton commented 7 years ago

Awesome, thanks Sally. No, I actually haven't run the rest of Actinopterygii yet. I think I misunderstood the reference seq file, since I've only been running the taxa that have a ref sequence for them in that file. Sorry about that, I'll get through the rest of the fish orders today.

sadamowi commented 7 years ago

No worries Matt. I should have explained the chosen ref seqs more clearly. Not a big issue. Thank you for running the rest of that fish class.

Best wishes, Sally

m-orton commented 7 years ago

Rest of fishes are up! Really pairing rich group with around 500 pairings. Had to remove 3 bins with small insertions: ACG9172, ADC2507 and ACS9986 but the alignment is looking good now.

jmay29 commented 7 years ago

Hello!! I just did a check of the pairing keys for a few orders and I did not get any keys that linked to more than 2 (they all appear to be unique)! I checked on both the small and large taxa pipelines.

sadamowi commented 7 years ago

Thank you for checking this new code, Jacqueline.

m-orton commented 7 years ago

Thanks Jacqueline! Just posted Lep SA, results look good, removed AAY0441 due to a 1 bp insertion.

sadamowi commented 7 years ago

Great - on the homestretch!

sadamowi commented 7 years ago

Hi Matt,

This is just a friendly reminder that for Insecta, I only selected reference sequences for the largest orders - the ones that we had discussed would need to be run separately. There are lots of other orders within Insecta. Perhaps those could be run all together? If you need any additional reference sequences, please let me know.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, no problem, I think it should be ok to run the rest of Insecta as one group. Which of the ref sequences for Insecta could I use, could I just use the Lep ref sequence or Coleoptera ref sequence?

sadamowi commented 7 years ago

Thank you Matt. From a theoretical perspective, it would not matter which reference you use, as all of these orders currently having a REF seq are fairly related to one another and also equally phylogenetically distant from the majority of the other insect orders. However, I would advise against choosing Hymenoptera, as rates of molecular evolution are high in some members of that order. You wouldn't want to pick a reference from an insect group with a "weird" rate of molecular evolution. I suggest that Lepidoptera would be a good choice. Primers designed using Lepidoptera sequences work well across a large range of insects. Therefore, I would expect these to have relatively "typical" sequences for insects.

Best wishes, Sally

m-orton commented 7 years ago

Ok sounds good, I'll go with the Lep ref sequence then.

m-orton commented 7 years ago

Lep NA is now up, had to remove 4 bins to get the alignment looking good, AAF0261, AAC0205, AAG1173 all had 1 bp insertions while AAK7666 had a 2 bp insertion.

sadamowi commented 7 years ago

OK great - thanks Matt.

m-orton commented 7 years ago

Lep EUR/AFR is now up so Lep is now officially done! Only had to remove 1 bin AAE5820, this bin had quite a large insertion of 59 bp.

sadamowi commented 7 years ago

hooray!

m-orton commented 7 years ago

Lots of groups posted!

Insecta (not including Lep, Col, Dip and Hym orders) has been posted. There were a number of 1 bp insertions which I removed: ACI7875, ACA7403, ACT9180, ACN6142, ACW8226, ABY1171. There were also a few real indels (or what appeared to be real indels) but alignment looks good.

I also posted Diptera AUS, SA and EURAFR. No bins that I had to remove with these. EURAFR had a few 3 bp indels which I left in and I also had to use the trimming code as well. EURAFR also has significant values which is interesting for a group with over 450 pairings.

Just have to do NA and then on to the last group Hymenoptera.

sadamowi commented 7 years ago

exciting! Great news!

m-orton commented 7 years ago

Alright so Hymenoptera is now finished and up on dropbox.

A few of the regions had 3 bp indels so I left those in the alignments.

SA and AUS for Hymenoptera went smoothly without having to remove any bins. For EURAFR I had to remove 1 bin, ACU9389, for a 1 bp insertion. For NA I had to remove several bins for 1 bp insertions: ACI5609, ACJ5404, ABV2664, ACT0024, ACV9606, ACL7936, ACX6770 and ACC4201. NA also had pvalues of 0.050 and 0.038 for binomial and wilcoxon.

Sally, could you take a look at the alignments for Diptera NA I posted when you get a chance? I assume I should remove the 1 and 2 bp insertions but there is a 9 bp insertion in particular I wanted to get your thoughts on before continuing with this group.

Also, should I be doing a run of the classes that remain for Chordata? There are a few small classes in Chordata we haven't run although they are small enough that they might not yield any pairings.