m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
7 stars 1 forks source link

New Results Thread #42

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Making a thread to post updates on results.

Chilopoda results are now up on dropbox. Alignment looked great, no gaps or issues. Chilopods only had one pairing but it was a really small dataset.

sadamowi commented 7 years ago

Hi Matt,

This is great! I will do my best to look very soon at the alignments you mention.

At the class level, I selected reference sequences for those classes that generated pairs in your preliminary analysis. That list of preliminary pairs is saved as a tab in the reference sequence Excel file. So, I believe we should already be good in terms of running Chordata classes.

Do you agree?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks for checking the alignment.

Agreed, looking at the preliminary pairs I think we should be good with the Chordata classes we have run.

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for reorganizing the files in Dropbox. This is very helpful. I realized I had one file open, but I since moved that file too. We ended up with a lot of runs, and so I am very glad that we decided to go with the two major pipelines (large taxa, small taxa) on github. I believe some of the runs were due to the slight misunderstanding over the REF sequences. For example, potentially in the future (e.g. if we need to rerun things due to reviewer comments), some of the smaller arthropod classes could be run all together and/or perhaps all of Actinopterygii could be run together. Anyhow, we can consider overall how to make things more efficient for any future needed analyses. I will now look at the specific Diptera alignments mentioned above and post again shortly ...

sadamowi commented 7 years ago

Hi again Matt,

I have had a look at Diptera NA. I suggest to remove:

ABY1630, ACB0007, ACB9011, ACC9152, ACD1624, ACD9327, ACG4331, ACI6633, ACJ3130, ACJ8384, ACX9549, ACY9512 - Reason - Total of 12 bp worth of insertions about 130 bp in from the beginning of sequence. As well, I think that these sequences may be the cause of the gaps in other sequences flanking these areas.

These sequences are highly unusual for Diptera. However, this is very interesting. These could be real biological variants. I built a tree for just these sequences, and they are close relatives. Also, I blasted one and it is Cecidomyiidae, a family with an unusual breeding system (haplodiploidy). There does seem to be a very interesting pattern of highly gapped sequenced being associated with unusual breeding systems (as was the case with the unusual mite family that we detected as well). Others have pointed out associations between breeding systems and evolutionary rates (which appears possibly to include rate of indels as well). However, I think exploring this further at this time is beyond the scope of the latitude paper.

I agree with you about leaving out sequences causing 1-2 bp indels (if you have the patience for that). We previously decided to leave in such cases when the errors occurred near the end of the sequence, as the most likely explanation was base calling error as the signal weakened near the end of the sequence. Such cases of occasional indels would not influence the results because of our usage of pairwise deletion of missing data. However, sequences that contain many such cases are more problematic, because that would be a signature of a nuclear pseudogene.

After these deletions, how does the Diptera NA alignment look?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Thats interesting, I didnt know about haplodiploidy. I just got through the alignments for Diptera NA removing the bins you mentioned and also removing several other bins with 1-2 bp indels. I posted the alignments I ran today (adding Mar6 to the title of the alignment). I think the alignment looks much improved now.

Something that might be useful to you and Jacqueline, I think I found an easy to identify bins with small 1-2 bp insertions that are hard to find in large alignments. I open up the alignment in MEGA, then copy the column with the insertion over to a text editor and then use the find function in the text editor to search for nucleotides. Seems to work well.

Best Regards, Matt

m-orton commented 7 years ago

Happy to report Diptera NA is up and all results are now posted, hooray!

Also, I'm going to update shortly the weirdbins excel file with the rest of the problematic bins I have found in Insecta.

sadamowi commented 7 years ago

hooray! And, congratulations on this important milestone for the project!

sadamowi commented 7 years ago

Hi Matt,

For Perciformes, here is the list of pseudoreplicates:

("63", "1") ("67", "2") ("15", "11") ("46", "12") ("42", "17") ("46", "24") ("39", "56", "26") ("45", "29") ("33", "32")

I notice that in one case, pseudoreplicates are grouped together into a trio. In another case, pair 46 appears in two different pseudoreplicate pairs. I wanted to let you know about this in case you think the pseudoreplicate section could need tweaking for the pipeline version.

For the Excel calculations, I am creating a new trio.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

I have noticed that a few pairs have generated very extreme branch length ratios (even one case of -37). In every case that I have seen so far, this related to very closely related BINS (<2% divergence). I confirmed these very genetically similar BINs by looking at the BIN database on BOLD. So, I think the pairing function is working correctly. I knew that BINs could be formed that are closer than 2%, based upon the patterns of genetic variability, but I am a little bit surprised to see multiple cases, including BINS with close to just 1% divergence. I am wondering if perhaps the BIN threshold has been tuned for certain taxonomic groups (to give groupings closer to species). I will check with BOLD so that we have a clear understanding of this issue.

This issue of very small genetic distances being associated with highly variable branch length ratios has been discussed before in the literature. A lot of error in these estimates is expected at very small genetic distances, for example if one lineage has no or few substitutions since the divergence event.

I suggest that we consider omitting pairs with <2% divergence. What do you think? I can check for this with the 5 phyla I am working with now (all but Arthropoda). What do you think?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, yes I do agree with the <2% divergence rule. I think that would be a good idea to implement to avoid cases of really large relative outgroup distances. Would you want me to also implement this for Arthropods?

For the pseudoreplicates, I think for now we will need to check the results for duplicates but hopefully in a V2 I can fix this issue.

Best Regards, Matt

sadamowi commented 7 years ago

OK thanks Matt. I'm glad you agree. Yes, please do implement the rule that the IG distance must be at least 2%. I am checking for this in the 5 other phyla.

Best wishes, Sally

sadamowi commented 7 years ago

If a pair with a very small IG distance falls in a pseudoreplicate, I am omitting the pair with the small IG distance and retaining the other member of the pseudoreplicate grouping.

m-orton commented 7 years ago

Hi Sally, I finished coding the relative branch length, 2% and pseudoreplicate sections. I'm just wondering for the binomial test if I can use the number of positive values as successes from relative branch length as opposed to relative distance? Its just makes things easier from a coding perspective so I dont have to do both relative dist and relative branch length calculations.

I realize the values are different but the signage would be the same right for relative dist vs relative branch length?

m-orton commented 7 years ago

Also, I just tested Perciformes to test the relative branch length code and I'm getting the same values as you so I'll post an excel in Perciformes with my result so you can double check as well.

sadamowi commented 7 years ago

OK great! Thanks Matt. I was also thinking we should do a test to make sure our approaches agree. So, that's good to hear.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

I've had a look at your file and cross-checked a subset of the data. We got exactly the same values - good! It would appear that this file was for checking the branch length ratios, prior to accounting for pseudoreplicates and prior to omitted pairs with low ingroup divergence?

I'd like to suggest a few minor additions:

-I suggest to output the estimated branch lengths for each paired lineage (in addition to the signed branch length ratio). These raw data may be useful for us.

-in the pval dataframe (or another dataframe if easier), I suggest also to export a count of the positives and negatives and the median signed branch length ratio. These values are useful in addition to the p-values.

-I am also further considering two ways of calculating the ratios. The method I am more familiar with is using signed ratios. Another method I have seen in the literature is always going one group over another (e.g. always low-latitude divided by high-latitude estimated branch length). Do you have thoughts on this? I am reading about this further. This would be an easy calculation to add, as the formula is very simple compared to what we've already done. i.e. This would be in addition to the signed ratios, and we would report both. I think these two different methods are also contributing to the variability in effect sizes reported in the literature. What do you think about trying both? We could the median values from both calculations to Table 1, and these values would be helpful for the discussion.

Best wishes, Sally

sadamowi commented 7 years ago

PS. An example of the other type of ratio is Gillman et al. 2009 (e.g. see supplementary table). There are no negative values. "Opposite direction" values would fall between zero and 1. Ratios are always calculated in order of the habitat (e.g. lower latitude over higher latitude), rather than larger over smaller (plus a sign). So, I'm thinking it could be helpful for us to do the ratios both ways. Thoughts?

sadamowi commented 7 years ago

PPS. Another idea is only to do the alternative branch length ratio method for select taxonomic groups (such as fish), to enable direct comparison with other studies in the discussion. I can readily do this just for those select groups if the data are outputted so as to include the branch lengths for each sister (in addition to the signed branch length ratio). This could be a better approach in terms of keeping the manuscript more streamlined. As well, I think the signed ratios are easier to interpret. e.g. 1.05 is 5% higher rate at low latitude and -1.05 is 5% higher rate at high latitude. Thanks for any thoughts on this.

m-orton commented 7 years ago

Hi Sally, the test with Perciformes was before pseudoreplicates and low ingroup divergence just to test I was getting the same result as you (which I did, yay!). I have also incorporated those changes into the code as well and I updated the code on Github.

The minor additions you mentioned should be easy to do so I will incorporate those into the pairingresultssummary dataframe (estimated branch lengths) and pval dataframes (counts and median signed ratios).

I think in regards to using two different versions of the ratio's, if its easy to implement (which it seems like its just a simple calculation) then I think it would be a good idea to report both. I think it would only strengthen our results to do both.

I'm just wondering if we would do two different wilcoxon tests though for each set of ratio's or would it only be the signed ratio's that get inputted to the wilcoxon test? Also, can I just use the number of positive signed branch length ratios as the number of successes for the binomial? (just checking to be sure)

sadamowi commented 7 years ago

Hi Matt,

Thanks for your reply on this and for implementing those suggested edits to output more information for us.

The binomial test would only need to be performed once. For the signed ratios, it would be the number of positives compared to the total count. For the lower/higher latitude branch length ratio, it would be the number of values greater than 1 vs. the total, and this would yield the same result as for the other ratio.

For the Wilcoxon test, the lower/higher latitude branch length ratio would need to be run against a null expectation of mu=1 (rather than mu=0 - zero is for the signed ratios). The results can be somewhat different as the rankings of values can be different.

Note that I still have the manuscript file open. I should be able to pass that back to you later today.

Best wishes, Sally

m-orton commented 7 years ago

Thanks for clarifying Sally, I will make the modifications to the code today.

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally,

Just posted the new results for Actinopterygii, old fish results were moved to old results folder. There were two BINs I thought I should mention with really large insertions of 48 bp and 44 bp: ADA2828, ADC2043 so those were removed, alignment looks good now.

Actinopterygii was run with all of the new code additions which I just finished so you can see how I am now generating results: -2% divergence code -Pairing results summary shows both types of ratios and estimated branch lengths -Modified Sections 16-19 to incorporate the calculation of alternative branch length ratios - lower lat / higher lat -Pseudoreplicate averaging will now average these new branch length ratios in addition to the signed ratios -Wilcoxon tests will now be performed on these new ratios (low / high) with mu = 1. Signed still uses 0 -Pvalue dataframe will include pvalues for both wilcoxon tests, number of positive and negative signed ratios and median values for both signed and low lat / high lat ratios

There are slighlty fewer pairs than before but I am using the 2% code so that reduces number of pairs...and Im getting significant values now! 0.009543 for binomial, 0.003536 for wilcoxon (signed), 0.000281 for wilcoxon (low/high) - also seems to corroborate the fish paper findings (despite the papers flaws we talked about). So I think we made the right decision to switch to relative branch length.

Also I'm calling the new estimated branch lengths (low/high) "altRelativeBranchLength" for simplicity. The branch lengths are identical between methods when the value is positive for signed ratio's but different when negative since the new method cant have negatives. Averaged pseudoreplicate branch lengths are different because of the method we use for averaging signed ratio's (adding or subtracting 1 to avoid decimals between 0 and 1). Wilcoxon results differ I think because of this but median ratios between the two methods also differ slightly.

Lastly, got lazy with making excels from the results so I also added in a write excel command to the code that will write all results directly to a single excel file and the name will reflect the class and date when the analysis was run. Each results dataframe will also appear in a different spreadsheet of the same file. Added this command at the end of the statistics section. This should allow me to rapidly go through and regenerate results for Arthropods!

Best Regards, Matt

m-orton commented 7 years ago

This issue was moved to jmay29/lat-project#4