m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0

Sister vs. Phylo pipeline comparison #43

Closed by sadamowi 7 years ago

sadamowi commented 7 years ago

Hi Jacqueline,

Matt has now completed analysis for the majority of the taxa using the sister pipeline. As we previously discussed, it would be very helpful if you would please run a few groups using your phylo pipeline for comparison.

I'd suggest starting by running Cypriniformes and Perciformes separately. Both are large orders of fish with many barcode records. However, by the end alignment ("finaltrim"), there were 1100-1200 sequences left, so I think the phylogenetic analysis should run in a reasonable amount of time. Running a larger group, such as the entire class, would likely be difficult. These orders are also large enough to be good for comparing the pipelines. If these run successfully, we might then discuss running a couple of other groups (e.g. Echinodermata, which had significant results in the sister pipeline).

If possible, I'd suggest using Matt's end workspaces and the "finaltrim" alignments as your starting point. This would give the same input dataset for comparison. Does that work? These files are in the taxon-specific folders under "Chordata" within the "Results" folder.

For these orders, you could use the reference sequence from the opposite order to root the tree. The reference sequences are in the "Reference sequences" folder.
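As a rough illustration of the rooting step, the sketch below roots a tree on a single outgroup tip with `ape::root()`. The tree is randomly generated and the tip labels (including `Perciformes_ref`) are hypothetical placeholders, not names taken from the "Reference sequences" folder.

```r
library(ape)

set.seed(42)

# Hypothetical tip labels: five ingroup sequences plus one reference
# sequence from the opposite order, used here as the outgroup.
tipLabels <- c(paste0("Cypriniformes_", 1:5), "Perciformes_ref")

# Random unrooted tree, standing in for a tree built from "finaltrim".
tree <- rtree(6, rooted = FALSE, tip.label = tipLabels)

# Root on the outgroup; resolve.root = TRUE yields a bifurcating root.
rootedTree <- root(tree, outgroup = "Perciformes_ref", resolve.root = TRUE)

is.rooted(rootedTree)
```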

Please let me know if you have any comments about these suggested plans (and Matt too). Thank you very much.

Cheers, Sally

jmay29 commented 7 years ago

Hi Sally!

That is great news! I will run these this weekend. So, just to confirm, the PGLS formula I would specify is:

branchLength ~ latitude + numberOfNodes

to control for the node density effect?
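For reference, one way such a model could be fitted is sketched below, assuming `nlme::gls()` with a Brownian-motion correlation structure from `ape::corBrownian()`. The thread does not say which package the phylo pipeline uses, so this is only an assumption. The tree and latitudes are simulated, the `bin` column is a hypothetical tip identifier, and the `form` argument assumes a recent version of ape.

```r
library(ape)   # tree handling and corBrownian()
library(nlme)  # gls()

set.seed(1)

# Simulated, non-ultrametric tree standing in for a real barcode tree.
tree <- rtree(40)
nTips <- Ntip(tree)

dat <- data.frame(
  bin = tree$tip.label,
  # Root-to-tip distance for each tip (tips are numbered 1..nTips).
  branchLength = node.depth.edgelength(tree)[seq_len(nTips)],
  # Nodes on the root-to-tip path (root included, tip excluded).
  numberOfNodes = lengths(nodepath(tree)) - 1,
  # Simulated absolute latitudes; real values would come from the BIN data.
  latitude = runif(nTips, 0, 70)
)

# PGLS with the formula from the comment above; numberOfNodes is the
# covariate that controls for the node density effect.
fit <- gls(
  branchLength ~ latitude + numberOfNodes,
  data = dat,
  correlation = corBrownian(1, phy = tree, form = ~bin)
)

summary(fit)$tTable
```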

Jacqueline

sadamowi commented 7 years ago

Dear Jacqueline,

That would be great - thank you for running these analyses in the near future. If you like, I can also do one or both of the runs on my laptop so we don't tie up your computer on this.

Yes, I agree with you that we should control for the number of nodes. We should also keep in mind that the statistical tests performed by the sister pipeline are different: there, we were seeking pairs differing in latitude. Here, by contrast, we would be treating latitude as a continuous variable. The data in their present form would be: for each BIN, the median of the absolute values of the latitudes of the records within the BIN. For PGLS, I suggest looking at a histogram of this variable. We should consider whether a data transformation is needed to make it suitable for parametric statistical analysis.
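A minimal base-R sketch of that latitude variable and the histogram check follows. The `records` table, its column names, and its values are made up for illustration, and the square-root transformation is only one candidate to compare against the raw values.

```r
# Hypothetical record-level table: one row per barcode record.
records <- data.frame(
  bin = c("BIN:A", "BIN:A", "BIN:A", "BIN:B", "BIN:B", "BIN:C"),
  lat = c(-12.5, -10.0, -14.0, 45.2, 47.8, 3.1)
)

# Absolute latitude per record, then the median within each BIN.
records$absLat <- abs(records$lat)
binLat <- aggregate(absLat ~ bin, data = records, FUN = median)
names(binLat)[names(binLat) == "absLat"] <- "latitude"

# Compare the raw and square-root-transformed distributions side by side.
oldPar <- par(mfrow = c(1, 2))
hist(binLat$latitude, main = "Raw", xlab = "Median absolute latitude")
hist(sqrt(binLat$latitude), main = "Square root", xlab = "sqrt(latitude)")
par(oldPar)
```

With real data, the same two histograms would be drawn over all BINs in the order being analysed.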

Thank you very much.

Best wishes, Sally

m-orton commented 7 years ago

This issue was moved to jmay29/lat-project#3