m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0

Sally's next tasks #30

Closed sadamowi closed 7 years ago

sadamowi commented 7 years ago

Hi Matt,

Here is my list of planned next steps.

HIGHEST PRIORITY:

  1. Select group for cloud test for Arthropoda and generate ref seqs for that group plus Lepidoptera.

  2. Select ref seqs for Arthropoda and Chordata.

NEXT PRIORITIES:

  1. Verify the Wilcoxon test is running correctly, as requested by Matt.

  2. Try Matt's suggestions for resolving plotting issues.

  3. Focus on the draft manuscript next, specifically:

a) complete draft of methods section (be sure to include software version info)
b) type up more detailed analysis information and justification for settings as a supplementary file
c) prepare a results table, including preliminary results (and placeholder spots for those taxa remaining to be run)

AFTER THAT. Sensitivity analyses and exploration to add further justification for choices:

  1. Verify for Echinodermata that pairwise vs. complete deletion makes a minimal difference to the results (i.e., I hypothesize that the molecular rate in the front section, which has substantial missing data in echinoderms, is similar to the molecular rate in the remainder of the barcode region of COI). We have currently settled upon pairwise deletion as our default for most of our phyla.

  2. Explore impact of adding a gamma parameter upon the distance matrices and results.

  3. Explore whether altering the outgroup minimum distance (to 1.4 or 1.5) makes any difference.
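As an illustration of the pairwise vs. complete deletion check in item 1, here is a minimal sketch in Python (not the pipeline's R code). It assumes a simple p-distance and treats `-`, `N`, and `?` as missing data; the function names and the toy alignment are hypothetical and only show how the two deletion modes can diverge when one sequence lacks the front section.

```python
from itertools import combinations

# Characters treated as missing data (an assumption for this sketch).
GAP_CHARS = set("-N?")

def p_distance(a, b, keep=None):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has missing data are skipped.
    If `keep` is given, only those column indices are compared.
    """
    cols = range(len(a)) if keep is None else keep
    compared = differing = 0
    for i in cols:
        if a[i] in GAP_CHARS or b[i] in GAP_CHARS:
            continue
        compared += 1
        differing += a[i] != b[i]
    return differing / compared if compared else float("nan")

def complete_columns(seqs):
    """Indices of alignment columns with no missing data in any sequence."""
    return [i for i in range(len(seqs[0]))
            if all(s[i] not in GAP_CHARS for s in seqs)]

def distance_matrix(seqs, deletion="pairwise"):
    """Pairwise p-distances under 'pairwise' or 'complete' deletion."""
    keep = complete_columns(seqs) if deletion == "complete" else None
    return {(i, j): p_distance(seqs[i], seqs[j], keep)
            for i, j in combinations(range(len(seqs)), 2)}

# Toy alignment: the third sequence is missing its first four sites.
toy = ["ACGTACGTAC",
       "ACGAACGTAC",
       "----ACGTTC"]

pairwise = distance_matrix(toy, "pairwise")
complete = distance_matrix(toy, "complete")
print(pairwise[(0, 1)], complete[(0, 1)])
```

In this toy case the only difference between the first two sequences sits in the front section, so complete deletion discards it while pairwise deletion keeps it. Comparing the two matrices on real alignments would show whether that effect is large enough to matter.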

Please do let me know if I've overlooked anything!

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

I apologize for my delay. I've finally completed task #2 above. The reference sequences are located in the "Reference sequences" folder, in tab #3 of the Excel file and also in the FASTA files.

Given what you have found about the size of datasets that will run, I have mainly selected reference sequences at the class level. However, for very large classes, I selected an additional reference sequence in the case of a very large order (in terms of number of BINs). I selected 4 reference sequences for Insecta, i.e. the 4 largest orders, as you previously requested.

Please let me know if I have missed anything or if you need anything further.

Are you available to run through the rest of Arthropoda and Chordata on the server?

Also, I am wondering if I should run through the three smallest phyla with the revised code? Does that make sense for me to run these on my computer? Or, do you think it makes sense (and are you available) to run those using your settings for complete consistency for obtaining the final results?

Thanks very much for letting me know.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks very much for completing the reference sequences.

I'll wait to see what Jacqueline says about the pipeline and then I'll start running through the other large insect orders. I should be able to complete Arthropoda and Chordata on the server.

In regards to Mesostigmata, should I try running through to the final alignment to see if removal of divergent sequences helps the alignment at all or should I wait for you on this one?

Also, I should mention that I realized my counts for Diptera and Coleoptera were wrong. Diptera is closer to 47000 unique BINs and Coleoptera is closer to 20000 unique BINs. So Diptera will need to be broken down by region as well, but I should be able to get through Coleoptera in one run. These will probably take a few days for me to run.

For running the smaller phyla a final time, I'm tempted to say let me run everything on the server, since the server does have a different version of RStudio and R. As you mentioned, this gives our results better consistency, and we can say that all taxa were run in the exact same environment with the exact same version of R and RStudio. I have a lot of free time this coming week, so I'm OK with running through these phyla again.

Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

Thank you very much for offering to run through all groups that need a go with the final code.

I can have a look at the Mesostigmata alignment in the next 1-2 days and let you know if I can figure out what was going on with that alignment.

OK - That sounds good to break down any insect orders that need it using the same geography you used for leps.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

I am updating my TO DO list for this project. Please let me know if you have any comments or if I've missed something. Tasks 1-3 above are complete, and so this is a revised list.

HIGHEST PRIORITY:

  1. Check "Issues" regularly and reply as promptly as possible to help resolve issues.

  2. Try Matt's suggestions for resolving my plotting issues.

  3. Work on draft manuscript, specifically:

a) complete draft of methods section (be sure to include software version info)
b) type up more detailed analysis information and justification for settings as a supplementary file
c) prepare a results table, including results to date (and placeholder spots for those taxa remaining to be run)
d) literature review (include search for "Santa Rosalia" papers) and complete the introduction for the paper
e) write up results once full results available
f) write up discussion
g) circulate draft to all coauthors

WHILE DRAFT IS BEING REVIEWED BY COAUTHORS:

Sensitivity analyses and exploration to add further justification for choices:

Verify for Echinodermata that pairwise vs. complete deletion makes a minimal difference to the results (i.e., I hypothesize that the molecular rate in the front section, which has substantial missing data in echinoderms, is similar to the molecular rate in the remainder of the barcode region of COI). We have currently settled upon pairwise deletion as our default for most of our phyla.

Explore impact of adding a gamma parameter upon the distance matrices and results.

Using a subset of taxa, explore whether altering the outgroup minimum distance (e.g. to 1.5) makes any difference.

If Jacqueline's code is available, compare sister vs. phylo pipeline results for latitude for fish as an example. Touch base with Jacqueline about this component. For discussion only, likely not for inclusion in the paper. Jacqueline's results could be cited as "in preparation" here for initial submission, and then Jacqueline's thesis cited in the final publication.

Look into whether some taxa may have sufficient data for a secondary marker to re-run the code using another marker. Likely for incorporation into the final version.
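To make the gamma item in the list above concrete, here is a minimal Python sketch of how a gamma shape parameter changes a corrected distance. It assumes the Jukes-Cantor model purely for simplicity; the thread does not state which substitution model the pipeline uses, and the function name and shape value are hypothetical.

```python
import math

def jc69_distance(p, gamma_shape=None):
    """Jukes-Cantor distance from an observed p-distance.

    With gamma_shape=None, rates are assumed equal across sites.
    With a shape value, the gamma-corrected form is used:
        d = (3/4) * a * ((1 - 4p/3) ** (-1/a) - 1)
    """
    x = 1.0 - 4.0 * p / 3.0
    if x <= 0:
        return float("inf")  # saturated: the correction is undefined
    if gamma_shape is None:
        return -0.75 * math.log(x)
    return 0.75 * gamma_shape * (x ** (-1.0 / gamma_shape) - 1.0)

# Compare the correction with and without gamma for a few p-distances.
for p in (0.02, 0.10, 0.20):
    print(p, round(jc69_distance(p), 4), round(jc69_distance(p, 0.5), 4))
```

The gamma-corrected distance is always at least as large as the equal-rates distance, and the gap widens as sequences become more divergent, so larger distances (for example, ingroup-outgroup distances) would be affected more than small ingroup distances.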

Please do let me know if I've overlooked anything!

Cheers, Sally

jmay29 commented 7 years ago

Hi Sally!

This sounds good to me. I should have results for latitude in the phylo pipeline very soon.

sadamowi commented 7 years ago

Hi Jacqueline,

Thanks very much. Sounds good. That will be interesting to do a comparison. For comparison, it would be good to keep as many things the same as possible. For example, perhaps you could use Matt's end workspace for Chordata as your starting point. That would mean that the initial data download would be kept the same and the alignment too. Also, for example, you would want to use the centroid finder rather than your consensus sequence tool. Basically, we'd want as many aspects consistent as possible in order to compare the sister vs. phylo approach.

I think that would be very helpful to cite that result. Some reviewers would prefer to see a formal phylogenetic approach, rather than sisters based upon distances alone. For latitude, it would be interesting to run latitude alone plus also latitude + number of nodes.

After you get the results, we can touch base again about what would work best ... e.g. citing your thesis? If you don't plan to separately publish that sister vs. phylo comparison, another option would be to put that comparison as a supplementary file in the latitude-focused paper. We can touch base again soon about your own thesis and planned papers to discuss what will work best.

Best wishes, Sally

jmay29 commented 7 years ago

Hi Sally! This sounds like a great idea to me!

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

I am updating my TO DO list to help me to stay organized. Please do let me know if I've overlooked anything. Matt prepared a first draft of the methods section, and we continue to work together to refine this by passing the draft back and forth. It is coming along well. I've also completed the literature search and made notes for the intro. I will move on for now, as there are a few results-related issues that are holding up progress. Then, I could write up the intro while Matt works on the final results/figures, for example.

HIGHEST PRIORITY:

  1. Check "Issues" regularly and reply as promptly as possible to help resolve issues.

  2. Implement formula (discussed by email) to estimate relative branch lengths using the ingroup and ingroup-outgroup distances. Consider how pseudoreplicates can be accommodated in this fix.

  3. Work further on methods section (after receiving next draft back from Matt). Include method for estimating relative branch lengths.

  4. View final alignments and complete results table and results prose (perhaps this can be done together). I think we should point out what the results are when retaining only those pairs including a tropical member (for select taxa).

  5. Collaboratively, select final taxon for map (if we think this is suitable).

  6. Complete draft of Intro in full prose.

  7. Work on discussion. (I propose to work on this section collaboratively.)

  8. Full draft circulated to coauthors - target date March 20th

AFTER THAT

CAN BE AFTER SUBMISSION:

Please do let me know if I've overlooked anything!

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

I have implemented the solution for estimating relative branch lengths for Cnidaria, Annelida, and Echinodermata. I added those results in new tabs in the same results Excel files (marked "SJA"). If you'd like to look over how I did this, the main formulae are in 5 new columns (E-I) in the "RelativeBranchLengths" tab. The main results are in the next tab after that, as I also averaged the pseudoreps.
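The formula itself was discussed by email and is not reproduced in this thread. As a hedged illustration only, the sketch below uses the standard three-taxon additive estimate, which derives the two sister branch lengths from the ingroup distance and the two ingroup-outgroup distances. It is written in Python, the function name and example distances are hypothetical, and the calculation in the Excel files may differ.

```python
def sister_branch_lengths(d_ab, d_ao, d_bo):
    """Estimate branch lengths for sisters A and B using outgroup O.

    d_ab: ingroup distance between the two sisters
    d_ao: distance from sister A to the outgroup
    d_bo: distance from sister B to the outgroup

    Assumes additive distances on a three-taxon tree (an assumption of
    this sketch, not necessarily the method discussed by email).
    """
    branch_a = (d_ab + d_ao - d_bo) / 2.0
    branch_b = (d_ab + d_bo - d_ao) / 2.0
    return branch_a, branch_b

# Hypothetical distances: A is slightly farther from the outgroup than B.
branch_a, branch_b = sister_branch_lengths(d_ab=0.10, d_ao=0.32, d_bo=0.30)
print(branch_a, branch_b)
```

The two estimates always sum to the ingroup distance, so the outgroup only determines how that distance is split between the sisters. If the distances are far from additive, one estimate can come out negative, which would need handling before taking ratios.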

As expected, the binomial test results were very similar. In two phyla, the binomial test results were identical. In one case, this procedure resulted in a sign flip in one value near zero after averaging a pseudoreplicate. This changed the positive/negative counts by one. The Wilcoxon results are fairly similar as well. However, I think that now we will have more meaningful median values to compare with the effect sizes reported in the literature.
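The effect of a single sign flip on the binomial test can be sketched as follows. This is a Python illustration with made-up values, not the project's data or R code; it covers only the exact two-sided binomial (sign) test against 0.5 and does not reproduce the Wilcoxon test.

```python
from math import comb

def binomial_sign_test(values):
    """Exact two-sided sign test of signed differences against p = 0.5.

    Returns (n_positive, n_negative, p_value). Zeros are dropped.
    """
    nonzero = [v for v in values if v != 0]
    n = len(nonzero)
    k = sum(v > 0 for v in nonzero)
    # The null distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return k, n - k, min(1.0, 2 * tail)

# Hypothetical signed differences; the fourth value sits near zero.
before = [0.12, -0.05, 0.30, 0.001, 0.08, -0.22, 0.15, 0.04, -0.11, 0.09]
# Same data after one near-zero value changes sign.
after = [0.12, -0.05, 0.30, -0.002, 0.08, -0.22, 0.15, 0.04, -0.11, 0.09]

print(binomial_sign_test(before))
print(binomial_sign_test(after))
```

Because the sign test uses only the direction of each difference, a value near zero carries as much weight as a large one. That is why one flip shifts the counts by one, while a rank-based test such as the Wilcoxon would give that value the lowest rank.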

I can readily implement this solution for Mollusca and Chordata. Of course, Arthropoda is always more tricky, as it is so large, with results spread across many files. So, I'd like to make sure we are happy with our approach before tackling that.

I think we may want to consider going for the pair with the lower ingroup distance in the case of pseudoreplicates. What do you think? (This is not the same as going with a minimum distance to select sequences within BINs.) This would select BINs that are more closely related, and that thus likely share more biological features and shifted between the tropical and other thermal zones more recently. This can reduce noise relative to any signal relating to latitude, and the approach has been discussed in the literature.

I wanted to discuss this before tackling Arthropoda in particular. (This change can be easily made in these small taxa I've been working with today.) As well, I suggest that we consider the same solution if there are pseudoreplicates between geographic regions.

Thoughts on this issue?

As well, I suggest for the Github code version, it would be a good idea to code in the new method for estimating the relative branch lengths. We could run that on a small taxon to compare with my results. What do you think?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks for going through and doing the branch length calculations. That's interesting that the results end up being very similar.

For Arthropoda, since it's so large, I could code in the branch length estimation and get R to do most of the work. I like your idea for the pseudoreplicates. I could also add some extra code that would find the lower ingroup distance for each set of pseudoreplicates, if you like?

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much. Yes, I am a fan of having R do most of the work! I wanted to go through the calculations manually and carefully for a couple of small datasets to make sure that the approach works.

I have been thinking about it more, and I think it would be great if we went with the pair with the smaller ingroup distance for pseudoreplicates (both the regular pseudoreplicates and cases where the same BIN appears in multiple pairs across the geographic regions, for those taxa analyzed by region). Do you agree?

I am going with that approach to addressing the pseudoreplicates for the Mollusca file to see how it goes. I have that Excel file open. If we agree on that, then I'll change that for the three smallest phyla too.

Best wishes, Sally

sadamowi commented 7 years ago

That's funny. I realized that I consider using Excel as "doing calculations manually". It feels that way!

sadamowi commented 7 years ago

PS. Before we make a final decision about the pseudoreplicates, I suggest that we make some sketches about the scenarios in which pseudoreplicates can occur. I think that will help to elucidate which is the best choice for dealing with these. I can work on that later today or tomorrow, but please feel free to do that as well if you have ideas about this. I am leaning towards sticking with our original decision for the "main" pseudoreplicates. So, please don't recode that aspect yet. Please do jump in if you have thoughts on this.

sadamowi commented 7 years ago

Hi Matt,

I invite you to see this file "Figure showing how pseudoreplicates can occur" (PPT), located within the new "Supporting files" folder within the manuscript folder. Do you have any comments or any other scenarios I haven't considered?

Best wishes, Sally

sadamowi commented 7 years ago

PS. I have completed the analysis in Mollusca using both the minimum ingroup distance method and the averaging method to deal with the pseudoreplicates. The results were similar but not identical between these methods. Note that I am looking at the effect sizes, not minor differences among the (highly nonsignificant) p-values. In this case, there were 21 pseudoreplicates out of a total final sample size of 92 (after accounting for the pseudoreplicates). If there are any taxa with a very large proportion of pseudoreplicates, we may wish to repeat the analysis both ways to assess whether there is an impact. Otherwise, I suggest we could stick with the average, unless you have any other perspectives on this issue.
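The two ways of handling pseudoreplicates compared here can be sketched as below. This is a Python illustration with hypothetical records and field names (`shared_bin`, `ingroup_dist`, `signed_diff`); it assumes pseudoreplicates are pairs that share a BIN, as described earlier in the thread, and is not the pipeline's R code.

```python
from collections import defaultdict
from statistics import mean

def resolve_pseudoreplicates(pairs, method="average"):
    """Collapse pairs that share a BIN into one value per BIN.

    method="average": mean of the signed differences in each group.
    method="min_ingroup": keep the pair with the smallest ingroup distance.
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["shared_bin"]].append(pair)

    resolved = {}
    for bin_id, members in groups.items():
        if method == "average":
            resolved[bin_id] = mean(p["signed_diff"] for p in members)
        elif method == "min_ingroup":
            best = min(members, key=lambda p: p["ingroup_dist"])
            resolved[bin_id] = best["signed_diff"]
        else:
            raise ValueError(f"unknown method: {method}")
    return resolved

# Hypothetical records: BIN "A" appears in two pairs, BIN "B" in one.
pairs = [
    {"shared_bin": "A", "ingroup_dist": 0.05, "signed_diff": 0.02},
    {"shared_bin": "A", "ingroup_dist": 0.08, "signed_diff": -0.03},
    {"shared_bin": "B", "ingroup_dist": 0.06, "signed_diff": 0.04},
]

averaged = resolve_pseudoreplicates(pairs, "average")
closest = resolve_pseudoreplicates(pairs, "min_ingroup")
print(averaged, closest)
```

In this toy case the two methods give opposite signs for BIN "A", which is the kind of difference that would matter for the binomial test when pseudoreplicates make up a large share of the sample.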

Best wishes, Sally

sadamowi commented 7 years ago

PPS. I forgot to mention that the way I implemented the branch length calculations involved formulae that were different every OTHER line in Excel. I wanted to point that out in case you are looking at the Excel files as an example of the calculations.

m-orton commented 7 years ago

Hi Sally, thanks for your work on this. In regards to the pseudoreplicates, I think it's easier to stick with the average (just from a coding perspective), but if you think there is an advantage to going with the min ingroup distance, I'm OK with going that route as well.

Today, I'll spend some time coding in the Excel formulas you implemented for the insect groups.

Best Regards, Matt

sadamowi commented 7 years ago

OK sounds great. Thanks Matt. Also, from a theoretical perspective, there were a number of cases where average made more sense.

Best wishes, Sally

jmay29 commented 7 years ago

Hello! Things seem to be going well! Is there anything else you would like me to look at? (Of course, my main goal is to get the phylo pipeline up and working :) )

sadamowi commented 7 years ago

Thanks Jacqueline. After one more round with Matt, I think it would be very helpful if you'd read over the available sections (Methods section, results section, and Table 1) in the manuscript file. I will let you know when those are ready.

Best wishes, Sally

jmay29 commented 7 years ago

Sounds great! Thanks Sally!

m-orton commented 7 years ago

This issue was moved to jmay29/lat-project#7