m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0

Using Cloud Computing with RStudio #26

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Hi Sally,

I think I found another way of running the more computationally demanding taxa in case we can't use the Compute Canada resources. Sort of a backup plan, I guess.

Basically, it involves using the cloud computing resources offered by Amazon Web Services. Specifically, it would use one of their services called Elastic Compute Cloud. I found a useful guide on how to integrate this resource with a browser-based version of RStudio: http://strimas.com/r/rstudio-cloud-1/

I managed to go through the steps and run a free instance of this service. I was able to test some of the script and it seems to work well. The only catch is that using greater amounts of computational power carries a small cost per hour. I would guess something on the order of $1-2 per hour to get the computation we would need for the larger taxa.

Just thought it might be something to consider.

Best Regards, Matt

jmay29 commented 7 years ago

Hello!! Thanks Sally! I will totally use this more often now :) I am still going through the pipeline and checking each object using my fish data (the small subset). No issues so far, but I think my large dataset would require a lot more comp power to run. I might try it on the cloud myself and let you know how it goes!!

m-orton commented 7 years ago

Hi Sally and Jacqueline,

Sally - I was able to do some quick filtering of Lepidoptera with the Lat/Long boundaries you provided.

Also, I checked the boundaries using this site http://boundingbox.klokantech.com/ and they look good.
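For readers following along, here is a minimal sketch of how records could be assigned to regions from lat/long bounding boxes. It is written in Python rather than the pipeline's R, and the box coordinates and the `assign_region` helper are hypothetical placeholders, not the boundaries Sally provided.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Hypothetical boundaries for illustration only; not the values used in this thread.
REGIONS = [
    BoundingBox("Australasia", -50.0, 0.0, 110.0, 180.0),
    BoundingBox("North America", 15.0, 85.0, -170.0, -50.0),
    BoundingBox("South America", -60.0, 15.0, -95.0, -30.0),
]


def assign_region(lat: float, lon: float) -> str:
    """Return the first region whose box contains the point, else 'Rest of world'."""
    for box in REGIONS:
        if box.min_lat <= lat <= box.max_lat and box.min_lon <= lon <= box.max_lon:
            return box.name
    return "Rest of world"


print(assign_region(43.5, -80.2))   # a point in southern Ontario
print(assign_region(48.9, 2.35))    # a point in western Europe
```

Because the first matching box wins, a point sitting exactly on a shared edge goes to whichever region is listed first; a real run would want boxes that do not touch.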

I think I end up with pretty reasonable sizes for each region:

  1. Australasian: 8418 bin seqs
  2. NA: 18247 bin seqs
  3. SA: 10413 bin seqs
  4. Rest of world: 19122 bin seqs

I think if the alignment can handle almost 18000 seqs, it should be able to handle another 1000 or so for the rest of the world. I'm going to double check that I don't have any overlapping bins in differing regions just to be sure, but I think these regions should be good in terms of size to proceed further.
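The overlap check mentioned above could look something like the following sketch (Python, not the pipeline's R). The `bins_in_multiple_regions` helper and the BIN identifiers in the sample data are made up for illustration.

```python
from collections import defaultdict


def bins_in_multiple_regions(records):
    """Map each BIN seen in more than one region to the sorted list of those regions.

    `records` is an iterable of (bin_uri, region) pairs, one pair per sequence.
    """
    regions_by_bin = defaultdict(set)
    for bin_uri, region in records:
        regions_by_bin[bin_uri].add(region)
    return {b: sorted(r) for b, r in regions_by_bin.items() if len(r) > 1}


# Made-up records; the BIN identifiers are placeholders.
records = [
    ("BOLD:AAA0001", "North America"),
    ("BOLD:AAA0001", "Rest of world"),
    ("BOLD:AAA0002", "Australasia"),
    ("BOLD:AAA0002", "Australasia"),
]
overlaps = bins_in_multiple_regions(records)
print(overlaps)
```

An empty result would mean every BIN falls in exactly one region.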

Do you think I can proceed to the next pipeline steps with these regions?

Also, I realized Mesostigmata also had the lat/lon boundaries imposed on it, but I think the alignment can easily handle the full mite dataset, so I'll do the alignment with the full dataset and see what the alignment looks like. I'll also try the final alignment with this dataset after removal of divergent sequences and see if the final alignment looks any better.

Jacqueline - That's great to hear that the pipeline is working for you so far. Let me know if you need help setting things up on the cloud. Also, I think I should mention there is currently a limitation with the pipeline: it can't run single classes at a time. At least two separate classes (or possibly orders) need to be run at a time for it to function properly.

Best Regards, Matt

jmay29 commented 7 years ago

Great! I am looking at the classes Actinopterygii and Sarcopterygii for now. I was wondering, if I come across something really small that I think should be edited/changed (i.e. spelling or changing a single line of code), should I just let you know on here?

m-orton commented 7 years ago

Hi Jacqueline,

Sure, that would be good. Maybe just make a separate issue with the errors you find and note the lines where the errors are, and then I can correct each branch.

Thanks, Matt

jmay29 commented 7 years ago

Sounds good!!!

m-orton commented 7 years ago

Hi Sally,

Just wanted to give a progress update. I've completed the prelim alignment steps for all of the geographical regions you suggested for Lepidoptera and the alignments all look good. I've posted them to the dropbox folder if you want to take a look. I'm now working on generating the pairing results for each region starting with Australasia.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

That's great news that things are going well for leps. In light of these quite large alignments you are creating, I agree with the most recent strategy we discussed that we would only divide up the largest orders for analysis. That info is also helpful for me in completing the reference seq selection for our final groups. Thank you for this update and good news.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

I did some tests with the other large insect orders to see if we would need to divide them. I ended up with 20000 unique bins for Coleoptera, 29000 unique bins for Hymenoptera and 16600 unique bins for Diptera (after filtering steps).

I'm thinking we probably wouldn't need to divide Coleoptera or Diptera, but we may need to divide the Hymenopterans into regions. I think the rest of the insect classes we can probably run in one shot (hopefully).

Luckily I managed to get the script working with single classes/orders, so we don't have to worry about pairing up the larger insect orders with another group. I hope to update all of the branches later today with this change.

Best Regards, Matt

m-orton commented 7 years ago

*Meant to say Insect orders

sadamowi commented 7 years ago

OK sounds great Matt. If we do need to divide some of the other insect orders, but not as much as leps, we could go with fewer regions, depending upon the numbers in each region. Lumping regions is more to our advantage in terms of generating pairs. However, if just one other group needs splitting, we could perhaps go with what we used for leps, for simplicity.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Good news, I've managed to do a full runthrough of Lepidoptera for all regions now.

I've posted all of the results to dropbox. In total, I am getting 2876 pairings! The final trimmed alignments all looked really good and the pairings generated for each region look good (though I haven't had time to fully go through the pairing results yet).

Also, I think I will run through Mesostigmata separately (since I can run single classes now) so that it is not divided by region. Does that sound ok to you?

Let me know your thoughts on what to work on next.

Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

That's great news! Solving this challenge is a real milestone for this project, especially given the huge size of that taxon - well done!!

The only lingering concern that I have for leps is that it's possible some phylogenetic pseudoreplicates could be missed. While I would expect some regions of the world NOT to exhibit this problem or only extremely minimally so (e.g. North America vs. Australia), some other regions could contain phylogenetic pseudoreplicates (e.g. North America vs. Europe).

I think that the simplest solution for this would be to briefly acknowledge this issue ... i.e. that some phylogenetic pseudoreplicates could have gotten through. However, we would also point out that these represent geographic replicates that have likely been evolutionarily isolated for some time and thus contribute to the project.

I think that other solutions could be considered for the future, such as using pairwise alignments (or another solution) for the distance calculations such that more taxa can be included in a single analysis step. In the manuscript, we might point out such cases of future improvements that could be made to the pipeline to enable analysis to scale up even further as barcode data continue to accumulate.

What do you think?

Given this consideration, I suggest that we only subdivide by geographic region when essential (i.e. alignment won't finish/crashes in large taxa), as we already discussed.

Yes, that sounds like a good idea to run through Mesostigmata separately. However, I haven't yet deeply looked into the alignment issues you previously detected for that taxon. So, you could hold up, if you wish, until I've done that task.

I think that the next step after that would be to hear back from Jacqueline about her exploration of the behaviour of the code.

Jacqueline - Any issues detected? Again, the question isn't whether or not there are any possible improvements that can be made to the code or how fast it runs. The question is: is the code doing what we think we are telling it to do? Are the results we are generating correct?

Matt - I think, then, hopefully we would be ready for a final run-through of any taxa that need to be (re)run using the revised code and a consistent version of R (or the server, as you've been doing). I could help to check alignments a final time for all taxa. I am aware for this step we are still missing some REF seqs (Arthropoda and Chordata), and that is my highest priority next task.

Best wishes, Sally

sadamowi commented 7 years ago

PS. I was so interested I had to look at the results as these are our first results with a really huge number of pairs. I looked at all p-values and also the more detailed results for North American leps. It was amazing for NA leps how close the relative outgroup distances were to 1. That is very interesting. I am not at all disappointed not to get a significant p-value. I think these results are telling us that there is not a systematic bias in rates with latitude. That is a really interesting finding. We will have a lot to talk about in the discussion. I just wanted to post this quick note in case you felt disappointed at not seeing the originally hypothesized results.

m-orton commented 7 years ago

Hi Sally,

Thanks, it's great to get Leps out of the way. All other groups should be easy after that.

The p-values did surprise me since I was sort of expecting significant values, but I think this result is also really interesting. I also thought it was interesting that the Australia region had a much smaller p-value (though still not significant) than the other regions.

In regards to dividing by region, the only other group where I think it will be necessary is Hymenoptera (29000 unique bins). Past 20000 bins or so, it becomes challenging even with the cloud computing resources. I'm wondering if it may be possible to split into two regions to save time on doing multiple runthroughs or if we should stick with the four regions?

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

I think that it is best to divide taxa into as few regions as possible. This would maximize our recovery of legitimate pairs and reduce the number of phylogenetic pseudoreplicates generated in different regions.

I'm not sure exactly what is the best course. North America and Eurasia can have close phylogenetic relationships, especially as dispersal has occurred in many taxa across northern regions of the Holarctic. (For example, North America and Asia were recently joined together by the Bering Land Bridge at a time of lower sea level. For flying animals, there are various dispersal routes around the Holarctic.)

If there are not too many BINs to run, you might go with:

  1. Australasia
  2. South America
  3. rest of world (especially North America with Eurasia, and I'd imagine Africa wouldn't have too many records.)

If you would like any further input, depending upon the number of BINs per region, please do let me know. Does this suggestion make sense?

Cheers, Sally

jmay29 commented 7 years ago

No issues so far with the fish subset, but I am running it over once more right now and only looking at bony fish. I'll let you guys know how it goes and I will post the results to DropBox in a bit.

m-orton commented 7 years ago

Hi Sally, I'll get an estimate of bins for each of the regions you mentioned and get back to you on this. But I think those regions should be good to run.

m-orton commented 7 years ago

Hi Sally, I think we might have to divide into four regions for Hymenoptera. The combined NA/Eurasia/Africa region is still very large (about 24000 bins), while the other regions are much smaller. If we separate NA, then the divide would be roughly 15000 for NA, 9000 for Eurasia/Afr, 2000 AUS, 2000 SA. Would this be ok?
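As a quick sanity check on the two options, the sketch below (Python, not the pipeline's R) compares each proposed split against the rough 20000-bin limit mentioned earlier in the thread. The counts are the rounded figures quoted above; the `over_ceiling` helper is a hypothetical name, and the ceiling is an approximate working limit rather than a hard rule.

```python
CEILING = 20000  # approximate practical limit noted earlier in the thread

# Rounded bin counts as quoted in the comment above.
three_regions = {
    "NA/Eurasia/Africa": 24000,
    "Australasia": 2000,
    "South America": 2000,
}
four_regions = {
    "North America": 15000,
    "Eurasia/Africa": 9000,
    "Australasia": 2000,
    "South America": 2000,
}


def over_ceiling(split, ceiling=CEILING):
    """Return the names of regions whose bin count exceeds the ceiling."""
    return [name for name, n_bins in split.items() if n_bins > ceiling]


print(over_ceiling(three_regions))
print(over_ceiling(four_regions))
```

With these figures only the combined NA/Eurasia/Africa region exceeds the limit, which is why separating North America brings every region under it.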

sadamowi commented 7 years ago

Hi Matt,

Thanks very much for these counts. Given this info, I think it makes sense to run the Hymenoptera in the same way as Lepidoptera. That would also streamline the manuscript if there is just one description of the regions, applying to both taxa. I am also thinking about this further. Even though some phylogenetic pseudoreplicates could be generated, the more common dispersal routes between North American and Eurasian insects would be through northern regions. Therefore, each region would most likely represent independent cases of north-south dispersal. I will think about this further, but I think this is justifiable. But we'll have to be sure this is clearly phrased for the reviewers.

Best wishes, Sally

m-orton commented 7 years ago

Ok sounds good Sally. I think we could also mention the computational limitations involved with the large insect orders and explain we had no choice but to divide things up into regions as well.

sadamowi commented 7 years ago

Hi Jacqueline,

As you will see from the "Sally's tasks" thread, I have completed reference sequence selection for the final groups.

Would you please confirm that you have completed your checking and run-through of the code?

When we spoke last, you mentioned that you will have just some minor recommendations: e.g. formatting and also typos in the commenting. Does that remain the case? Do you think we are good to go for generating the final results?

Thanks very much.

Best wishes, Sally

jmay29 commented 7 years ago

Hi Sally! I am just getting together my small edits and I'll post them in a bit (just making sure I didn't miss anything). I also came across a small error in the code that I haven't yet figured out. But I should be done by tomorrow :)

sadamowi commented 7 years ago

OK - sounds good. Thank you Jacqueline.

jmay29 commented 7 years ago

Hi Sally and Matt, I just ran through the entire pipeline and most everything ran smoothly except I got a few errors towards the end with the graphs (posted this in separate issue). But things are looking great!

m-orton commented 7 years ago

Awesome to hear the pipeline is running well! Besides the errors you mentioned, do you think I'm good for doing the final runthroughs of each phylum?

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

That's great to hear that everything is progressing well towards completing the final run.

Jacqueline - Do you agree that Matt is good to go?

Matt - In light of recent edits to the code, I'd just like to confirm that you are downloading and saving a full dataset for each taxon and secondarily filtering for the marker after download from BOLD? Thanks very much for letting me know.

I'd like to have a look at these initial downloads after you are finished. I'd like to consider running a secondary marker for those taxonomic groups that have substantial representation by a secondary marker. I would do that after you complete your run for COI. After recently receiving reviews on another paper, I think reviewers would like to see the results validated with another marker, if that's feasible. It may not be feasible, but I wanted to check. Thank you.
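To illustrate the "download everything, then filter by marker" approach and the tally of secondary markers, here is a minimal Python sketch (the pipeline itself is in R). The `markercode` field name and the `COI-5P` value are assumptions about how the BOLD export labels markers, and the records are made up.

```python
from collections import Counter


def marker_counts(records):
    """Tally how many records carry each marker code, skipping blanks."""
    return Counter(r["markercode"] for r in records if r.get("markercode"))


def filter_by_marker(records, marker="COI-5P"):
    """Keep only the records for one marker; the full download stays untouched."""
    return [r for r in records if r.get("markercode") == marker]


# Made-up records standing in for a full, unfiltered download.
records = [
    {"processid": "EX001", "markercode": "COI-5P", "nucleotides": "ACGTACGT"},
    {"processid": "EX002", "markercode": "28S", "nucleotides": "GGCATTAC"},
    {"processid": "EX003", "markercode": "COI-5P", "nucleotides": "ACGTTCGT"},
    {"processid": "EX004", "markercode": "", "nucleotides": ""},
]

counts = marker_counts(records)
coi_only = filter_by_marker(records)
print(counts.most_common())
print(len(coi_only))
```

Keeping the unfiltered download and filtering afterwards means the same saved data can later be re-filtered for a second marker.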

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

I have saved initial workspaces for Leps, Mollusca and Annelida (before filtering by COI), but I don't have Cnidaria or Echinoderms since I think you did the most recent runthroughs of these.

I'll post the workspaces I have to dropbox. I'll just make a separate folder for initial workspaces.

Btw are both of you ok on space for your dropbox? The workspace files are generally pretty large so I don't want to overload your dropbox.

sadamowi commented 7 years ago

Dear Matt,

Thank you very much for confirming that you are downloading all available data, not only COI in the initial download. At the time I was last doing some runs, I wasn't yet saving initial workspaces. So, as you run through the phyla with the final code, I'd appreciate that very much if you'd save initial workspaces. That way, I can have a look at those, using the exact same download as we are using for the final COI results. Thank you very much.

Personally, I am fine for Dropbox space. Thank you for checking. If Jacqueline is tight for space, we could create a separate folder for just the two of us for the large files, because as I recall you purchased a pro account too.

Best wishes, Sally

m-orton commented 7 years ago

Ok no problem, I'll post the initial workspaces as I do the runthroughs with the final code.

sadamowi commented 7 years ago

OK - thanks very much Matt. I think that would be great to have that consistency, and this will also make it more efficient for me to look through those files.

Cheers, Sally

jmay29 commented 7 years ago

I got a Pro account, too! I figured I could use the extra space as well. And, I would say Matt is good to go! :)

Which secondary marker were you thinking of, Sally?

sadamowi commented 7 years ago

OK great Jacqueline!

We are just discussing a few issues of "weird" errors that our filters to date haven't caught. Most groups should be good to go.

However, it will be important for us to continue to check the alignments through to the end. I suggest that you also read these threads and check for such errors in your alignments for your MSc work.

Cheers,

Sally

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml


From: Jacqueline notifications@github.com Sent: Monday, January 23, 2017 2:18:25 PM To: m-orton/R-Scripts Cc: Sarah Adamowicz; Comment Subject: Re: [m-orton/R-Scripts] Using Cloud Computing with RStudio (#26)


jmay29 commented 7 years ago

Hi Sally - There is a weird sequence in my Actinopterygii alignment too (I posted the alignment in our shared folder with Zeny). It's BOLD:ADC1808:

http://www.boldsystems.org/index.php/Public_BarcodeCluster?clusteruri=BOLD:ADC1808

There are two different orders in this BIN!

jmay29 commented 7 years ago

Interesting though, this BIN was only included after I removed the filters to only include BINs with more than 1 sequence and to only include BINs that have at least two sequences that bear the same species-level identification.
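For reference, the two filters described above could be sketched as follows in Python (not the pipeline's R). The `filter_bins` helper, its parameter names, and the sample records are hypothetical; the species names are placeholders.

```python
from collections import Counter, defaultdict


def filter_bins(records, min_seqs=2, min_same_species=2):
    """Return the set of BINs that pass both filters.

    `records` is an iterable of (bin_uri, species_name) pairs, where
    species_name is None or "" when no species-level identification exists.
    """
    species_by_bin = defaultdict(list)
    for bin_uri, species in records:
        species_by_bin[bin_uri].append(species)

    kept = set()
    for bin_uri, species_list in species_by_bin.items():
        # Filter 1: the BIN must contain more than one sequence.
        if len(species_list) < min_seqs:
            continue
        # Filter 2: at least two sequences must share a species-level name.
        named = Counter(s for s in species_list if s)
        if named and max(named.values()) >= min_same_species:
            kept.add(bin_uri)
    return kept


# Made-up records; BIN identifiers and species names are placeholders.
records = [
    ("BOLD:AAA0001", "Genus alpha"),
    ("BOLD:AAA0001", "Genus alpha"),
    ("BOLD:AAA0002", "Genus beta"),
    ("BOLD:AAA0003", "Genus gamma"),
    ("BOLD:AAA0003", "Genus delta"),
    ("BOLD:AAA0004", None),
    ("BOLD:AAA0004", None),
]
kept = filter_bins(records)
print(sorted(kept))
```

A two-record BIN whose records carry different identifications, like the odd one described here, would fail the second filter.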

sadamowi commented 7 years ago

Hi Jacqueline,

That's great to hear that you think Matt is good to go and you didn't find any major problems.

That BIN that you mentioned (BOLD:ADC1808) is indeed very odd. It looks even more wrong in the amino acid view. I looked at the records for that BIN. There are only two records, and there are no trace files available for verification of the text sequence. I suggest deleting that BIN. I think it remains helpful for us to continue to visually check the alignments. As you further develop your R tools, something that you could consider is how to screen for these cases informatically, if you have time. For example, which sequences are vastly different from others in terms of their numbers of gaps after alignment? (That's just one thought; there could be various ways to tackle that.)
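One possible way to screen on gap counts is sketched below in Python (an R version would follow the same logic). It flags sequences whose gap count is an outlier by a modified z-score based on the median absolute deviation; this is just one option, not something the pipeline is known to do, and the `flag_gappy_sequences` name, the 3.5 cutoff, and the toy alignment are all assumptions.

```python
from statistics import median


def flag_gappy_sequences(alignment, threshold=3.5):
    """Return sorted names of sequences whose gap count is an outlier.

    `alignment` maps sequence name -> aligned sequence, with '-' for gaps.
    Outliers are found with a modified z-score built on the median absolute
    deviation (MAD). If the MAD is zero (most sequences share one gap count),
    any sequence that differs from the median is flagged for manual review.
    """
    gap_counts = {name: seq.count("-") for name, seq in alignment.items()}
    med = median(gap_counts.values())
    mad = median(abs(c - med) for c in gap_counts.values())
    if mad == 0:
        return sorted(name for name, c in gap_counts.items() if c != med)
    return sorted(
        name
        for name, c in gap_counts.items()
        if 0.6745 * abs(c - med) / mad > threshold
    )


# Toy alignment: a large insertion in one sequence forces gaps into the others.
alignment = {
    "seq1": "ACGT----ACGT",
    "seq2": "ACGT----ACGT",
    "seq3": "ACGT----ACGT",
    "rogue": "ACGTTTTTACGT",
}
flagged = flag_gappy_sequences(alignment)
print(flagged)
```

Note that a sequence carrying a large insertion shows up with unusually few gaps rather than many, so the check looks at deviation in both directions. Flagged sequences are candidates for visual inspection, not automatic deletion.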

In terms of your question about a secondary marker, I was going to first see which markers are common. I was thinking maybe 28S, as that is nuclear and is commonly used as a secondary marker for some taxa. But I'd like to have a look at the data first to think about whether it is feasible to run such a test on another marker.

Cheers, Sally

sadamowi commented 7 years ago

PS. That fish BIN (BOLD:ADC1808) has multiple insertions compared to everything else in the class. One of the insertions is huge. Also, there are some matches between parts of this sequence and some plant COI sequences. Again, with no trace files, I think this one should be deleted. I also added this to an email I sent to the BOLD team to check into and potentially flag some of these really odd sequences. (If a sequence is flagged, it wouldn't get used in the ID engine.) Thanks for pointing this out.

jmay29 commented 7 years ago

Hi Sally,

Yes, that was definitely a weird sequence hahaha. I hope I don't find too many more of those. And that is a great idea to screen for cases like that. I'll try some things out and see if I can catch any rogue, gappy sequences.

sadamowi commented 7 years ago

Hi Jacqueline,

That sounds great, and I see you have started discussion of that in a new issue. I think we can close this issue as there are no outstanding issues. I'd still like to check into the prospects of using a second marker, at some point, but will add that to my tasks rather than have that buried in this long thread.

Last thing - Matt - here again is the BIN number for the weird sequence to delete from fish (phylum Chordata): BOLD:ADC1808.

Best wishes, Sally