m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large-scale phylogenetic comparisons of various taxa found on the Barcode of Life Database (BOLD) via its API.
GNU General Public License v3.0

Using Cloud Computing with RStudio #26

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Hi Sally,

I think I found another way of doing the more computationally demanding taxa in case we can't use the Compute Canada resources. Sort of a back-up plan, I guess.

Basically it involves using the cloud computing resources offered by Amazon Web Services, specifically their Elastic Compute Cloud (EC2) service. I found a useful guide on how to integrate this resource with a web-browser-based version of RStudio: http://strimas.com/r/rstudio-cloud-1/

I managed to go through the steps and run a free instance of this service. I was able to test some of the script and it seems to work well. The only catch is that using greater amounts of computational power comes at a small cost per hour; I would guess something on the order of $1-2 per hour for the computation we would need for the larger taxa.

Just thought it might be something to consider.

Best Regards, Matt

m-orton commented 7 years ago

Oh, another thing: the web-browser version of RStudio we would have to use is version 0.99.896, running R 3.3.0, so we wouldn't be able to run the most recent versions of R or RStudio. However, the amount of computation you can get is really impressive. You can look here at the types of instances offered: https://aws.amazon.com/ec2/instance-types/

Also, the web-browser-based version of RStudio has Dropbox integration, so we could potentially output all FASTA and CSV files directly to Dropbox, which would be cool.

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally,

Just managed to finish running a small subset of the Annelida TSV using EC2 on Amazon Web Services. I had to make a few slight changes to the code to get everything to work, but it seems like I can get all the code components running smoothly, including the plot and the plotly map. The only difference was that I had to install the stats package as well to run the binomial and Wilcoxon tests. I'm running a free instance (t2.micro) that doesn't have much computational power, so I couldn't run all of Annelida.

You can check out the instance I'm running here since I thought you might want to take a look: http://ec2-35-163-115-55.us-west-2.compute.amazonaws.com/

You will have to log in with username: rstudio password: Cinnamon13*&

It should then load the script, and in the bottom-right corner you should see files for all of the alignments I have done, plus the Annelida workspace I generated, which you can load if you like. I have installed all of the packages, so you could also test by running through the code if you want. It should only take a couple of minutes since it's such a small dataset.

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally, just wondering if you got a chance to take a look at the cloud computing instance I'm running, and what your thoughts on it are?

I think potentially with a much faster instance it would be possible to rapidly go through larger taxa once we have the scripts finalized.

Best regards, Matt

sadamowi commented 7 years ago

Hi Matt,

I think this looks really great. I read over the above and logged in and had a quick look. I don't think I need to run this, separate from you. In the interests of time, I suggest it makes sense for me to stay focused on resolving the last few settings decisions, which we can hopefully finalize in the next day or two.

Are you able to confirm that you've run a small test dataset both on your own computer and on the cloud and confirmed that you obtained the same results? If so, then that's great. I am happy to go with this option to run the last two large phyla, especially Arthropoda.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks for taking a look.

Good news: just confirmed the results are the same. The pairing results appear to be the same, and the p-values as well. I made sure to use the same initial workspace for both the offline and online runs.

However, I think it's possible there might be some minor differences in larger taxa, since the online R version is different from my offline R version. Online runs RStudio 0.99.896 with R 3.3.0 "Supposedly Educational", whereas we are running RStudio 1.0.44 with R 3.2.4 "Very Secure Dishes".

Best, Matt

m-orton commented 7 years ago

Also, I'm going to add the stats package to each branch, since the online version seems to require that it be included.

Matt

sadamowi commented 7 years ago

Hi Matt,

That's great news that you are finding that the cloud version yields the same results as your computer. Thank you for confirming this!

I presume that the small datasets will run very fast on the cloud? Perhaps at the end, we can run through everything there, so that everything is run the same way, same versions, etc., for the very final results? That's just one idea to address this.

I got exactly the same results with two very recent versions of R, so hopefully there would be minimal differences. I am not eager to change the versions I am running right now, as I want to keep everything else the same while varying the maxiters and trimming settings, i.e., only vary one parameter at a time. I think those results would translate to a different version just fine. I could update later if we think that's the way to go.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

The smaller phyla should be very quick with a good instance in the cloud. I think for the final results we could rapidly go through each of the smaller phyla on the cloud to get our final results.

Actually, the online RStudio seems to be better in some respects: the plotly map actually appears in the viewer within RStudio, and the plotting command works as well. Another thing to point out is that the online RStudio is actually running on Linux.

In regards to testing a large insect order this week, I think Lepidoptera would be the best test, since it will require the largest amount of computational power. I'll have to do some research on which instance to use, but I think we are going to have to go with an instance with a lot of memory.

Right now I'm thinking the m4.10xlarge instance (160 GB RAM). It seems like a lot, but Lepidoptera alone has roughly 87,000 unique BINs (after the filtering and centroid-finder steps), and I estimate that every 1,000 BINs processed requires roughly 1.5 GB of RAM, which works out to roughly 130.5 GB of total memory. This doesn't account for filtering out BINs that lack the COI-5P marker, however, so the number of unique BINs may be less than that 87,000 figure.
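Just to make the arithmetic explicit, here is the back-of-envelope estimate as a sketch (the 1.5 GB per 1,000 BINs figure is my rough observation, not a benchmark):

```r
# Rough RAM estimate for a run, assuming ~1.5 GB per 1,000 BINs
# (an observed approximation, not a guarantee).
estimateRamGb <- function(numBins, gbPerThousandBins = 1.5) {
  numBins / 1000 * gbPerThousandBins
}

estimateRamGb(87000)  # 130.5 GB, motivating the 160 GB m4.10xlarge
```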

The cost per hour of the m4.10xlarge instance is $2.155 US, so even if it took several hours, I don't think we would be looking at anything too expensive.

EC2 instance types: https://aws.amazon.com/ec2/instance-types/ EC2 pricing: https://aws.amazon.com/ec2/pricing/on-demand/

Another thing to consider is how the MUSCLE alignment will hold up with such a large number of sequences. Doing a quick literature search, I couldn't find anything on large-scale DNA alignments with MUSCLE, so it seems what we are doing is somewhat unprecedented. I guess we will have to test and find out! I know from large-scale alignments of amino acids that alignment accuracy does eventually start to suffer with increasing numbers of sequences, but I don't know if that holds true for DNA as well; I would assume that it does.

Let me know if this all makes sense; I just started learning about cloud computing, so I could be over- or underestimating the computational power we would need.

Best Regards, Matt

m-orton commented 7 years ago

Actually, looking into the instances further, the r4.8xlarge may be better in price/performance and is optimized for memory-intensive applications, which seems to suit our needs better. It offers 244 GB at $2.144 per hour but has slightly less CPU power than the m4.10xlarge instance.

Best regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for researching the best option for our initial cloud trial with a larger dataset.

Your suggestions sound good to me!

I would expect alignment2 to be the sticking point. Although a large group will have a few very large BINs, overall I'd expect the computational needs to scale roughly linearly with the number of BINs... just chugging through! alignment2 will be our real test, followed by the final alignment, which should be easier.

If we don't get a reasonable alignment, then we would have a few options: we could do just pairwise alignment and pairwise distance calculations prior to the ejection of highly divergent sequences, and then go for a global alignment for the final alignment, if possible.

That is just one idea. Hopefully this will work! :-)

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

I also have a question in response to one of your above points. You mention updating the various branches of the code. I'd like to request confirmation that you used the revised Annelida code as the basis for updating the other branches? We made a lot of changes recently to that branch, both to the code itself and numerous edits to the commenting. Thanks very much for confirming!

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

I updated the Mollusca branch after most of the major changes to Annelida were made, including all of the major changes to the comments and the divergent-sequence removal. Right now I have maxiters set to 3 for alignment2 and the final alignment in this branch. I only have diags set to TRUE for the centroid alignment.

The Cnidaria branch I updated a few days ago by copying and pasting from the Annelida branch. I then added in the reference sequences you sent me for this branch. You may have to change the alignment settings once you determine the best settings for this branch; right now I have it set to maxiters = 3 and diags = TRUE for each alignment.

Since then, I've been careful to update all branches when I add new code components (for instance, the COI filtering and the jsonlite and stats packages).

The one branch I have not updated yet is the Echinoderm branch, because I wasn't sure what alignment settings to add for it yet.

Also, for running Lepidoptera on the cloud, I just realized I'll have to make a separate branch for order-level analysis as well.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Ok great! Thank you for verifying that you updated the Mollusca and Cnidaria branches to match Annelida. I'm glad to hear we don't need to repeat all the edits to the commenting.

I ran Cnidaria several times yesterday, and it worked well (except that I set a heavier gapopen penalty, as discussed in the alignment branch). I am running Annelida again now (just with a different gapopen penalty, -3000, as a test).

If you are able to update Echinodermata too, to match the other class-level branches, that would be great. I'd like to run that one too. You could use maxiters=3 and gapopen=-3000 for now, based upon my findings for Cnidaria. We might need to adjust those, but I'd like to see how those settings go, and I think those are a reasonable starting point for echinoderms. Thank you!

For the Lepidoptera run, is it necessary to create a different branch? Or could that be attempted with the class-level code for now, just inputting one order as the test? There would be just one class in the input dataset; would that be fine?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

No problem, I'll update Echinoderms with those alignment settings.

For Lepidoptera, I think it would be OK to run with the class-level script. I would just have to specify Lepidoptera in the TSV download and then specify Insecta as the class being run to get it to work, I think.

Best Regards, Matt

m-orton commented 7 years ago

All set to run Echinoderms! Just updated the Echinoderm branch with the changes.

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally,

In regards to running Lepidoptera, I'm currently making some changes to the class-level script to run single orders or classes. I think we will need a separate version of the script when running only one class/order at a time.

I did some testing with Annelida to see if the script could run one class at a time. I tried running only Polychaeta and ran into some issues with some of the sapply and foreach statements, since these were designed to iterate through multiple classes/orders at a time. I noticed that with the sapply statements especially, running a single class makes these commands function incorrectly. For instance, the sapply statements I use to extract sequences (around line 550) are supposed to group sequences in a vector format; however, with one class they strangely group sequences into a matrix format instead.
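For reference, this matrix-vs-vector behaviour can be reproduced with a generic example (this is not the actual pipeline code, just base R's sapply simplification rules): with results of equal length, sapply() simplifies to a matrix, while simplify = FALSE preserves the list structure regardless of how many classes are present.

```r
# Generic illustration of the single-class sapply issue described above.
seqsByClass <- list(
  Polychaeta = c("ACGT", "AGGT", "ACGA")
)

# Equal-length results from a single class get simplified to a matrix.
simplified <- sapply(seqsByClass, function(x) substr(x, 1, 2))

# simplify = FALSE keeps a list, whether there is one class or many.
keptAsList <- sapply(seqsByClass, function(x) substr(x, 1, 2),
                     simplify = FALSE)

is.matrix(simplified)  # TRUE
is.list(keptAsList)    # TRUE
```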

I think it should be relatively easy to modify the code, but I'm just letting you know about this issue.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thanks for letting me know!

I am wondering if there is an easier workaround rather than generating a new branch just for the test?

For example, we could pick two orders or families that fall in different classes?

Anyhow, I am just hoping your work on this can be minimized at this late stage!

Cheers, Sally

m-orton commented 7 years ago

You're right, that would be easier. Maybe I'll just pick a small order so it doesn't use a lot of computation and run it together with Lepidoptera?

Thanks, Matt

sadamowi commented 7 years ago

Yes, that sounds good to me!

sadamowi commented 7 years ago

Hi Matt,

Which other order would you like to run? Or, would you like me to find one? Thanks for letting me know. That way, I can prioritize the selection of reference sequences for Arthropoda so as not to hold up this test.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally, maybe it would be good to do an order from a class we haven't done, so we can get results for a second order at the same time. I think it should be a relatively small order, though, maybe another order from Arthropoda so we can work out any potential issues with arthropods. I think it might be best if you pick the order, though.

Best Regards, Matt

sadamowi commented 7 years ago

OK! I will look into this. Thanks for your thoughts.

sadamowi commented 7 years ago

Hi Matt,

My suggestion for the second group would be the order Mesostigmata, which is in class Arachnida. While not a tiny group in terms of total number of barcode sequences (~18K), it has a modest number of total BINs on BOLD (1,420). It yielded just 15 pairs in your preliminary analysis of numbers of pairs. I also selected this group because it is fairly diverse in its amino acid composition (based upon Young and Hebert 2015, PLoS One). So, this would be a test of the performance of our settings in a suitable arachnid group. It likely has just a few pairs relative to the total number of sequences, as it is poorly studied from a barcoding perspective, especially in the tropics.

Do you think that is a suitably sized group? I am looking into reference sequences.

Cheers, Sally

m-orton commented 7 years ago

Sounds great!

I'm still doing some research on cloud computing with EC2, but I'm hoping to do a test run either tonight or tomorrow during the day.

Thanks, Matt

sadamowi commented 7 years ago

Hi Matt,

OK. Thank you for doing this test run.

I will send you the reference sequences through email. I do suggest reviewing the discussion thread from the alignment-settings section prior to running this - thank you.

Fingers crossed!

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

Dare I ask how the cloud analysis for the Lepidoptera is going?

I also wanted to check whether it might be possible to run Chordata on the cloud? I would expect that to run reasonably well, even using the class-level pipeline, as the data set is only modestly larger than the Mollusca data set. If that's possible, that would help me a lot with working on the preliminary draft. The Chordata contains several groups that have been studied for molecular evolution previously. Also, as the group contains two endothermic groups as well as ectothermic groups, there would be some very interesting discussion points that could be raised for that group.

Thank you very much for letting me know. I would then prioritize getting together the Chordata ref seqs.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

The Lepidoptera run didn't go as well as I had hoped. With your permission, I'd like to try a higher tier of instance for Lepidoptera and possibly the other large insect orders. During the centroid-finder step, the MUSCLE algorithm surprisingly maxed out memory usage and R crashed; I guess I underestimated the resources it would take. I'd like to try the r4.16xlarge, the next tier up from the r4.8xlarge I was using. It is more expensive, at $4 per hour, but I'd be happy to split the cost.

I'd be happy to run Chordata but I might not be able to get to it until after Christmas.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thanks for this information. Sorry to hear that didn't go as well as hoped.

Does it perhaps make sense to try a smaller group first, such as a specific Lepidoptera family (combined with the mite order so that two classes are represented, for running the class-level pipeline), to ensure that the smaller group runs successfully before using a lot of resources on the attempt at the larger group?

I thought that the cloud would just chug through the centroid step and possibly get stuck on the alignment2 step, if it was going to get stuck.

If there are any individually really huge BINs, it occurred to me that we could do data reduction for those; for example, we could take the first 500 sequences from each huge BIN.
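The per-BIN cap could be sketched roughly like this (dfSeq, bin_uri, and capBinSize are illustrative placeholder names, not the pipeline's actual objects):

```r
# Hedged sketch of the data-reduction idea above: cap each BIN at its
# first 500 sequence records. Column names are hypothetical.
capBinSize <- function(dfSeq, maxPerBin = 500) {
  do.call(rbind, lapply(split(dfSeq, dfSeq$bin_uri),
                        function(x) head(x, maxPerBin)))
}

# Toy example: one oversized BIN (600 records) and one small BIN (10).
dfSeq <- data.frame(bin_uri = rep(c("A", "B"), c(600, 10)),
                    seq = "ACGT", stringsAsFactors = FALSE)
nrow(capBinSize(dfSeq))  # 510: BIN A capped at 500, BIN B untouched
```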

Those are just a few thoughts on addressing this. I am happy to cover the cost. Please do save any receipts issued - thank you very much. If the bills pile up, I will likely charge the expense to a research account. In any case, of course I will reimburse you. Thank you for running this!

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

When I ran the script with the Annelida dataset, it seemed to run fine, but I can try a specific Lepidoptera family with the Arachnida order and see if it works. If I can confirm that works, then I'll try Lepidoptera again with the better instance and let you know how it goes.

Thanks again for offering to cover the cost. Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for giving this a go. If you wanted to do a smaller test first, you could perhaps consider picking a family such as Geometridae (or another of the large lep families) to make sure this will run on quite a large group, combined with the mites.

Fingers crossed!

It would be very nice indeed if we can run this at the order level (even if we can't run all insects in one run), due to the missing family-level IDs we discussed earlier. So, here's hoping!

Of course, that's fine if Chordata sits until after Christmas. I have plenty to work on with the intro and methods over the next 2 weeks. I hope you have a great holiday!

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

I tried Sphingidae with Mesostigmata and did a quick run-through on a really cheap instance ($0.022 per hour). I have confirmed that all sections of the code are working properly, so I'm now going to proceed with Lepidoptera and see how it goes.

Best Regards, Matt

sadamowi commented 7 years ago

OK! Thanks for doing the test and for the update! Good luck!

m-orton commented 7 years ago

Hi Sally,

Just a quick progress update.

I managed to get through the first few sections of the code (sections 1-5) for Lepidoptera/Mesostigmata and am now chugging along on alignment2. (I started running the code late, around 9:30.) It looks like alignment2 will probably take a few hours or so.

I was able to get the script to work with the r4.8xlarge instance, so we don't have to upgrade to the more expensive tier. Now I'm not so certain whether that earlier error was due to memory issues, since I got it to work this time. Regardless, I'm saving workspaces every few sections and exporting them to a folder I created for Arthropoda in the shared folder, so we can easily reload workspaces if R crashes again. The nice thing is that I've synced the shared Dropbox folder with the instance I'm running, so I can directly export FASTAs, workspaces, and CSVs there.

Also, the Mollusca results are pretty much finished (with the new settings), so I should be able to send those to the shared folder tomorrow.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thanks for the update! That's great to hear of your progress and success with the first sections of code.

If we need to upgrade the instance, especially for a huge group like leps, that's no problem. That should be an extremely results-rich group for us, given its large size as well as the extensive barcoding efforts at various latitudes.

Great idea about saving workspaces at interim points in order to be able to start from a later point in the code. Fingers crossed for the next stage! And, great work for finding this cloud option. Exciting!

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

Just letting you know I got about 3 hours into alignment2 last night and R crashed on me again, lol. The alignment seems to get stuck when transferring to the second iteration. I was running a pretty powerful instance (r4.8xlarge), so I'm not sure if upgrading the instance tier would help.

So I think to get it to work, we might have to resort to simply eliminating BINs without family-level classification, as we talked about before, and save that for a future project.

I suggest we do this for all of the large insect orders, to ensure we're not spending too much instance time on the alignment step.

Let me know your thoughts, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for trying this and for the update. We were perhaps a bit too hopeful about such a huge alignment.

I agree with you about going back to family level for the huge orders. We will still have a lot of results from those.

Do you have a sense of the maximum size we can run at once? That would help us structure what goes into each analysis and which groups may benefit from their own reference seqs.

Thank you very much.

It's really been a pleasure to work on this together, and I look forward to wrapping this up early in the new year. I'll try not to bug you over the break! Have a great trip and vacation.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Thanks, it's been great to work on this project together.

My sense is that at the family level, I don't think there would be any family large enough to give alignment2 trouble. Back when I did the initial pairing results for all phyla, I was able to run all of the alignments of families within Arthropoda locally. However, that was with the msa package, so it's hard to say with the muscle package.

Another thing we could try is still running the large insect orders at the order level, but eliminating all of the BINs that don't have family-level classification before the alignment2 step. I'm hoping this would reduce the size of the alignment enough that it could still be performed at the order level.

I'll be away for a few days until the 28th but let me know if you need my help on anything.

Have a great time with your family! Matt

sadamowi commented 7 years ago

Hi Matt,

I hope you are having a great Christmas vacation.

I like your idea about staying at the order level but filtering the data first to reduce the data set. This prompted me to think further about the best way to filter the data. I'd like to suggest that we consider filtering the data using lat and long. I think this is more biologically meaningful than filtering by presence/absence of a family ID.

Fauna on distant continents has evolved largely in isolation, barring the very rare dispersal event. Given our divergence cutoff for making pairs, I would expect very few pairs between Australia and North America, for example. I'm not sure if the dataset would still be too big, especially for leps, but biologically it would make a lot of sense to group the data by lat/long. For example, we could run all of North America in one run. I would expect that to be an excellent data set, given the intensive barcoding of leps in North America, particularly Canada and Costa Rica. If that dataset is too big, then I could look further into lep biogeography and suggest more refined lat/long cutoffs.

What do you think?

Best wishes,

Sally

If you think that may work, I can suggest lat/long groupings that would make sense.

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml



m-orton commented 7 years ago

Hi Sally,

Hope your Christmas vacation is going well.

I think grouping by lat/lon should work well, since it should be easy to specify continents in the URL for the TSV download. Hopefully each region will be manageable in size. Just let me know which lat/lon regions to do, and I should be able to start working on it in the next couple of days since I'm back from Montreal now.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

I'm glad you think that will work. I don't believe that continent is an available data field, is it?

I think our strategy for dividing up regions would depend upon the total max size of the data set that can run.

For leps, I'd like to suggest trying North America first to see if that will run. We could go with lat/long filtering. Or, if you think it's easier to filter right from the point of the TSV download, we could specify the countries in the API.

I don't see the syntax clearly explained anywhere on the BOLD website. It appears, from the examples, that %20 would be needed for country names containing multiple words. Do you know if that is correct?

For North America, would geo be:

geo=Canada|United%20States|Mexico|Belize|Honduras|Nicaragua|Guatemala|Costa%20Rica|El%20Salvador|Panama

Please do check the syntax before implementing. Have I got this right?

Also, please let me know if you think this is the best approach for the filtering. If so, I can put together the lists of countries. I would also check the literature for some justification for the biogeographic regions selected.

If this North American data set is still too big to run, then I will think about that further in terms of dividing it up further. We could perhaps do tropical vs. south temperate and north temperate vs. Arctic as separate runs for North America. Other continents would have less data and should be easier to deal with.

It does seem that Lepidoptera is our biggest hurdle for this project (and also a very rich source of pairs). So, hopefully we can get this one sorted out, and then the other arthropod groups should be more feasible to run in comparison.

Happy New Year!

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

My bad on saying continent; I meant to say country. I checked the BOLD website too, and it doesn't seem to be very clear on the syntax for multiple words, but I think your syntax looks good, so hopefully it will work.

Overall, I think this is the best approach to the filtering. I can start with the North American countries, and then if it's still too big we can break things up even more.
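One way to sanity-check the multi-word syntax would be to build the geo parameter with URL encoding rather than hand-writing the %20s. This is only a sketch: the endpoint shown is my reading of the public BOLD combined API, so it's worth verifying against the docs before a big run.

```r
# Hedged sketch: percent-encode each country name, then join with "|"
# as in Sally's example. The endpoint URL is an assumption to verify.
northAmerica <- c("Canada", "United States", "Mexico", "Belize",
                  "Honduras", "Nicaragua", "Guatemala", "Costa Rica",
                  "El Salvador", "Panama")

geoParam <- paste(vapply(northAmerica, URLencode, character(1),
                         reserved = TRUE), collapse = "|")

tsvUrl <- paste0("http://www.boldsystems.org/index.php/API_Public/combined",
                 "?taxon=Lepidoptera&geo=", geoParam, "&format=tsv")
```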

Happy New Year! Matt

sadamowi commented 7 years ago

Dear Matt,

Happy New Year!

That sounds great. Please do let me know how that goes. Fingers crossed!

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Just letting you know I'm going to start running through Lepidoptera according to region. I discovered that the initial TSV has a country column that I can include in the initial download.

To save time, I think what I will do is another full run-through of sections 1-5 of Lepidoptera, keeping this country column retained in each dataframe. Then I can subset by country in section 6 (prelim alignment) so I don't have to run through sections 1-5 for each geographic region. If the prelim alignment gives me trouble, I can start right back at section 5 and reduce the size of the geographic region being run even further.

I think I will also do this for Diptera, Hymenoptera and Coleoptera (when we get to them) in case we run into problems with those orders.

Best regards, Matt

sadamowi commented 7 years ago

Hi Matt,

OK - that sounds great. If you are downloading everything first and running through the initial sections of code first, we could consider filtering later in the pipeline by lat/long. That could be easier than generating lists of countries. However, either is fine with me, and I can help with setting the boundaries. Which do you think would work better?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, I think setting up lat/long boundaries would be easier, since we don't have to worry about specifying countries. Also, I think I could reuse the workspace I have for Lepidoptera without doing another run-through. I'm guessing I could just filter by the median lat (mapping lat, not absolute lat) and lon of each BIN then?

Best Regards, Matt

m-orton commented 7 years ago

Hi Sally, really great news!

I filtered by median lat and lon according to the lat/lon boundaries of North America. I set the minimum lat as the southernmost point of Panama, which is 7°N (is that reasonable?), and used -180 to -50 as the lon boundaries to define North America.

This gave me a total of 17,755 BIN sequences to be aligned in Lepidoptera for this region (down from roughly 56,000 BIN sequences globally).
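As a sketch, the filter amounts to something like the following (dfBins and its medianLat/medianLon columns are illustrative placeholders, not the pipeline's real object names):

```r
# Hedged sketch of the North America filter described above.
dfBins <- data.frame(
  bin_uri   = c("BOLD:AAA0001", "BOLD:AAA0002", "BOLD:AAA0003"),
  medianLat = c(45.2, -12.8, 9.5),
  medianLon = c(-75.7, -77.0, -79.5)
)

# North America: median lat >= 7 (southernmost Panama) and
# median lon between -180 and -50.
dfNorthAmerica <- dfBins[dfBins$medianLat >= 7 &
                         dfBins$medianLon >= -180 &
                         dfBins$medianLon <= -50, ]

nrow(dfNorthAmerica)  # 2: the record at lat -12.8 is excluded
```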

I ran through the prelim alignment, and it looks really great considering the size of the alignment. I just posted it to Dropbox if you want to take a look. There are a few minor gaps (1-2 bp) at the tail end of the alignment.

The alignment for Mesostigmata looks OK, but it does have a few significant gaps.

When you get a chance, let me know if you think I should proceed to the next alignment steps (or if the lat and lon boundaries need more fine-tuning), and also how I should define the other geographic regions. I assume South America would be its own region, but I'm not sure how to subdivide the rest of the world.

Best Regards, Matt

sadamowi commented 7 years ago

HI Matt,

That is great news that such a large dataset ran!

I was just now looking at some maps of biogeographic regions and trying to figure out where we might draw the lines to give a good chance of maximizing our possible number of pairs (or nearly so). It is really helpful to know that such a large dataset will run. We may not need to divide the world into too many chunks.

I'm not sure if the border of Panama is the best place to draw the line. Sometimes, genetic evidence reveals different dispersal patterns compared to traditional biogeographic regions. I will think further about this ... how to come to a reasonable decision that would be appropriate for multiple taxonomic groups. I'll try to get back to you in the morning with a suggestion.

I'll have a look at the alignments, probably tomorrow morning.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

I don't necessarily have a definitive answer at this point, but I have a few thoughts to share that I think would result in reasonable divisions for us.

A. I suggest that the lines we choose should result in 100% of the world being covered, even for terrestrial groups. (For example, consider a pan-Arctic species found in both Arctic North America and Arctic Eurasia: it could have a mean lat/long that falls in an ocean.)

B. I suggest that we ensure each region has a substantial latitudinal range to give a chance of forming pairs.

The south of Panama may not be the ideal place to draw the line between the North and South American regions, based upon major biogeographical regions. I am a little torn about this. Here is a link with some images of classical biogeographic regions, which are based on other groups (e.g. vertebrates and plants):

http://www.els.net/WileyCDA/ElsArticle/refId-a0003231.html

Information about insect biogeography is still only partial. And note that we aren't only considering species with shared distributions among regions but also evolutionary similarity, including pairs up to 15% divergence.

As Costa Rica, Canada, and the USA are the best-barcoded regions, I was thinking that analyzing those together would likely give a lot of good pairs. See:

http://www.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=113

So, considering both biogeography and also barcode availability, I offer the following suggestions for dividing the world into four regions:

  1. North America. I suggest using -170 longitude (Bering Strait) as the west boundary. I suggest using -15 long as the east boundary, in order to include Greenland. Greenland is generally considered part of North America, and there has been a large barcoding project there (but diversity is low). I suggest going to +90 lat for the north boundary. For the south, I agree with you to go to the south of Panama, +7 lat. (If this runs well, we could consider the mid-Mexico boundary between the Nearctic and Neotropical regions for comparison with this choice.)

  2. South America. +7 lat for north boundary. -90 lat for south boundary. -30 long for east boundary. -135 long for west boundary.

  3. Australasian region. -135 long for the east boundary, +90 long for the west boundary. (Note that this region crosses 180 longitude, so two sets of numbers may be needed to get the right region.) -90 lat for the south boundary, zero degrees lat for the north boundary. This would ensure that Papua New Guinea (PNG) is grouped with Australia rather than Asia; the two used to be connected by a land bridge and have similar biotas.

  4. Eurasia + Africa. If I've specified the above correctly (I would appreciate it if you'd check against a map to make sure these lines look correct), then after removing the above three regions, everything else that is left could be grouped together. If this region ends up too large, then let me know.
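The four regions above could be sketched as a single assignment function in R (a rough illustration, not pipeline code; the function name and region labels are made up). Note that the order of the checks matters where boxes overlap, and that Australasia needs two longitude intervals because it crosses 180 longitude:

```r
# Assign a median lat/lon to one of the four proposed regions.
# Boundaries taken from the suggestions above; checks are ordered so that
# North and South America take precedence over the catch-all regions.
assignRegion <- function(lat, lon) {
  if (lat >= 7 && lon >= -170 && lon <= -15) {
    "NorthAmerica"                        # +7 to +90 lat, -170 to -15 long
  } else if (lat < 7 && lon >= -135 && lon <= -30) {
    "SouthAmerica"                        # +7 to -90 lat, -135 to -30 long
  } else if (lat <= 0 && (lon >= 90 || lon <= -135)) {
    "Australasia"                         # crosses 180: [+90, 180] and [-180, -135]
  } else {
    "EurasiaAfrica"                       # everything left over
  }
}

assignRegion(45, -75)    # North America
assignRegion(-10, -60)   # South America
assignRegion(-25, 135)   # Australasia
assignRegion(50, 10)     # Eurasia + Africa
```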

What do you think? Thoughts welcome!

Best wishes, Sally

sadamowi commented 7 years ago

Note for Jacqueline:

Please feel free to leave comments to this thread. If you have thoughts on any of the matters we are discussing, I'd be pleased to hear your ideas.

sadamowi commented 7 years ago

Hi again Matt,

I've had a quick look at the alignments relating to this run. I agree that the lep alignment looks good, especially for a preliminary alignment. For the mite alignment, I was also a bit surprised by the large size of some of the gaps. I will check Monica Young's paper about the molecular evolution of the barcode region in arachnids to see if there are any insights there. I will also check the beginning of the alignment more carefully. It's possible that something has gone awry near the beginning and that those gaps aren't evolutionarily real in terms of being an indel. I will start another thread for that, and I'm afraid I will need to look at that next week.

Best wishes, Sally