m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0

Sally tasks for next week (note on Dec 7th) #23

Closed: sadamowi closed this issue 7 years ago

sadamowi commented 7 years ago

Notes about my tasks:

  1. update R (to very very secure dishes)
  2. rerun Annelida script to ensure everything works smoothly with new R version (and with the newest updates to the script from Matt, revisions made Dec. 7th)
  3. compare results to previous R version I ran (sincere pumpkin patch) to assess consistency
  4. at that point, if all seems good, contact Jacqueline to run Annelida script on an additional taxon and to evaluate interim steps (remember to mention R versioning issue)
  5. manually generate a script file with the reference sequences for Cnidaria and Echinodermata
  6. run those phyla
  7. test alignment settings (number of iterations - can we reduce?)
  8. assess consistency of results across trimming lengths (600 bp, 620 bp, 640 bp)
  9. move forward with the remaining three phyla (Mollusca, Chordata, Echinodermata)
  10. continue work on manuscript prose

(For step #9, I was wondering if you'd be willing to try running Mollusca at the class level on your better computer? If that's possible, I think that would be the better choice for that phylum. If we do end up needing to go with order, we would lose some unidentified sequences, or sequences identified using an alternative taxonomic hierarchy, but I think that isn't catastrophic for the project. I contacted Compute Canada, but they indicated they are still awaiting approval from Guelph for my account. Hopefully that will be sorted out soon, so that we have access to more computing resources if needed.)

Let me know if I missed something!

Cheers, Sally

m-orton commented 7 years ago

Sounds good to me. I'll update to secure dishes and then run through Mollusca at the class level. Did you want me to use placeholder references for Mollusca?

Also, I will try to address the other errors with the plot and plotly map. If you think the Annelida script is in a good state now, I could also work on an updated order-level analysis script if you want?

Also, really glad the results match now!

Best Regards, Matt

sadamowi commented 7 years ago

All great news! Thanks! I will send real mollusc seqs. I think we want to aim for real results now. Will reply more tomorrow.


sadamowi commented 7 years ago

Hi again Matt,

Thank you for having a look at the error messages relating to the plotting. I think those plots are helpful, and it would be great if you are able to sort out those functions to run in the updated version of R.

Have you been in contact with Winfield? Do you know if he is also actively testing the code? I suggest mentioning the versioning decision to him as well as the Annelida update. Thank you.

I will prepare Mollusca sequences and hopefully send those today. Does that work for you, given other commitments, if your computer has to chug away for a day or two on that larger phylum?

Thank you for trying that. It occurred to me that, at the very least, we should be able to speed up the alignment step at the centroid stage. Within BINs, sequence variability is very low, often <1% and usually up to a maximum of about 2.5%. So, we should be able to gain alignment efficiency there. That would be worth considering.
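To make the within-BIN variability figures concrete, here is a minimal sketch of how the maximum pairwise divergence (p-distance) within one aligned BIN could be checked. It is written in Python for illustration only and is not the pipeline's R code; the function names and the toy sequences are invented for this example and are not BOLD data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an ambiguous 'N'
    are skipped rather than counted as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def max_within_bin_distance(seqs) -> float:
    """Largest pairwise p-distance among the sequences of one BIN."""
    return max((p_distance(a, b) for a, b in combinations(seqs, 2)), default=0.0)


# Toy 40 bp aligned sequences (hypothetical, for illustration only).
toy_bin = [
    "ACGT" * 10,
    "ACGT" * 9 + "ACGA",  # differs from the first sequence at one site
    "ACGT" * 10,
]

max_divergence = max_within_bin_distance(toy_bin)
print(f"max within-BIN p-distance: {max_divergence:.3f}")
```

With one differing site out of 40, the toy BIN has a maximum p-distance of 0.025, i.e. 2.5%.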

About the order-level analysis, I suggest that we check whether Jacqueline can check the Annelida code in the near future. It would seem most efficient to proceed to the order pipeline after we are confident in the final version of the class pipeline. I'd like to be mindful of your other commitments. What do you think?

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

I haven't heard from Winfield in a few days, but I'll contact him and let him know about the script changes and the updated versioning.

No problem on Mollusca, I should be able to run through it this weekend and let you know how it goes.

For the centroid alignment, maybe we could set the diags setting to True on the muscle command to speed it up more? For Annelida it seemed like it was able to run through each BIN quite quickly but maybe for larger taxa it would become useful to have this setting turned on.
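For reference, a rough sketch of what turning on diags looks like at the command-line level. This assumes the MUSCLE v3.x executable and its `-in`, `-out`, `-diags` and `-maxiters` flags; the thread does not say which MUSCLE interface the script uses, and the file names and the `muscle_command` helper are hypothetical. It is in Python for illustration only and just builds the command without running it.

```python
import shlex


def muscle_command(fasta_in, fasta_out, diags=True, maxiters=None):
    """Build a MUSCLE v3.x command line (assumed flags: -in, -out, -diags, -maxiters)."""
    cmd = ["muscle", "-in", fasta_in, "-out", fasta_out]
    if diags:
        # -diags lets MUSCLE use diagonals, which is faster for closely related sequences.
        cmd.append("-diags")
    if maxiters is not None:
        # Fewer iterations trade some refinement for speed.
        cmd += ["-maxiters", str(maxiters)]
    return cmd


# Hypothetical file names for one BIN's centroid alignment.
cmd = muscle_command("bin_input.fasta", "bin_aligned.fasta", diags=True)
print(shlex.join(cmd))
```

The same helper could also be used to try a reduced iteration count, for example `muscle_command("bin_input.fasta", "bin_aligned.fasta", maxiters=2)`.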

As for the order level analysis, I agree that it would be good for Jacqueline to take a look at the Annelida script first before proceeding further.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thanks for getting in touch with Winfield.

I agree about your suggested setting change for the centroid alignment step. Indeed, there are many more total sequences and BINs for Mollusca (particularly class Gastropoda) than for Annelida.

Would you mind making that change in the Annelida branch too? That way, I would use the same setting as you as I run through the Annelida code a final time with the newest R version.

Also, as an update ... I received my Compute Canada renewal notice today. I have asked Jacqueline if she has example job files. I understand the procedure for submitting jobs may be somewhat different compared to what the McGill folks use for the Quebec cluster. So, hopefully I can obtain an example file so that we can use that resource too, as needed. I would plan to submit a small job first, one we can also run on a local computer, for comparison prior to moving to a big task like Arthropoda.

Cheers,

Sally

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml



m-orton commented 7 years ago

Hi Sally,

Winfield just got back to me confirming he would help test the Annelida code. I made sure to mention the updated versioning as well.

I also just updated the script for the Annelida branch and set diags to true in the muscle command there.

Good to hear about Compute Canada, hopefully it will be a useful resource for us.

Best Regards, Matt

sadamowi commented 7 years ago

Thank you very much Matt. I will update you as I complete the above issues.

Cheers, Sally

sadamowi commented 7 years ago

http://www.goodreads.com/quotes/359519-lucy-was-using-my-blanket-to-dry-the-dishes-we

m-orton commented 7 years ago

Haha, guess we know where these version names are coming from.

sadamowi commented 7 years ago

Hi Matt,

I am happy to report that steps 1-3 above are complete. I got the same results using the newer R (dishes) compared to the previous version (pumpkin). As well, with the exception of the previous errors relating to plotting (and I think one new package that seems to be needed), everything ran smoothly, and there were no new errors.
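A consistency check of this kind between two runs could be sketched roughly as below. This is an illustration in Python, not the pipeline's R code; the `compare_runs` helper, the result keys and the numbers are all hypothetical placeholders, not actual output from either R version.

```python
import math


def compare_runs(old, new, rel_tol=1e-9):
    """Return the keys that are missing from one run or whose values differ beyond rel_tol."""
    mismatches = []
    for key in sorted(set(old) | set(new)):
        if key not in old or key not in new:
            mismatches.append(key)
        elif not math.isclose(old[key], new[key], rel_tol=rel_tol):
            mismatches.append(key)
    return mismatches


# Hypothetical per-pair values from the two runs (placeholders, not real results).
run_pumpkin = {"pair_001": 0.0123, "pair_002": 0.0450}
run_dishes = {"pair_001": 0.0123, "pair_002": 0.0450}

mismatches = compare_runs(run_pumpkin, run_dishes)
print("runs match" if not mismatches else f"differences in: {mismatches}")
```

Using a small relative tolerance rather than exact equality avoids flagging harmless floating-point differences between versions.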

I like how you have the FASTA files now optional for exporting but in a streamlined set of commands, covering all classes present, without repeating the alignment.

I will proceed with the other tasks.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

Glad you like the FASTA commands and the script is running smoothly. I'm currently running through the centroid alignments for Mollusca. Seems to be running well so far.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

I am happy to report that tasks 1-9 above are complete. I will close this issue and generate a new task list.

Cheers, Sally