m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
7 stars 1 forks source link

R Style Guide #33

Closed jmay29 closed 7 years ago

jmay29 commented 7 years ago

Thought this might be a good place to discuss formatting/style in R (I know I need reminding!!!).

Here is a useful link I found:

https://google.github.io/styleguide/Rguide.xml

m-orton commented 7 years ago

Thanks for posting this Jacqueline,

Think I might have to change my indentation format since I was using tab and also change the spacing for my comments.

jmay29 commented 7 years ago

Yeah, I have to change the formatting of some of my variables in my own code - I have (unfortunately) become quite accustomed to underscores in my variable names !!!

sadamowi commented 7 years ago

Hi Jacqueline and Matt,

Happy new year to you both!

This is a great guide. Thanks for sharing, Jacqueline. That is good if you are thinking about getting the code and commenting ready for publication.

However, I'd like to suggest to first complete your full run-through and checking of the interim objects. I hope to move on to generating the final results. I can then work on completing the manuscript draft while you two finalize the formatting and commenting for publication. Does that sound reasonable?

Best wishes, Sally

jmay29 commented 7 years ago

Sounds like a plan, Sally!!

m-orton commented 7 years ago

Hi Sally and Jacqueline,

In regards to formatting and commenting of the pipeline, I had a few suggestions -Change the master branch to Annelida since we have been using it as the base branch for a while now. (plus master is really out of date now lol) -In fitting with the R Style guide, comments will now have a space inbetween the hash and words, all indents will be two spaces, correct spacing will be ensured for binary operators and commas within commands. -The R Style guide says not to use attach statements but the only way the plotly plot seems to work is if I use an attach statement so I'd like to propose we make an exception in that case. -The R Style guide also says to have a 80 character limit for each line which is fine for comments and commands (i will edit lines which break this rule). However, I'd like to propose we make an exception in the case of the reference sequences since breaking the reference sequences up on to new lines introduces new line characters that interfere with some of the commands that read in sequences like dnaStringSet. -Also, it says to have a copyright statement at the top, so I think I'm just going to make reference to the GNU licence agreement (included in the R Scripts repository) by saying this at the very top of the pipeline:

"This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/."

Let me know if this all sounds good. Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

All of that sounds good to me. Thank you.

I guess there is also a question of how many branches to make public, given that the code is run various times. I suggest that the trend for research reproducibility would be to make all of the branches public that contain unique code. For example, you would also post a branch for Lepidoptera that contains the code for dividing up the BINs by geographic region. Would you mind posting that too?

Thank you very much.

Best wishes, Sally

m-orton commented 7 years ago

No problem, I will make a separate branch for Leps as well with the added code for regions.

sadamowi commented 7 years ago

Thank you Matt. Sounds good.

jmay29 commented 7 years ago

Sounds good to me, too! It is pretty difficult to keep some lines to 80 characters...especially when dealing with DNA sequences!!

m-orton commented 7 years ago

Hi Sally and Jacqueline,

Just to let you both know, I have finished making the formatting changes to the pipeline so that it adheres (mostly) to the style guide. But please be on the lookout for spaces between operators, spaces after hashes in comments and spaces between before/after round brackets of if statements. I think I got them all but the script is so long now I might have missed a few.

In regards to the 80 character rule, I have had to allow two exceptions:

  1. The initial BOLD url - I tried to break it up but it seems the url must be on one line. Please let me know if there is a way around this.
  2. Reference sequences - it is possible to break up reference sequences however then you have to add additional code that removes new line characters which adds more complexity to the code.

(As a side note apparently that 80 char rule comes originates from back when computers used punchcards lol (punchcards had 80 columns) so Im not sure if its still relevant but it does have a practical use for allowing the R Studio window to be smaller when looking at the code.)

In discussing with Sally, there are now two versions of the script on github each with its own master branch. One version handles the smaller taxa while the other handles the larger taxa (mostly large insect orders). We decided on this since having too many versions of the script becomes impractical for editing and making changes. Versions of the script specific to each taxa with reference sequences are now posted in a folder on dropbox.

This large dataset version has a few key differences including: -The addition of region code (NA, SA, AUS and EUR/AFR) for dividing large taxa. -Setting maxiters to 2 in MUSCLE alignments to reduce alignment times. -Few code rearrangements in section 10,11,12 to enable faster pairing results determination with very large datasets. -Removal of large variables in certain sections and usage of garbage collection command to reduce working memory consumption. -Usage of the data tables package for some merges (outgroup section) since data table merges are much faster than data frame merges for large datasets.

Let me know if this all sounds good.

Best Regards, Matt

jmay29 commented 7 years ago

Hi Matt!

Sounds awesome! I will check the script over for those things you mentioned. And I'm sure it's fine if some lines are over 80 characters!

sadamowi commented 7 years ago

Hi Matt,

That all sounds great to me. I think that maintaining just two versions of the script makes a lot of sense, rather than 17! Thanks for placing the versions with the taxon-specific reference sequences in dropbox. For those, we don't have to worry if there are minor edits in the future to the commenting. We would only have to re-run things if we find a problem such that incorrect results are being generated.

I suggest that this issue can be closed after Jacqueline has a change to check the script a final time, as she mentioned in the comment above.

Cheers, Sally

jmay29 commented 7 years ago

Hello! I am just checking over the script one more time right now :) !

jmay29 commented 7 years ago

Hi Matt,

Just a few lines without spaces around the operators (EvolutionaryComparisonPipelineSmallTaxa.R master branch):

Line 1042: dfPairingResultsL1 <- dfPairingResultsL1L2[dfPairingResultsL1L2$index%%2==0, ]

Line 1044: dfPairingResultsL2 <- dfPairingResultsL1L2[dfPairingResultsL1L2$index%%2>0, ]

Line 1166: dfPairingResultsL1 <- dfPairingResultsL1L2[dfPairingResultsL1L2$index%%2==0, ]

Line 1167: dfPairingResultsL2 <- dfPairingResultsL1L2[dfPairingResultsL1L2$index%%2>0, ]

Lines 1639-1646: (the addition signs) if (testOutGroupDist[i] > testOutGroupDist[i+1]) { trueTestOutGroupDist[x] <- testOutGroupDist[i] / testOutGroupDist[i+1] } else if (testOutGroupDist[i+1] > testOutGroupDist[i]) { trueTestOutGroupDist[x] = testOutGroupDist[i+1] / testOutGroupDist[i] } else if (testOutGroupDist[i+1] == testOutGroupDist[i]) { trueTestOutGroupDist[x] = testOutGroupDist[i+1] / testOutGroupDist[i]

And hashtags that need a space after them:

Line 1064: #subsetting Lineage 1 by this latitude check

Line 1066: #Now subsetting Lineage 2

Line 1068: #Now subsetting for dataframe with both lineages

Line 1132: #If overlapIndTotal is not empty:

Line 1134: #Eliminate based on that pairing number(s) for each pairing result dataframe

Line 1926: #Binom test for each class.

Line 2150: #(obtained from making an account and in settings of account details)

m-orton commented 7 years ago

Thanks very much Jacqueline, I'll make the edits today.

m-orton commented 7 years ago

This issue was moved to jmay29/lat-project#6