ShaiberAlon closed this issue 4 years ago
I think it would be best to not require R to be installed at this time. If the Python program that calls this script can catch the error and communicate to the user with an exception, they would know that they need to install X and Y.
But we could include R along with these libraries in our docker image for v6.
Possible workflow:
1. Check for an R installation and throw an error if R isn't installed on the system.
2. If R and the packages are installed, silently load tidyverse, magrittr and qvalue.
3. If R is installed but the packages are not, throw an error and suggest the user install those packages with conda. I can provide the specific commands.
@meren, there is no python program calling this. This is simply an R script.
@ShaiberAlon you're talking about anvi-script-gen_stats_for_single_copy_genes.R?
Conda installs:
conda install -c r r-tidyverse
conda install -c bioconda r-magrittr
conda install -c bioconda bioconductor-qvalue
No, I am talking about anvi-get-enriched-functions-per-pan-group. It is simply an R script. And so if R is not installed there is no way for us to catch this. It would just be a system error, something like this:
env: Rscript: No such file or directory
@meren, are you suggesting that we would have a python wrapper that checks things are installed and only then executes the R script?
> are you suggesting that we would have a python wrapper
I had envisioned that anvi-get-enriched-functions-per-pan-group would be a Python program that takes in pan db, genome storage, categorical variable, etc, and it would call the R script only for the test we did not want to implement in Python. So the name of the script would be much more specific to what it is testing. In that case we could generate all the input files from within Python, and call it to produce an output file which we could read back in Python.
In that scenario, a call to this would make sure the user knows if they want this they better have R:
https://github.com/merenlab/anvio/blob/master/anvio/utils.py#L302
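A minimal sketch of that kind of up-front check, assuming the wrapper only needs to confirm that an Rscript executable is on the PATH before it does any work. The names require_program and MissingProgramError are hypothetical and are not taken from the anvi'o function linked above.

```python
import shutil


class MissingProgramError(Exception):
    """Raised when a required external program is not on the PATH (hypothetical class)."""


def require_program(program, hint=""):
    """Return the full path to `program`, or raise an error the user can act on."""
    path = shutil.which(program)
    if path is None:
        message = f"This program requires '{program}', but it was not found on your PATH."
        if hint:
            message += " " + hint
        raise MissingProgramError(message)
    return path
```

A wrapper could call require_program("Rscript", hint="Please install R to run the enrichment test.") as its first step, so the user sees a readable message instead of a raw "env: Rscript: No such file or directory".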
@meren There should be python running on either side of the R script, i.e., the user should not directly call the R script; it should be called within a python program. Additional work is needed to generate the temp txt file with the presence-absence data (@ShaiberAlon gave me the desired input and output formats), and then to close out the command.
Essentially the R script replaces some component of the existing python-only workflow; but not all of it.
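The "python on either side of the R script" idea could be sketched roughly as follows. This is only an illustration: the function name run_r_test, the two positional arguments passed to the R script (input path, then output path), and the TAB-delimited layout are assumptions, not the real script's interface.

```python
import subprocess
import tempfile
from pathlib import Path


def run_r_test(r_script, rows, output_path, runner=subprocess.run):
    """Write `rows` to a temporary TAB-delimited file, run the R script on it,
    and return the lines of the output file the script produced.

    The command-line layout (input path, then output path) is an assumption
    made for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Python side 1: generate the input file for the R script.
        input_path = Path(tmp_dir) / "input.txt"
        input_path.write_text("\n".join("\t".join(map(str, row)) for row in rows) + "\n")

        # The R script runs only the statistical test.
        command = ["Rscript", str(r_script), str(input_path), str(output_path)]
        result = runner(command, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"The R script failed:\n{result.stderr}")

    # Python side 2: read the results back.
    return Path(output_path).read_text().splitlines()
```

The runner argument defaults to subprocess.run and exists only so the function can be exercised without R installed. With the default runner, a missing Rscript raises FileNotFoundError, so a check that R is installed should come first.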
Ok, I’m on it. I’ll make it that way. But I’ll only get to do that on Thursday, since tomorrow I’m traveling.
Thanks @ShaiberAlon! You may wish to change the input arguments to the R script if needed. The format is quite straightforward and you can follow the lead of the current input arguments; let me know if you have questions.
Just FYI, a python script to prepare the input already exists (anvi-get-functional-occurrence-summary-per-pan-group; I pasted its help menu below), but it is not a wrapper. What I originally had in mind was that the user would call the python script to generate the input for the R script, and then call the R script themselves. But I think having one python script that does everything is better.
$ anvi-get-functional-occurrence-summary-per-pan-group -h
usage: anvi-get-functional-occurrence-summary-per-pan-group
       [-h] -p PAN_DB [-g GENOMES_STORAGE] [--category-variable CATEGORY]
       [--annotation-source SOURCE NAME] [-l] -o FILE_PATH [-F FILE]
       [--exclude-ungrouped]

A program that takes a pangenome, and a categorical layers additional data
item, and generates the input for anvi-get-enriched-functions-per-pan-group.
If requested a functional occurrence table across genomes is also generated.

optional arguments:
  -h, --help            show this help message and exit

INPUT FILES:
  Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file

CATEGORY VARIABLE AND FUNCTIONAL ANNOTATION SOURCE:
  This is the layers additional data item in which your genomes are split
  into multiple groups. So anvi'o can figure out what functions are specific
  to each group of genomes in your pangenomic analysis. If this is not
  making any sense, please take a look at the online tutorial for
  pangenomics (http://merenlab.org/2016/11/08/pangenomics-v2/).

  --category-variable CATEGORY
                        The additional layers data variable name that divides
                        layers into multiple categories.
  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
  -l, --list-annotation-sources
                        List available functional annotation sources.

REPORTING:
  Output and stuff.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results.
  -F FILE, --functional-occurrence-table-output FILE
                        Saves the occurrence frequency information for
                        functions in genomes in a TAB-delimited format. A file
                        name must be provided. To learn more about how the
                        functional occurrence is computed, please refer to the
                        tutorial.

OPTIONAL PARAMETERS:
  Parameters to help you filter the output.

  --exclude-ungrouped   Use this flag if you want anvi'o to ignore genomes
                        with no value set for the catergory variable (which
                        you specified using --category-variable). By default
                        all variables with no value will be considered as a
                        single group when preforming the statistical analysis.
@meren, @adw96, I completed the wrapper for this, including the sanity check for R and for the packages (https://github.com/merenlab/anvio/commit/91f9cf1531febdbf96feb74c3a68747b91e868de).
Currently the sanity check for packages is within the R script. It probably should be in the python wrapper, but I didn't want to spend the time to figure out how to do that right now.
@meren, please try to test this on your local machine. You should be able to use the Prochlorococcus example just like in the pangenomics tutorial:
anvi-get-enriched-functions-per-pan-group -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
-g PROCHLORO-GENOMES.db \
--category light \
--annotation-source COG_FUNCTION \
-o PROCHLORO-PAN-enriched-functions-light.txt
I currently get an error in the log file that I created for the R script:
Error: Column `function_accession` can't be modified because it's a grouping variable
Execution halted
If it helps at all, I added a code comment here with a way to check for required R packages from within your python script.
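One possible way to do such a check from Python is sketched below. It is not necessarily the approach in the code comment mentioned above: it assumes Rscript is on the PATH and asks R, one package at a time, whether the package can be found. The function name missing_r_packages is hypothetical.

```python
import subprocess

REQUIRED_R_PACKAGES = ["tidyverse", "magrittr", "qvalue"]


def missing_r_packages(packages, runner=subprocess.run):
    """Return the subset of `packages` that R cannot find."""
    missing = []
    for package in packages:
        # requireNamespace() returns FALSE instead of stopping when a package
        # is absent, so we turn that into a non-zero exit status.
        expression = f"if (!requireNamespace('{package}', quietly = TRUE)) quit(status = 1)"
        result = runner(["Rscript", "-e", expression], capture_output=True, text=True)
        if result.returncode != 0:
            missing.append(package)
    return missing
```

Calling missing_r_packages(REQUIRED_R_PACKAGES) before launching the R script would let the wrapper raise its own error that names the missing packages. The runner argument is there only so the function can be tested without R.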
Hopefully, I'm not late....
You can just check for R, and let the Rscript check for the packages. If the user has them installed they'll load, otherwise they'll be installed:
# Check whether a package is already installed
is_installed = function(mypkg) is.element(mypkg, installed.packages()[,1])

# Install each package if it isn't already installed, then load it.
# Call with the list of packages: load_or_install(c("pkg1", "pkg2",..., "pkgn"))
load_or_install = function(package_names)
{
    for(package_name in package_names)
    {
        if(!is_installed(package_name))
        {
            install.packages(package_name,repos="http://lib.stat.cmu.edu/R/CRAN")
        }
        library(package_name,character.only=TRUE,quietly=TRUE,verbose=FALSE)
    }
}

# library(tools) has the file_path_sans_ext(filename) function
# Note: qvalue is distributed through Bioconductor, not CRAN, so install.packages()
# with a CRAN repo will not find it if it is missing.
load_or_install(c("tidyverse", "magrittr", "qvalue"))
Alternatively, you can use pacman for package management. You still need to make sure pacman itself is installed, but it does a better job if you require specific versions or load packages from github. This is an example adapted from a pipeline we have been developing:
# Check whether a package is already installed
is_installed = function(mypkg) is.element(mypkg, installed.packages()[,1])

load_or_install = function(package_names)
{
    for(package_name in package_names)
    {
        if(!is_installed(package_name))
        {
            install.packages(package_name,repos="http://lib.stat.cmu.edu/R/CRAN")
        }
        library(package_name,character.only=TRUE,quietly=TRUE,verbose=FALSE)
    }
}

## Only use this fx for pacman and BiocManager (in case you use something from bioconductor)
load_or_install(c("pacman", "BiocManager"))

## Pacman commands
### if you need to load from github
p_load_gh(c("tidyverse/ggplot2"))

### anything else will be loaded from either CRAN or Bioconductor. Pacman will locate them without having to specify the source
p_load("knitr",
       "kableExtra",
       "dplyr", "tidyr",
       "rmarkdown",
       "grid", "gridBase", "gridExtra",
       "phyloseq",
       "plotly")
@xvazquezc you should feel free to edit the code directly if you ask me :) clearly you are the expert :)
(you are already listed as a developer, so you have all the authority :p)
It was mostly a suggestion, you know... I don't know if you prefer to control the packages with Python or else... If I have a moment I'll submit something today
No pressure at all! I just wanted to make sure it is clear that you are welcome. Of course your suggestion is invaluable, too.
@xvazquezc, thank you for the suggestion!
There are two reasons why I like @mooreryan's suggestion better:
R script just to get an error once we get to the R part.
anvi-get-enriched-functions-per-pan-group requires the following packages: tidyverse, magrittr, qvalue.
The question is how do we deal with that?
R is a pre-requisite for anvi'o, which as far as I know is not the case so far. On one hand, I doubt that there are users who have anvi'o and don't have R installed, but on the other hand, if there are, it seems draconian to suddenly require R just for the enrichment script.
If the user doesn't have R then they would get whatever error you get when you don't have R :-) and if they do have R installed then we would check for the packages, and if any are missing we raise an error saying which packages are missing (and maybe a tip on how to install them?)
@ozcan, @meren, let me know what you think.
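For the "error saying which packages are missing, plus a tip" part, a rough sketch could look like this. The conda commands are the ones listed earlier in this thread; the function name and the message wording are hypothetical.

```python
# Conda commands as posted earlier in this thread.
CONDA_INSTALL_COMMANDS = {
    "tidyverse": "conda install -c r r-tidyverse",
    "magrittr": "conda install -c bioconda r-magrittr",
    "qvalue": "conda install -c bioconda bioconductor-qvalue",
}


def missing_packages_message(missing):
    """Build an error message naming the missing R packages, with install tips."""
    if not missing:
        return ""
    lines = ["The following R packages are required but missing: " + ", ".join(missing) + "."]
    tips = [CONDA_INSTALL_COMMANDS[name] for name in missing if name in CONDA_INSTALL_COMMANDS]
    if tips:
        lines.append("You may be able to install them with:")
        lines.extend("    " + tip for tip in tips)
    return "\n".join(lines)
```

The wrapper would raise its usual configuration error with this text whenever the list of missing packages is not empty.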