merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Make sure R packages are installed for anvi-get-enriched-functions-per-pan-group #1241

ShaiberAlon closed 4 years ago

ShaiberAlon commented 4 years ago

anvi-get-enriched-functions-per-pan-group requires the following packages: tidyverse magrittr qvalue

The question is how do we deal with that?

  1. We can install these as part of the anvi'o installation, but that would make R a prerequisite for anvi'o, which as far as I know has not been the case so far. On one hand, I doubt that there are users who have anvi'o but don't have R installed; on the other hand, if there are, it seems draconian to suddenly require R just for the enrichment script.
  2. We can leave it to the user. If they try to run the script without R, they would get whatever error you get when you don't have R :-) If they do have R installed, we would check for the packages and, if any are missing, raise an error listing the missing packages (and maybe a tip on how to install them?)

@ozcan, @meren, let me know what you think.

meren commented 4 years ago

I think it would be best to not require R to be installed at this time. If the Python program that calls this script can catch the error and communicate to the user with an exception, they would know that they need to install X and Y.

But we could include R along with these libraries in our docker image for v6.

adw96 commented 4 years ago

Possible workflow:

  1. Check for an R installation and throw an error if R isn't installed on the system.
  2. If R and the packages are installed, silently load tidyverse, magrittr and qvalue.
  3. If R is installed but the packages are not, throw an error and suggest the user install those packages with conda. I can provide the specific commands.
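
For illustration, the three steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: `RSetupError`, `check_r_setup` and the `package_is_available` callable are hypothetical names, and how the wrapper actually asks R about a package is left open.

```python
import shutil

REQUIRED_R_PACKAGES = ("tidyverse", "magrittr", "qvalue")


class RSetupError(Exception):
    """Raised when R or a required R package is missing (hypothetical name)."""


def check_r_setup(package_is_available, find_executable=shutil.which):
    """Apply the three steps above and return the packages that were found.

    package_is_available is a hypothetical callable that takes a package
    name and returns True if R can load it.
    """
    # Step 1: no Rscript on the PATH means R is not installed.
    if find_executable("Rscript") is None:
        raise RSetupError("R does not seem to be installed: 'Rscript' was not found.")

    # Steps 2 and 3: collect the packages R cannot load.
    missing = [p for p in REQUIRED_R_PACKAGES if not package_is_available(p)]
    if missing:
        raise RSetupError(
            "The following R packages are missing: " + ", ".join(missing)
            + ". Please install them (for example with conda) and try again."
        )
    return list(REQUIRED_R_PACKAGES)
```

Passing the two lookups in as arguments keeps the decision logic testable on a machine that has no R at all.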

ShaiberAlon commented 4 years ago

@meren, there is no python program calling this. This is simply an R script.

adw96 commented 4 years ago

@ShaiberAlon you're talking about anvi-script-gen_stats_for_single_copy_genes.R?

adw96 commented 4 years ago

Conda installs:

conda install -c r r-tidyverse
conda install -c bioconda r-magrittr
conda install -c bioconda bioconductor-qvalue
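
If the error message is meant to carry these commands, a small lookup would be enough. The sketch below assumes a hypothetical helper name, `install_hint`; the commands and channels are copied as given above and have not been checked against the current conda channels.

```python
# Commands copied from the comment above; the channels are as given there.
CONDA_INSTALL_COMMANDS = {
    "tidyverse": "conda install -c r r-tidyverse",
    "magrittr": "conda install -c bioconda r-magrittr",
    "qvalue": "conda install -c bioconda bioconductor-qvalue",
}


def install_hint(missing_packages):
    """Build a multi-line tip for the given missing R packages (hypothetical helper)."""
    lines = ["Missing R packages can be installed with:"]
    for name in missing_packages:
        # Fall back to a generic note for packages without a known command.
        fallback = f"(no conda command known for {name})"
        lines.append("  " + CONDA_INSTALL_COMMANDS.get(name, fallback))
    return "\n".join(lines)
```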

ShaiberAlon commented 4 years ago

No, I am talking about anvi-get-enriched-functions-per-pan-group. It is simply an R script. And so if R is not installed there is no way for us to catch this. It would just be a system error, something like this:

env: Rscript: No such file or directory

@meren, are you suggesting that we would have a python wrapper that checks things are installed and only then executes the R script?
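
To make the idea concrete: a wrapper like that could turn the system error into a readable message by catching the exception Python raises when the executable is missing. A minimal sketch, where `run_r_script` and the example command are placeholders rather than anything that exists in anvi'o:

```python
import subprocess


def run_r_script(command):
    """Run an R script command, e.g. ["Rscript", "script.R", "in.txt"].

    The script name and its arguments are placeholders, not a real interface.
    """
    try:
        return subprocess.run(command, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # Raised when the executable (here: Rscript) is not on the PATH.
        raise RuntimeError(
            f"'{command[0]}' was not found. This program needs R to be installed."
        ) from None
```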

meren commented 4 years ago

> are you suggesting that we would have a python wrapper

I had envisioned that anvi-get-enriched-functions-per-pan-group would be a Python program that takes in pan db, genome storage, categorical variable, etc, and it would call the R script only for the test we did not want to implement in Python. So the name of the script would be much more specific to what it is testing. In that case we could generate all the input files from within Python, and call it to produce an output file which we could read back in Python.
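
A rough sketch of that round trip (write the input from Python, call the external script, read the output back) might look like the following. The function name, the script name in the docstring, and the assumption that the script takes an input path and an output path as its last two arguments are all illustrative; the real interface may differ.

```python
import subprocess
import tempfile
from pathlib import Path


def run_external_test(rows, command_prefix):
    """Write rows to a TAB-delimited file, run an external script, read its output.

    command_prefix is the command without file arguments, for example
    ["Rscript", "enrichment-test.R"] (a hypothetical script name).
    """
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "input.txt"
        output_path = Path(tmp) / "output.txt"

        # Generate the input file from within Python.
        input_path.write_text(
            "".join("\t".join(map(str, row)) + "\n" for row in rows)
        )

        # Call the external script; check=True raises CalledProcessError on failure.
        subprocess.run(
            command_prefix + [str(input_path), str(output_path)],
            check=True, capture_output=True, text=True,
        )

        # Read the results back before the temporary directory is removed.
        return [line.split("\t") for line in output_path.read_text().splitlines()]
```

Keeping both files in a temporary directory means nothing is left behind if the script fails halfway.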

In that scenario, a call to this would make sure the user knows if they want this they better have R:

https://github.com/merenlab/anvio/blob/master/anvio/utils.py#L302

adw96 commented 4 years ago

@meren There should be python running on either side of the R script, i.e., the user should not directly call the R script; it should be called within a python program. Additional work is needed to generate the temp txt file with the presence-absence data (@ShaiberAlon gave me the desired input and output formats), and then to close out the command.

adw96 commented 4 years ago

Essentially, the R script replaces some components of the existing python-only workflow, but not all of it.

ShaiberAlon commented 4 years ago

Ok, I’m on it. I’ll make it that way. But I’ll only get to do that on Thursday, since tomorrow I’m traveling.

adw96 commented 4 years ago

Thanks @ShaiberAlon! You may wish to change the input arguments to the R script if needed. The format is quite straightforward and you can follow the lead of the current input arguments; let me know if you have questions.

ShaiberAlon commented 4 years ago

Just FYI: a python script to prepare the input was already there (anvi-get-functional-occurrence-summary-per-pan-group; I pasted its help menu below), but it is not a wrapper. What I originally had in mind was that the user would call the python script to generate the input for the R script, and then call the R script themselves. But I think having one python script that does everything is better.

$ anvi-get-functional-occurrence-summary-per-pan-group -h
usage: anvi-get-functional-occurrence-summary-per-pan-group
       [-h] -p PAN_DB [-g GENOMES_STORAGE] [--category-variable CATEGORY]
       [--annotation-source SOURCE NAME] [-l] -o FILE_PATH [-F FILE]
       [--exclude-ungrouped]

A program that takes a pangenome, and a categorical layers additional data
item, and generates the input for anvi-get-enriched-functions-per-pan-group.
If requested a functional occurrence table across genomes is also generated.

optional arguments:
  -h, --help            show this help message and exit

INPUT FILES:
  Input files from the pangenome analysis.

  -p PAN_DB, --pan-db PAN_DB
                        Anvi'o pan database
  -g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE
                        Anvi'o genomes storage file

CATEGORY VARIABLE AND FUNCTIONAL ANNOTATION SOURCE:
  This is the layers additional data item in which your genomes are split
  into multiple groups. So anvi'o can figure out what functions are specific
  to each group of genomes in your pangenomic analysis. If this is not
  making any sense, please take a look at the online tutorial for
  pangenomics (http://merenlab.org/2016/11/08/pangenomics-v2/).

  --category-variable CATEGORY
                        The additional layers data variable name that divides
                        layers into multiple categories.
  --annotation-source SOURCE NAME
                        Get functional annotations for a specific annotation
                        source. You can use the flag '--list-annotation-
                        sources' to learn about what sources are available.
  -l, --list-annotation-sources
                        List available functional annotation sources.

REPORTING:
  Output and stuff.

  -o FILE_PATH, --output-file FILE_PATH
                        File path to store results.
  -F FILE, --functional-occurrence-table-output FILE
                        Saves the occurrence frequency information for
                        functions in genomes in a TAB-delimited format. A file
                        name must be provided. To learn more about how the
                        functional occurrence is computed, please refer to the
                        tutorial.

OPTIONAL PARAMETERS:
  Parameters to help you filter the output.

  --exclude-ungrouped   Use this flag if you want anvi'o to ignore genomes
                        with no value set for the catergory variable (which
                        you specified using --category-variable). By default
                        all variables with no value will be considered as a
                        single group when preforming the statistical analysis.

ShaiberAlon commented 4 years ago

@meren, @adw96 , I completed the wrapper for this, including the sanity check for R and for the packages (https://github.com/merenlab/anvio/commit/91f9cf1531febdbf96feb74c3a68747b91e868de). Currently the sanity check for packages is within the R script. It probably should be in the python wrapper, but I didn't want to spend the time to figure out how to do that right now.

@meren, please try to test this locally. You should be able to use the Prochlorococcus example just like in the pangenomics tutorial:

anvi-get-enriched-functions-per-pan-group -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                                          -g PROCHLORO-GENOMES.db \
                                          --category light \
                                          --annotation-source COG_FUNCTION \
                                          -o PROCHLORO-PAN-enriched-functions-light.txt

I currently get an error in the log file that I created for the R script:

Error: Column `function_accession` can't be modified because it's a grouping variable
Execution halted

mooreryan commented 4 years ago

If it helps at all, I added a code comment here with a way to check for required R packages from within your python script.
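
The linked comment is not reproduced in this thread, so the following is not necessarily the same approach. One possible way to do such a check from Python is to ask `Rscript` to load each package and look at the exit status. The function names are hypothetical, and the `run` parameter exists only so the check can be exercised without R.

```python
import subprocess


def r_package_is_available(package, run=subprocess.run):
    """Return True if `Rscript -e "library('<package>')"` exits with status 0."""
    expression = f"library('{package}')"
    try:
        result = run(
            ["Rscript", "--vanilla", "-e", expression],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # Rscript itself is missing, so no package can be loaded.
        return False
    return result.returncode == 0


def missing_r_packages(packages, run=subprocess.run):
    """Return the subset of packages that R could not load, in the given order."""
    return [p for p in packages if not r_package_is_available(p, run=run)]
```

Running one `Rscript` process per package is slow but simple; a single call that loads all packages would be faster at the cost of a less specific error.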

xvazquezc commented 4 years ago

Hopefully, I'm not late....

You can just check for R, and let the Rscript check for the packages. If the user has them installed they'll load, otherwise they'll be installed:

is_installed = function(mypkg) is.element(mypkg, installed.packages()[,1])

# Load every package in package_names, installing it first if it is missing.
# Usage: load_or_install(c("pkg1", "pkg2", ..., "pkgn"))
# Note: install.packages() with a CRAN mirror only finds CRAN packages. qvalue is
# distributed through Bioconductor (see the bioconductor-qvalue conda package
# above), so it would need BiocManager::install("qvalue") instead.
load_or_install = function(package_names)
{
  for(package_name in package_names)
  {
    if(!is_installed(package_name))
    {
      install.packages(package_name, repos="http://lib.stat.cmu.edu/R/CRAN")
    }
    library(package_name, character.only=TRUE, quietly=TRUE, verbose=FALSE)
  }
}

load_or_install(c("tidyverse", "magrittr", "qvalue"))

Alternatively, you can use pacman for package management. You still need to make sure pacman itself is installed, but it does a better job if you require specific versions or load packages from GitHub. This is an example adapted from a pipeline we have been developing:

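# Same helpers as in the first example, repeated so this script runs on its own.
is_installed = function(mypkg) is.element(mypkg, installed.packages()[,1])

load_or_install = function(package_names)
{
  for(package_name in package_names)
  {
    if(!is_installed(package_name))
    {
      install.packages(package_name, repos="http://lib.stat.cmu.edu/R/CRAN")
    }
    library(package_name, character.only=TRUE, quietly=TRUE, verbose=FALSE)
  }
}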

## Only use this fx for pacman and BiocManager (in case you use something from bioconductor)
load_or_install(c("pacman", "BiocManager")) 

## Pacman commands
### if you need to load from github
p_load_gh(c("tidyverse/ggplot2")) 

### anything else will be loaded from either CRAN or Bioconductor. Pacman will locate them without having to specify the source 
p_load("knitr",
       "kableExtra",
        "dplyr", "tidyr",
       "rmarkdown",
       "grid", "gridBase", "gridExtra",
       "phyloseq",
       "plotly")

meren commented 4 years ago

@xvazquezc you should feel free to edit the code directly if you ask me :) clearly you are the expert :)

meren commented 4 years ago

(you are already listed as a developer, so you have all the authority :p)

xvazquezc commented 4 years ago

It was mostly a suggestion, you know... I don't know if you prefer to control the packages with Python or else... If I have a moment I'll submit something today

meren commented 4 years ago

No pressure at all! I just wanted to make sure it is clear that you are welcome. Of course your suggestion is invaluable, too.

ShaiberAlon commented 4 years ago

@xvazquezc, thank you for the suggestion!

There are two reasons why I like @mooreryan's suggestion better:

  1. I would like to control things from the python wrapper, since otherwise we run a bunch of preliminary steps in python before the R script just to get an error once we get to the R part.
  2. I prefer to leave it for the user to choose if and how to install the packages (is that silly?)