Error when running functional enrichment

ShaiberAlon commented 5 years ago

When I run anvi-get-enriched-functions-per-pan-group, I get the following error:

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:
              /tmp/tmpf88fzvye

And the aforementioned log file includes this error:

Error: Column `function_accession` can't be modified because it's a grouping variable
Execution halted

Potential solution - R version

I am using:

R version 3.5.1 (2018-07-02) -- "Feather Spray"

So this could be an issue with my local version. If so, maybe we should add a check for the R version, or at least a message stating the minimum version of R that we require.

I will try installing the latest R version and test this again.
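A minimal sketch of the version check suggested above, assuming the check would live on the Python side and parse the banner line that R prints. The names `MIN_R_VERSION`, `parse_r_version`, and `r_version_is_supported` are hypothetical, and the minimum version is a placeholder, since nothing in this thread establishes which R version is actually required.

```python
import re

# Hypothetical minimum; this thread does not establish which R version is required.
MIN_R_VERSION = (3, 5, 0)


def parse_r_version(banner):
    """Extract (major, minor, patch) from an R banner line."""
    match = re.search(r"R version (\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no R version found in: %r" % banner)
    return tuple(int(part) for part in match.groups())


def r_version_is_supported(banner, minimum=MIN_R_VERSION):
    """Return True when the parsed version is at least `minimum`."""
    return parse_r_version(banner) >= minimum


banner = 'R version 3.5.1 (2018-07-02) -- "Feather Spray"'
print(parse_r_version(banner), r_version_is_supported(banner))
```

Comparing tuples of integers avoids the string-comparison trap where "3.10" would sort before "3.5". In practice the banner could be captured from the first line of `R --version`.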

Reproducing this

To reproduce this, you can download the following data package: https://drive.google.com/file/d/1crwvvDpK_AqC2ngivcZfETOyj7brDhPL/view?usp=sharing

Uncompress the data folder and cd into it:

tar -xzvf TEST-PACKAGE.tar.gz
cd TEST-PACKAGE

And then run the enrichment test:

anvi-get-enriched-functions-per-pan-group -p PAN.db \
                                          -g GENOMES.db \
                                          -o Functional_enrichment_2_groups.txt \
                                          --category-variable light \
                                          --annotation-source COG_FUNCTION

ShaiberAlon commented 5 years ago

I ran this again with the latest R version:

R version 3.6.1 (2019-07-05) -- "Action of the Toes"

And, sadly, I still get the same error.

@adw96 , do you have an idea regarding where this error is coming from? Maybe some R expert in your lab could help me get to the bottom of this?

meren commented 5 years ago

@ShaiberAlon, when I run the enrichment test using your TEST-PACKAGE.tar.gz this is how things went:

First try:

anvi-get-enriched-functions-per-pan-group -p PAN.db \
>                                           -g GENOMES.db \
>                                           -o Functional_enrichment_2_groups.txt \
>                                           --category-variable light \
>                                           --annotation-source COG_FUNCTION
Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Config Error: The following R packages are required in order to run this program, but are
              missing: qvalue. You can install these packages using conda by running the
              following commands: "conda install -c bioconda bioconductor-qvalue"

I first tried to install bioconductor-qvalue through the R terminal, and of course it wasn't available. So I tried conda, and this is how it went:

## Package Plan ##

  environment location: /Users/meren/miniconda3

  added / updated specs:
    - bioconductor-qvalue

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _r-mutex-1.0.0             |      anacondar_1           2 KB
    bioconductor-qvalue-2.8.0  |                0         2.7 MB  bioconda
    ca-certificates-2019.8.28  |                0         133 KB
    certifi-2019.9.11          |           py36_0         154 KB
    conda-4.7.12               |           py36_0         3.0 MB
    curl-7.61.1                |       ha441bb4_0         122 KB
    icu-58.2                   |       h4b95b61_1        10.1 MB
    jpeg-9b                    |       he5867d9_2         201 KB
    libgcc-4.8.5               |      hdbeacc1_10         250 KB
    libpng-1.6.37              |       ha441bb4_0         262 KB
    libtiff-4.0.10             |       hcb84e12_2         394 KB
    openssl-1.0.2t             |       h1de35cc_1         2.0 MB
    pcre-8.43                  |       h0a44026_0         185 KB
    r-3.3.1                    |         r3.3.1_0          620 B
    r-base-3.3.1               |                0        47.2 MB
    r-boot-1.3_18              |         r3.3.1_0         576 KB
    r-class-7.3_14             |         r3.3.1_0          82 KB
    r-cluster-2.0.4            |         r3.3.1_0         472 KB
    r-codetools-0.2_14         |         r3.3.1_0          45 KB
    r-colorspace-1.2_6         |         r3.3.1_0         375 KB
    r-dichromat-2.0_0          |         r3.3.1_2         146 KB
    r-digest-0.6.9             |         r3.3.1_0         113 KB
    r-foreign-0.8_66           |         r3.3.1_0         225 KB
    r-ggplot2-2.1.0            |         r3.3.1_0         2.0 MB  bioconda
    r-gtable-0.2.0             |         r3.3.1_0          57 KB
    r-kernsmooth-2.23_15       |         r3.3.1_0          81 KB
    r-labeling-0.3             |         r3.3.1_2          40 KB
    r-lattice-0.20_33          |         r3.3.1_0         697 KB
    r-magrittr-1.5             |         r3.3.1_2         155 KB
    r-mass-7.3_45              |         r3.3.1_0         1.0 MB
    r-matrix-1.2_6             |         r3.3.1_0         3.1 MB
    r-mgcv-1.8_12              |         r3.3.1_0         1.9 MB
    r-munsell-0.4.3            |         r3.3.1_0         130 KB
    r-nlme-3.1_128             |         r3.3.1_0         2.0 MB
    r-nnet-7.3_12              |         r3.3.1_0          98 KB
    r-plyr-1.8.4               |         r3.3.1_0         738 KB
    r-rcolorbrewer-1.1_2       |         r3.3.1_3          28 KB
    r-rcpp-0.12.5              |         r3.3.1_0         2.2 MB
    r-recommended-3.3.1        |         r3.3.1_0          767 B
    r-reshape2-1.4.1           |         r3.3.1_2         103 KB
    r-rpart-4.1_10             |         r3.3.1_0         863 KB
    r-scales-0.4.1             |         r3.3.1_1         204 KB  bioconda
    r-spatial-7.3_11           |         r3.3.1_0         121 KB
    r-stringi-1.1.1            |         r3.3.1_0        10.8 MB
    r-stringr-1.1.0            |         r3.3.1_0         113 KB  bioconda
    r-survival-2.39_4          |         r3.3.1_0         4.5 MB
    ------------------------------------------------------------
                                           Total:        99.5 MB

The following NEW packages will be INSTALLED:

  _r-mutex           pkgs/r/osx-64::_r-mutex-1.0.0-anacondar_1
  bioconductor-qval~ bioconda/osx-64::bioconductor-qvalue-2.8.0-0
  curl               pkgs/main/osx-64::curl-7.61.1-ha441bb4_0
  icu                pkgs/main/osx-64::icu-58.2-h4b95b61_1
  jpeg               pkgs/main/osx-64::jpeg-9b-he5867d9_2
  libgcc             pkgs/main/osx-64::libgcc-4.8.5-hdbeacc1_10
  libpng             pkgs/main/osx-64::libpng-1.6.37-ha441bb4_0
  libtiff            pkgs/main/osx-64::libtiff-4.0.10-hcb84e12_2
  pcre               pkgs/main/osx-64::pcre-8.43-h0a44026_0
  r                  pkgs/r/osx-64::r-3.3.1-r3.3.1_0
  r-base             pkgs/r/osx-64::r-base-3.3.1-0
  r-boot             pkgs/r/osx-64::r-boot-1.3_18-r3.3.1_0
  r-class            pkgs/r/osx-64::r-class-7.3_14-r3.3.1_0
  r-cluster          pkgs/r/osx-64::r-cluster-2.0.4-r3.3.1_0
  r-codetools        pkgs/r/osx-64::r-codetools-0.2_14-r3.3.1_0
  r-colorspace       pkgs/r/osx-64::r-colorspace-1.2_6-r3.3.1_0
  r-dichromat        pkgs/r/osx-64::r-dichromat-2.0_0-r3.3.1_2
  r-digest           pkgs/r/osx-64::r-digest-0.6.9-r3.3.1_0
  r-foreign          pkgs/r/osx-64::r-foreign-0.8_66-r3.3.1_0
  r-ggplot2          bioconda/osx-64::r-ggplot2-2.1.0-r3.3.1_0
  r-gtable           pkgs/r/osx-64::r-gtable-0.2.0-r3.3.1_0
  r-kernsmooth       pkgs/r/osx-64::r-kernsmooth-2.23_15-r3.3.1_0
  r-labeling         pkgs/r/osx-64::r-labeling-0.3-r3.3.1_2
  r-lattice          pkgs/r/osx-64::r-lattice-0.20_33-r3.3.1_0
  r-magrittr         pkgs/r/osx-64::r-magrittr-1.5-r3.3.1_2
  r-mass             pkgs/r/osx-64::r-mass-7.3_45-r3.3.1_0
  r-matrix           pkgs/r/osx-64::r-matrix-1.2_6-r3.3.1_0
  r-mgcv             pkgs/r/osx-64::r-mgcv-1.8_12-r3.3.1_0
  r-munsell          pkgs/r/osx-64::r-munsell-0.4.3-r3.3.1_0
  r-nlme             pkgs/r/osx-64::r-nlme-3.1_128-r3.3.1_0
  r-nnet             pkgs/r/osx-64::r-nnet-7.3_12-r3.3.1_0
  r-plyr             pkgs/r/osx-64::r-plyr-1.8.4-r3.3.1_0
  r-rcolorbrewer     pkgs/r/osx-64::r-rcolorbrewer-1.1_2-r3.3.1_3
  r-rcpp             pkgs/r/osx-64::r-rcpp-0.12.5-r3.3.1_0
  r-recommended      pkgs/r/osx-64::r-recommended-3.3.1-r3.3.1_0
  r-reshape2         pkgs/r/osx-64::r-reshape2-1.4.1-r3.3.1_2
  r-rpart            pkgs/r/osx-64::r-rpart-4.1_10-r3.3.1_0
  r-scales           bioconda/osx-64::r-scales-0.4.1-r3.3.1_1
  r-spatial          pkgs/r/osx-64::r-spatial-7.3_11-r3.3.1_0
  r-stringi          pkgs/r/osx-64::r-stringi-1.1.1-r3.3.1_0
  r-stringr          bioconda/osx-64::r-stringr-1.1.0-r3.3.1_0
  r-survival         pkgs/r/osx-64::r-survival-2.39_4-r3.3.1_0

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2019.6.1~ --> pkgs/main::ca-certificates-2019.8.28-0
  certifi             conda-forge::certifi-2019.6.16-py36_1 --> pkgs/main::certifi-2019.9.11-py36_0
  conda                    conda-forge::conda-4.7.10-py36_0 --> pkgs/main::conda-4.7.12-py36_0
  openssl            conda-forge::openssl-1.0.2r-h1de35cc_0 --> pkgs/main::openssl-1.0.2t-h1de35cc_1

Then I got this error:

anvi-get-enriched-functions-per-pan-group -p PAN.db \
>                                           -g GENOMES.db \
>                                           -o Functional_enrichment_2_groups.txt \
>                                           --category-variable light \
>                                           --annotation-source COG_FUNCTION
Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Traceback for debugging
================================================================================
  File "/Users/meren/github/anvio/bin/anvi-get-enriched-functions-per-pan-group", line 69, in <module>
    main(args)
  File "/Users/meren/github/anvio/bin/anvi-get-enriched-functions-per-pan-group", line 38, in main
    s.functional_enrichment_stats()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 333, in functional_enrichment_stats
    ret_val = utils.run_command(["Rscript", "-e", "library('%s')" % lib], log_file)
  File "/Users/meren/github/anvio/anvio/utils.py", line 396, in run_command
    raise ConfigError("command was terminated")
================================================================================

Config Error: command was terminated

Then, this is what R says:

(base) (anvio-master) meren ~/Downloads/TEST-PACKAGE $ R
dyld: Library not loaded: @rpath/libicuuc.54.dylib
  Referenced from: /Users/meren/miniconda3/lib/R/lib/libR.dylib
  Reason: image not found
Abort trap: 6

As a result, I first ran this:

conda uninstall r r-base

And now I am back to this:

anvi-get-enriched-functions-per-pan-group -p PAN.db \
>                                           -g GENOMES.db \
>                                           -o Functional_enrichment_2_groups.txt \
>                                           --category-variable light \
>                                           --annotation-source COG_FUNCTION
Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Traceback for debugging
================================================================================
  File "/Users/meren/github/anvio/bin/anvi-get-enriched-functions-per-pan-group", line 69, in <module>
    main(args)
  File "/Users/meren/github/anvio/bin/anvi-get-enriched-functions-per-pan-group", line 38, in main
    s.functional_enrichment_stats()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 342, in functional_enrichment_stats
    ', '.join(['"%s"' % package_dict[i] for i in missing_packages])))
================================================================================

Config Error: The following R packages are required in order to run this program, but are
              missing: qvalue. You can install these packages using conda by running the
              following commands: "conda install -c bioconda bioconductor-qvalue"
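The tracebacks above show the package check shelling out to `Rscript -e "library('...')"` once per package. A minimal sketch of that idea follows, not the actual anvi'o code: `missing_r_packages` and its `runner` parameter are hypothetical names, and `runner` exists only so the logic can be exercised without R installed.

```python
import subprocess


def missing_r_packages(packages, runner=None):
    """Return the packages that Rscript fails to load.

    `runner` takes an argument list and returns an exit code. It defaults to
    a real Rscript call and is injectable purely for testing (an assumption
    of this sketch, not something anvi'o is known to do).
    """
    if runner is None:
        def runner(cmd):
            return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL).returncode
    missing = []
    for package in packages:
        try:
            exit_code = runner(["Rscript", "-e", "library('%s')" % package])
        except OSError:
            # Rscript itself is absent or could not be started.
            exit_code = 1
        if exit_code != 0:
            missing.append(package)
    return missing
```

Note that this sketch cannot tell a missing package from a broken R installation: an R that aborts on startup, as in the dyld error above, also yields a non-zero exit code, so every package would be reported as missing.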

R version is this:

(base) (anvio-master) meren ~/Downloads/TEST-PACKAGE $ R

R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin18.2.0 (64-bit)

I hope this is useful for something.

ShaiberAlon commented 5 years ago

@meren, do you mind installing the latest version of R and trying again, to see whether the qvalue package then installs properly?

I installed it using conda after installing the latest R and it went ok.

meren commented 5 years ago

@ShaiberAlon, I took your advice and installed the latest version:

(base) (anvio-master) meren ~/Downloads/TEST-PACKAGE $ R

R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

Then I got the same missing-package error:

anvi-get-enriched-functions-per-pan-group -p PAN.db \
>                                           -g GENOMES.db \
>                                           -o Functional_enrichment_2_groups.txt \
>                                           --category-variable light \
>                                           --annotation-source COG_FUNCTION

Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Config Error: The following R packages are required in order to run this program, but are
              missing: qvalue. You can install these packages using conda by running the
              following commands: "conda install -c bioconda bioconductor-qvalue"

Trying to install it with conda gave me this output, and I nope'd the F-out:

conda install -c bioconda bioconductor-qvalue
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/meren/miniconda3

  added / updated specs:
    - bioconductor-qvalue

The following NEW packages will be INSTALLED:

  _r-mutex           pkgs/r/osx-64::_r-mutex-1.0.0-anacondar_1
  bioconductor-qval~ bioconda/osx-64::bioconductor-qvalue-2.8.0-0
  curl               pkgs/main/osx-64::curl-7.61.1-ha441bb4_0
  icu                pkgs/main/osx-64::icu-58.2-h4b95b61_1
  jpeg               pkgs/main/osx-64::jpeg-9b-he5867d9_2
  libgcc             pkgs/main/osx-64::libgcc-4.8.5-hdbeacc1_10
  libiconv           pkgs/main/osx-64::libiconv-1.15-hdd342a3_7
  libpng             pkgs/main/osx-64::libpng-1.6.37-ha441bb4_0
  libtiff            pkgs/main/osx-64::libtiff-4.0.10-hcb84e12_2
  libxml2            pkgs/main/osx-64::libxml2-2.9.9-hf6e021a_1
  pcre               pkgs/main/osx-64::pcre-8.43-h0a44026_0
  r                  pkgs/r/osx-64::r-3.3.1-r3.3.1_0
  r-base             pkgs/r/osx-64::r-base-3.3.1-0
  r-boot             pkgs/r/osx-64::r-boot-1.3_18-r3.3.1_0
  r-class            pkgs/r/osx-64::r-class-7.3_14-r3.3.1_0
  r-cluster          pkgs/r/osx-64::r-cluster-2.0.4-r3.3.1_0
  r-codetools        pkgs/r/osx-64::r-codetools-0.2_14-r3.3.1_0
  r-colorspace       pkgs/r/osx-64::r-colorspace-1.2_6-r3.3.1_0
  r-dichromat        pkgs/r/osx-64::r-dichromat-2.0_0-r3.3.1_2
  r-digest           pkgs/r/osx-64::r-digest-0.6.9-r3.3.1_0
  r-foreign          pkgs/r/osx-64::r-foreign-0.8_66-r3.3.1_0
  r-ggplot2          bioconda/osx-64::r-ggplot2-2.1.0-r3.3.1_0
  r-gtable           pkgs/r/osx-64::r-gtable-0.2.0-r3.3.1_0
  r-kernsmooth       pkgs/r/osx-64::r-kernsmooth-2.23_15-r3.3.1_0
  r-labeling         pkgs/r/osx-64::r-labeling-0.3-r3.3.1_2
  r-lattice          pkgs/r/osx-64::r-lattice-0.20_33-r3.3.1_0
  r-magrittr         pkgs/r/osx-64::r-magrittr-1.5-r3.3.1_2
  r-mass             pkgs/r/osx-64::r-mass-7.3_45-r3.3.1_0
  r-matrix           pkgs/r/osx-64::r-matrix-1.2_6-r3.3.1_0
  r-mgcv             pkgs/r/osx-64::r-mgcv-1.8_12-r3.3.1_0
  r-munsell          pkgs/r/osx-64::r-munsell-0.4.3-r3.3.1_0
  r-nlme             pkgs/r/osx-64::r-nlme-3.1_128-r3.3.1_0
  r-nnet             pkgs/r/osx-64::r-nnet-7.3_12-r3.3.1_0
  r-plyr             pkgs/r/osx-64::r-plyr-1.8.4-r3.3.1_0
  r-rcolorbrewer     pkgs/r/osx-64::r-rcolorbrewer-1.1_2-r3.3.1_3
  r-rcpp             pkgs/r/osx-64::r-rcpp-0.12.5-r3.3.1_0
  r-recommended      pkgs/r/osx-64::r-recommended-3.3.1-r3.3.1_0
  r-reshape2         pkgs/r/osx-64::r-reshape2-1.4.1-r3.3.1_2
  r-rpart            pkgs/r/osx-64::r-rpart-4.1_10-r3.3.1_0
  r-scales           bioconda/osx-64::r-scales-0.4.1-r3.3.1_1
  r-spatial          pkgs/r/osx-64::r-spatial-7.3_11-r3.3.1_0
  r-stringi          pkgs/r/osx-64::r-stringi-1.1.1-r3.3.1_0
  r-stringr          bioconda/osx-64::r-stringr-1.1.0-r3.3.1_0
  r-survival         pkgs/r/osx-64::r-survival-2.39_4-r3.3.1_0
  zstd               pkgs/main/osx-64::zstd-1.3.7-h5bba6e5_0

Proceed ([y]/n)? n

CondaSystemExit: Exiting.

Instead I started an R shell and did this, which was smooth sailing and solved that complaint:

install.packages("BiocManager")
BiocManager::install("qvalue")

And then this is what happened, which is the error you are stuck with :)

Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Category ....................................................: light
Functional annotation source ................................: COG_FUNCTION
Exclude ungrouped ...........................................: False
Functional occurrence summary ...............................: /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp0w795f9d

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:
              /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpzjniu0xx

(base) (anvio-master) meren ~/Downloads/TEST-PACKAGE $ cat /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpzjniu0xx
# DATE: 04 Oct 19 10:21:09
# CMD LINE: anvi-run-enrichment-analysis.R --input /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp0w795f9d --output Functional_enrichment_2_groups.txt
Parsed with column specification:
cols(
  COG_FUNCTION = col_character(),
  function_accession = col_character(),
  gene_clusters_ids = col_character(),
  associated_groups = col_character(),
  p_HL = col_double(),
  p_LL = col_double(),
  N_HL = col_double(),
  N_LL = col_double()
)
Error: Column `function_accession` can't be modified because it's a grouping variable
Execution halted

So it is reproducible!

ShaiberAlon commented 5 years ago

Thank you @meren!

@adw96 , it is officially not only me.... HELP??

meren commented 5 years ago

HELP??

I read it like this:

Poor @adw96. She has a million things to do. DON'T WORRY, AMY, WE WILL BE FINE :')))))

mooreryan commented 5 years ago

Would it be possible to post just the input file used for this command?

anvi-run-enrichment-analysis.R --input /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp0w795f9d --output Functional_enrichment_2_groups.txt

ShaiberAlon commented 5 years ago

Hi @mooreryan ,

Here is the input: functional_enrichment_input_2_groups.txt

We have now merged this to master, so if you are on master, you can do:

anvi-run-enrichment-analysis.R --input functional_enrichment_input_2_groups.txt --output output-file

Thank you!

mooreryan commented 5 years ago

Interesting...so I just tried running the command with the data you sent and it ran without error.

meren commented 5 years ago

It runs everywhere except our lab. It is time, guys.

[person.fire() for person in http://merenlab.org/people]

mooreryan commented 5 years ago

It may be something weird in your R dependencies...could you list the packages and the versions that are currently loaded?

Here is a little Rscript that you can run which will load the same packages as anvi-run-enrichment-analysis.R and then tell you which versions of all the packages that it loads: https://gist.github.com/mooreryan/b0bbb7388c14324b7b2fed2612a7a362

meren commented 5 years ago

Thank you, @mooreryan :) Here is the output:

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract

"Package","Version"
"qvalue","2.16.0"
"magrittr","1.5"
"forcats","0.4.0"
"stringr","1.4.0"
"dplyr","0.8.3"
"purrr","0.3.2"
"readr","1.3.1"
"tidyr","1.0.0"
"tibble","2.1.3"
"ggplot2","3.2.1"
"tidyverse","1.2.1"
"optparse","1.6.4"
"stats","3.6.1"
"graphics","3.6.1"
"grDevices","3.6.1"
"utils","3.6.1"
"datasets","3.6.1"
"methods","3.6.1"
"base","3.6.1"

mooreryan commented 5 years ago

You do have a couple of different libs than I do.

dplyr: You have 0.8.3, I have 0.8.1. Inside a running Docker container, I upgraded to 0.8.3, but it still worked for me, so it probably isn't dplyr.
tidyr: You have 1.0.0, I have 0.8.3. When I upgraded to 1.0.0, the script broke and I got the same error as you all did!

Will continue checking the others.
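The comparison above boils down to diffing two package-version tables and then testing each difference in isolation. A small sketch of the diffing step, using only versions reported in this thread; `parse_version_table` and `differing_versions` are hypothetical helper names.

```python
import csv
import io


def parse_version_table(text):
    """Parse '"Package","Version"' CSV text into a {package: version} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Package"]: row["Version"] for row in reader}


def differing_versions(mine, theirs):
    """Return {package: (my_version, their_version)} for packages that
    appear in both environments with different versions."""
    return {name: (mine[name], theirs[name])
            for name in sorted(mine.keys() & theirs.keys())
            if mine[name] != theirs[name]}


# Subset of the failing environment's table, as posted above.
failing = parse_version_table(
    '"Package","Version"\n"dplyr","0.8.3"\n"tidyr","1.0.0"\n"qvalue","2.16.0"\n')
# The two versions reported for the working environment.
working = {"dplyr": "0.8.1", "tidyr": "0.8.3"}

print(differing_versions(failing, working))
```

Packages present in only one table are ignored here, which is a simplification; a fuller comparison would report those as well.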

mooreryan commented 5 years ago

Yep, so I have discovered that the problem is caused by something in the tidyr package that changed somewhere between 0.8.3 and 1.0.0.

mooreryan commented 5 years ago

Okay, so I've figured out the problem. It's here: https://github.com/merenlab/anvio/blob/2c9179f037b43502901582ad7eea61f5dbfc3131/bin/anvi-run-enrichment-analysis.R#L81

If you're running version 1.0.0 of tidyr, then that line needs to change to nest_legacy %>%, but if you are running a version below that, it needs to stay nest. That function got new syntax in the update.
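The version rule described above can be written down as a tiny decision function. This is only an illustration in Python of the selection logic; the real fix lives in the R script, and `nest_function_for` is a hypothetical name.

```python
def nest_function_for(tidyr_version):
    """Return the name of the tidyr function to call for a version string.

    tidyr 1.0.0 and later keep the old behavior under nest_legacy;
    earlier versions only have nest.
    """
    major = int(tidyr_version.split(".")[0])
    return "nest_legacy" if major >= 1 else "nest"


for version in ("0.8.3", "1.0.0"):
    print(version, nest_function_for(version))
```

Keying on the major version alone matches the message printed in a log later in this thread ("tidyr major version >= 1.  Using nest_legacy.").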

meren commented 5 years ago

OMG! Awesome, @mooreryan!!! Thank you VERY much :)

I think we should switch to nest_legacy, and ask people to update their tidyr versions if we hear about this.

@ShaiberAlon, @adw96, is this agreeable?

mooreryan commented 5 years ago

Actually, I'm about to open a pull request addressing this....

meren commented 5 years ago

:+1:

meren commented 5 years ago

By the way, I tested Ryan's solution in #1249:

(anvio-master) (base) meren ~/Downloads/TEST-PACKAGE $ anvi-get-enriched-functions-per-pan-group -p PAN.db \
>                                           -g GENOMES.db \
>                                           -o Functional_enrichment_2_groups.txt \
>                                           --category-variable light \
>                                           --annotation-source COG_FUNCTION
Genomes storage .............................................: Initialized (storage hash: hash0cde9439)
Num genomes in storage ......................................: 31
Num genomes will be used ....................................: 31
Pan DB ......................................................: Initialized: PAN.db (v. 13)
Gene cluster homogeneity estimates ..........................: Functional: [NO]; Geometric: [NO]; Combined: [NO]

* Gene clusters are initialized for all 7383 gene clusters in the database.

Category ....................................................: light
Functional annotation source ................................: COG_FUNCTION
Exclude ungrouped ...........................................: False
Functional occurrence summary ...............................: /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpv_fkk210
Functional enrichment summary log file: .....................: /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmplce2sbn7
Functional enrichment summary ...............................: Functional_enrichment_2_groups.txt

🥇

adw96 commented 5 years ago

Dear @meren @ShaiberAlon @mooreryan

amy.isback()

I'm OOO for 3 days and this is what happens?! Unbelievable...

@mooreryan -- great work finding and solving this -- thank you. I didn't know nest changed between tidyr versions. I am shocked to find that nest <- nest_legacy is the tidyverse-sanctioned solution, but since it is, and you are an outstanding programmer, I agree this is the right fix (ref #1249).

If we have future problems with this approach I will rethink this (for future @adw96 : look at pack and chop), but for now I think this is great. tidyr 1.0.0 was released less than a month ago, so this was a dangerous time to use a previously very stable package. MEREN I PROMISE THIS WILL NEVER HAPPEN AGAIN PLEASE DON'T FIRE ME

Amy

adw96 commented 5 years ago

TODO(Amy) Does being explicit about the order of input arguments solve ambiguities between nest and nest_legacy? Investigate.

meren commented 5 years ago

If there is anyone to fire it is always me! :)

In fact we were discussing yesterday how much we appreciate working with you, @mooreryan, @xvazquezc, and others who are willing to share their expertise with us. We are very thankful for your time.

ShaiberAlon commented 5 years ago

Re-opening this issue, because I now get an error when I run the functional enrichment on the multiple-group (i.e. more than two groups) example from our pangenomic tutorial.

When running this:

anvi-get-enriched-functions-per-pan-group -p PROCHLORO/Prochlorococcus_Pan-PAN.db \
                                          -g PROCHLORO-GENOMES.db \
                                          --category clade \
                                          --annotation-source COG_FUNCTION \
                                          -o PROCHLORO-PAN-enriched-functions-clade.txt \
                                          --functional-occurrence-table-output PROCHLORO-functions-occurrence.txt

I get this:

Config Error: It looks like something went wrong during the functional enrichment analysis. We
              don't know what happened, but this log file could contain some clues:
              /var/folders/4n/gwkhlcx13cg04n64tybzyshr0000gn/T/tmp5lrmcx9n

And here is the aforementioned log file (/var/folders/4n/gwkhlcx13cg04n64tybzyshr0000gn/T/tmp5lrmcx9n):

# DATE: 16 Oct 19 07:36:36
# CMD LINE: anvi-script-run-functional-enrichment-stats --input /var/folders/4n/gwkhlcx13cg04n64tybzyshr0000gn/T/tmp5xfe51yd --output PROCHLORO-PAN-enriched-functions-clade.txt
Warning message:
package ‘optparse’ was built under R version 3.5.1
Warning messages:
1: package ‘ggplot2’ was built under R version 3.5.1
2: package ‘tibble’ was built under R version 3.5.1
3: package ‘tidyr’ was built under R version 3.5.1
4: package ‘readr’ was built under R version 3.5.1
5: package ‘purrr’ was built under R version 3.5.1
6: package ‘dplyr’ was built under R version 3.5.1
7: package ‘stringr’ was built under R version 3.5.1
tidyr major version >= 1.  Using nest_legacy.
Parsed with column specification:
cols(
  COG_FUNCTION = col_character(),
  function_accession = col_character(),
  gene_clusters_ids = col_character(),
  associated_groups = col_character(),
  p_LL_IV = col_double(),
  p_HL_I = col_double(),
  p_LL_III = col_double(),
  p_LL_II = col_double(),
  p_LL_I = col_double(),
  p_HL_II = col_double(),
  N_LL_IV = col_double(),
  N_HL_I = col_double(),
  N_LL_III = col_double(),
  N_LL_II = col_double(),
  N_LL_I = col_double(),
  N_HL_II = col_double()
)
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 15984 rows:
* 7993, 10657, 11989, 13321
* 7994, 10658, 11990, 13322
* 7995, 10659, 11991, 13323
* 7996, 10660, 11992, 13324
* 7997, 10661, 11993, 13325
* 7998, 10662, 11994, 13326
* 7999, 10663, 11995, 13327
* 8000, 10664, 11996, 13328
* 8001, 10665, 11997, 13329
* 8002, 10666, 11998, 13330
* 8003, 10667, 11999, 13331
* 8004, 10668, 12000, 13332
* 8005, 10669, 12001, 13333
* 8006, 10670, 12002, 13334
* 8007, 10671, 12003, 13335
* 8008, 10672, 12004, 13336
* 8009, 10673, 12005, 13337
* 8010, 10674, 12006, 13338
* 8011, 10675, 12007, 13339
* 8012, 10676, 12008, 13340
* 8013, 10677, 12009, 13341
* 8014, 10678, 12010, 13342
* 8015, 10679, 12011, 13343
* 8016, 10680, 12012, 13344
* 8017, 10681, 12013, 13345
* 8018, 10682, 12014, 13346
* 8019, 10683, 12015, 13347
* 8020, 10684, 12016, 13348
* 8021, 10685, 12017, 13349
* 8022, 10686, 12018, 13350
* 8023, 10687, 12019, 13351
* 8024, 10688, 12020,
In addition: Warning message:
Expected 2 pieces. Additional pieces discarded in 15984 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Execution halted

Here is the input file for the R script (i.e. the aforementioned /var/folders/4n/gwkhlcx13cg04n64tybzyshr0000gn/T/tmp5xfe51yd):

functional_enrichment_input_5_groups.txt

So you can run it like this:

anvi-script-run-functional-enrichment-stats --input functional_enrichment_input_5_groups.txt \
                                            --output functional_enrichment_output_5_groups.txt

Notice that this five group example is pretty bad, because one of the groups has only one member, but I also tested this with the following input:

COG_FUNCTION    function_accession  gene_clusters_ids   associated_groups   p_LL_IV p_HL_I  p_LL_III    p_LL_II p_LL_I  p_HL_II N_LL_IV N_HL_I  N_LL_III    N_LL_II N_LL_I  N_HL_II
Deoxyribose-phosphate aldolase  COG0274 GC_00001115, GC_00002224, GC_00003647, GC_00003952      1   1   1   1   1   1   10  20  20  10  25  17
function2   FAKE_ID GC_00001115, GC_00002224, GC_00003647, GC_00003952      1   0.1 0   0.2 1   1   10  20  20  10  25  17

(small_multi_group_example.txt)

And this:

anvi-script-run-functional-enrichment-stats --input small_multi_group_example.txt \
                                            --output small_output.txt

And it also fails in the same way.

THIS IS MY FAULT FOR NOT TESTING WITH THE MULTIPLE GROUPS AFTER THE CHANGES. SORRY!

ShaiberAlon commented 5 years ago

I found the problem: the group names have the format LL_I, LL_II, etc., and the _ is messing up the way the R script parses group names. If I remove the _, things work. For example:

COG_FUNCTION    function_accession  gene_clusters_ids   associated_groups   p_LLIV  p_HLI   p_LLIII p_LLII  p_LLI   p_HLII  N_LLIV  N_HLI   N_LLIII N_LLII  N_LLI   N_HLII
Deoxyribose-phosphate aldolase  COG0274 GC_00001115, GC_00002224, GC_00003647, GC_00003952      1   1   1   1   1   1   10  20  20  10  25  17
function2   FAKE_ID GC_00001115, GC_00002224, GC_00003647, GC_00003952      1   0.1 0   0.2 1   1   10  20  20  10  25  17

(small_multi_group_example_fixed.txt)

Run:

anvi-script-run-functional-enrichment-stats --input small_multi_group_example_fixed.txt \
                                            --output small_output.txt

Works!

So we need to fix this. The test should definitely be ok with group names that have _ in them. We should also be explicit about what is allowed (for example, if we are not ok with spaces then we should mention that, and I can add a sanity check in the Python part to see if names are illegal and raise a useful error).

@mooreryan , @adw96 , if one of you has a chance to take a look and fix this, I would greatly appreciate that! My R fluency is not good enough for that...
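A minimal sketch of the Python-side sanity check proposed above. The allowed character set (letters, digits, underscores) is an assumption made for illustration, since the thread does not settle which names are legal, and `find_illegal_group_names` and `check_group_names` are hypothetical names.

```python
import re

# Assumption: letters, digits, and underscores are fine; anything else
# (spaces included) is rejected. The thread does not fix the actual rule.
VALID_GROUP_NAME = re.compile(r"[A-Za-z0-9_]+")


def find_illegal_group_names(names):
    """Return the names that contain characters outside the allowed set."""
    return [name for name in names if not VALID_GROUP_NAME.fullmatch(name)]


def check_group_names(names):
    """Raise a descriptive error if any group name is not acceptable."""
    illegal = find_illegal_group_names(names)
    if illegal:
        raise ValueError(
            "These group names contain characters that are not allowed "
            "(only letters, digits, and '_' are): %s" % ", ".join(illegal))


check_group_names(["LL_I", "LL_II", "HL_I"])  # passes silently
```

Running the check before the R script is called would turn a cryptic R failure into an error that names the offending groups.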

mooreryan commented 5 years ago

I see the problem. It's in this line: https://github.com/merenlab/anvio/blob/df0a36849a24f5af29a18f7f9a0495d791fe1493/sandbox/anvi-script-run-functional-enrichment-stats#L127

It's not separating the type column correctly, because it assumes a single _ separates type and group. But in the original data, the group names themselves contain an _.
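To illustrate why extra underscores break the split, here is a short Python sketch using the column names from the log above. It is not the R code from the script; `split_truncating` mimics a split that keeps only the first two pieces (consistent with the "Expected 2 pieces. Additional pieces discarded" warning), and `split_on_first` splits on the first underscore only. Both function names are hypothetical.

```python
from collections import Counter

columns = ["p_LL_IV", "p_HL_I", "p_LL_III", "p_LL_II", "p_LL_I", "p_HL_II"]


def split_truncating(name):
    """Split on every underscore and keep only the first two pieces."""
    pieces = name.split("_")
    return pieces[0], pieces[1]


def split_on_first(name):
    """Split on the first underscore only, so the group name stays whole."""
    kind, group = name.split("_", 1)
    return kind, group


truncated = Counter(split_truncating(name) for name in columns)
intact = Counter(split_on_first(name) for name in columns)

# With truncation, several columns collapse onto the same (type, group) key.
print(truncated)
# Splitting on the first underscore keeps every key distinct.
print(intact)
```

With truncation, the four LL columns all map to ("p", "LL") and the two HL columns to ("p", "HL"), which would explain an error about rows that are no longer identified by a unique combination of keys.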

ShaiberAlon commented 5 years ago

This indeed solved it. Thank you very much @mooreryan !

merenlab / anvio

Error when running functional enrichment #1248
