merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
426 stars 145 forks source link

Issue with single sample in a category with anvi-get-enriched-functions-per-pan-group #1511

Closed JuliaMcGonigle closed 3 years ago

JuliaMcGonigle commented 3 years ago

Short description of the problem

Maybe bug? Maybe you know about it already and this is not useful to report? Feel free to ignore if so.

The "categorical" layer I used with --category-variable had one sample which was the only sample for that group and the program broke with the following error:

Config Error: It looks like something went wrong during the functional enrichment analysis. We don't know what happened, but this log file could contain some clues: /tmp/tmpwirnsjl5

With the breaking point in /tmp/tmpwirnsjl5 stating:

Error: Problem with mutate() input adjusted_q_value. ✖ ERROR: The estimated pi0 <= 0. Check that you have valid p-values or use a different range of lambda. ℹ Input adjusted_q_value is ...$NULL. Backtrace: █

  1. ├─%>%(...)
  2. │ ├─base::withVisible(eval(quote(_fseq(_lhs)), env, env))
  3. │ └─base::eval(quote(_fseq(_lhs)), env, env)
  4. │ └─base::eval(quote(_fseq(_lhs)), env, env)
  5. │ └─_fseq(_lhs)
  6. │ └─magrittr::freduce(value, _function_list)
  7. │ └─function_list[i]
  8. │ ├─dplyr::mutate(...)
  9. │ └─dplyr:::mutate.data.frame(...)
    1. │ └─dplyr:::mutate_cols(.data, ...)
    2. │ ├─base::withCallingHandlers(...)
    3. │ └─mask$eval_all_mutate(dots[[i]])
    4. ├─qvalue::qvalue(...)
    5. │ └─qvalue::pi0est(p, ...)
    6. │ └─base::stop("ERROR: The estimated pi0 <= 0 Execution halted

I looked at the "Functional occurrence summary" file created and noticed the one sample had values for p_CATEGORYNAME of only 1s and 0s for each COG accession. When I removed the category for that one sample (so it became the only ungrouped), and re-ran using the --exclude-ungrouped flag, it ran fine with no errors. The one sample in the category was causing the program to break.

anvi'o version

Anvi'o version ...............................: esther (v6.2) Profile DB version ...........................: 31 Contigs DB version ...........................: 14 Pan DB version ...............................: 13 Genome data storage version ..................: 6 Auxiliary data storage version ...............: 2 Structure DB version .........................: 1

System info

Installed anvio with anaconda3. I am using Windows 10 Version 1909 with WSL Ubuntu 20.04.1 LTS (GNU/Linux 4.4.0-18362-Microsoft x86_64)

Files to reproduce

I can share files upon request.

meren commented 3 years ago

We certainly should not allow only a single sample associated with a category instead of a cryptic error like this :) @ivagljiva, can you please try reproducing this one?

Thanks for the report!

ivagljiva commented 3 years ago

I have attempted to reproduce this and failed, with both anvio-6.2 and anvi'o main. 😖

@JuliaMcGonigle, could you perhaps take a look below at the description of my test data/results and let me know if there is something I am missing to be able to reproduce this error? Or, if you feel it would be easier, you could share your files and then I could try reproducing it with your dataset. Thank you very much for your help!

Anvi'o version

(anvio-6.2) mithrandir:~/Lab/test-anvio-6.2 iva$ which anvi-interactive
/opt/miniconda3/envs/anvio-6.2/bin/anvi-interactive
(anvio-6.2) mithrandir:~/Lab/test-anvio-6.2 iva$ anvi-interactive -v
Anvi'o version ...............................: esther (v6.2)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

Note that I also tried with the development version of anvi'o, and that also worked without errors.

R version (within anvi'o)

I thought this might be pertinent info to include since it is R code that is throwing the error.

(anvio-6.2) mithrandir:~/Lab/test-anvio-6.2 iva$ which Rscript
/opt/miniconda3/envs/anvio-6.2/bin/Rscript
(anvio-6.2) mithrandir:~/Lab/test-anvio-6.2 iva$ Rscript --version
R scripting front-end version 3.5.1 (2018-07-02)

Obtaining test data

I decided to use a subset of the Infant Gut Dataset to test this. This is how to download the datapack:

wget https://ndownloader.figshare.com/files/18046139 -O INFANT-GUT-TUTORIAL.tar.gz
tar -zxvf INFANT-GUT-TUTORIAL.tar.gz && cd INFANT-GUT-TUTORIAL/additional-files/pangenomics/

And here is how I generated a small pangenome of 5 genomes:

head -n 6 external-genomes.txt > 5-external-genomes.txt
cd ../../
anvi-gen-genomes-storage -e INFANT-GUT-TUTORIAL/additional-files/pangenomics/5-external-genomes.txt -o FIVE-GENOMES.db
anvi-pan-genome -g FIVE-GENOMES.db -n FIVE-GENOMES-TEST -o PAN --num-threads 6

I made a file to assign groups, which looks like this:

(anvio-6.2) mithrandir:~/Lab/test-anvio-6.2 iva$ cat test_groups.txt
name    group
E_faecalis_6240 B
E_faecalis_6250 B
E_faecalis_6255 A
E_faecalis_6512 B
E_faecalis_6557 B

So it will assign 4 of the genomes to group 'B' and 1 of the genomes to group 'A'.

anvi-import-misc-data -p PAN/FIVE-GENOMES-TEST-PAN.db -t layers test_groups.txt

Attempt to reproduce

anvi-get-enriched-functions-per-pan-group -p PAN/FIVE-GENOMES-TEST-PAN.db -g FIVE-GENOMES.db --category-variable group --annotation-source COG_FUNCTION -o test_enriched_func.txt --functional-occurrence-table-output test_func_occurence.txt

I got no errors with this command.

Furthermore, if I take a look at the unique values in the p_A and N_A columns of the enrichment output:

iva$ cut -f 8,10 test_enriched_func.txt | uniq
p_A N_A
0   1
1   1

So it seems that group 'A' indeed has only 0s and 1s in the p_A column, and only 1s in the N_A column.

It is very likely that I am not replicating some critical circumstance that led to this error. I hope we can find it.

JuliaMcGonigle commented 3 years ago

Hello,

Since posting this I had a Windows update painfully break my network adapter that could only be solved by a hard reset that also removed WSL and all files in those directories, which I stupidly did not backup first. So I sadly don't have all of the OG files anymore. I can give you some more info however.

For additional information both the genome storage and pan databases were created on a server with an older version on Anvio: Anvi'o version ...............................: margaret (vunknown) Profile DB version ...........................: 30 Contigs DB version ...........................: 12 Pan DB version ...............................: 12 Genome data storage version ..................: 6 Auxiliary data storage version ...............: 2 Structure DB version .........................: 1

I moved both onto my machine with the 6.2 version mentioned above and migrated the pan.db from version 12 to 13 before running anvi-import-misc-data and anvi-get-enriched-functions-per-pan-group.

I think I had a different R version than you (although hard to say as WSL was reinstalled since getting this error), currently it is: (anvio-6.2) jmcgonigle@JuliaPC:~$ which Rscript home/jmcgonigle/miniconda3/envs/anvio-6.2/bin/Rscript (anvio-6.2) jmcgonigle@JuliaPC:~$ Rscript --version R scripting front-end version 3.6.1 (2019-07-05)

The genomes storage and pangenome database files that I had initially created on the server were still there. I moved them off and tried to recreate the error for you guys but also could not. It ran just fine, as with your gut microbiome genomes. The only major difference between this run and the older one was the file I used to assign groups. I had to create a new one for this. I bet something was wrong with that OG file? Unfortunately it got deleted with the Windows reinstall. :(

The only other (minor) difference is that WSL, Ubuntu, and Anvio (and conda) were all reinstalled after the OS reset. The Anvio I am currently using was installed with miniconda3, whereas previously when getting the error I had installed using the full mega conda. WSL/Ubuntu are the same versions. Sorry I can't be more helpful here. Stupid Windows updates...

ivagljiva commented 3 years ago

Ah, that is sad to hear @JuliaMcGonigle. Stupid Windows indeed... 😞

Well, it seems we will probably not be able to reproduce the error and find out where it is coming from, so unfortunately there is not much I can do to fix it just yet. We will have to wait and see if it comes up again. I will close this issue for now, but if you (or anyone else) finds it again in the future, feel free to re-open it.

In the meantime, I think I will add some sort of warning that alerts users when they have too few samples in a category. After all, the description of this program in the pangenomics tutorial comes with the following warning:

Just remember that if some of your groups have very few genomes in them, then the statistical test will not be very reliable. The minimal number of genomes in a group for the test to be reliable depends on a number of factors, but we recommend proceeding with great caution if any of your groups have fewer than 8 genomes.

but there is nothing in the codebase to warn the user if they go down this path without realizing the consequences.