soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs taxonomy: creating self-directed symlink of _h file #188

Closed nick-youngblut closed 5 years ago

nick-youngblut commented 5 years ago

When I run mmseqs taxonomy, it converts the _h file for the input sequence db from a standard file to a symlink that points at itself. So the symlink is then broken, and mmseqs taxonomy fails. I'm using a different temporary directory for mmseqs taxonomy than where the _h file is, so that shouldn't be the problem.
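
A quick way to confirm this condition from the shell is sketched below. It is a minimal check, not part of MMseqs2: the function name `check_db_file` and the example path `seqDB_h` are hypothetical. It relies only on `test -L` (is the path a symlink?) and `test -e` (does the path resolve?), which fails for dangling or self-referential links.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a database file as a regular file,
# a working symlink, a broken symlink, or missing.
check_db_file() {
    local path="$1"
    if [ -L "$path" ]; then
        # -e follows symlinks, so it fails when the link is dangling
        # or points back at itself.
        if [ -e "$path" ]; then
            echo "symlink-ok"
        else
            echo "symlink-broken (target: $(readlink "$path"))"
        fi
    elif [ -e "$path" ]; then
        echo "regular"
    else
        echo "missing"
    fi
}

# Example call with a hypothetical database prefix:
# check_db_file "seqDB_h"
```

Running the check before and after `mmseqs taxonomy` would show whether the `_h` file changed from a regular file to a broken link.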

mmseqs version: 8.fac81

conda env

```
# Name            Version        Build            Channel
bzip2             1.0.6          h14c3975_1002    conda-forge
ca-certificates   2019.3.9       hecc5488_0       conda-forge
curl              7.64.1         hf8cf82a_0       conda-forge
gawk              4.2.1          h14c3975_1001    conda-forge
krb5              1.16.3         h05b26f9_1001    conda-forge
libcurl           7.64.1         hda55be3_0       conda-forge
libdeflate        1.0            h14c3975_1       bioconda
libedit           3.1.20170329   hf8c457e_1001    conda-forge
libgcc-ng         8.2.0          hdf63c60_1
libssh2           1.8.2          h22169c7_2       conda-forge
libstdcxx-ng      8.2.0          hdf63c60_1
llvm-openmp       8.0.0          hc9558a2_0       conda-forge
mmseqs2           8.fac81        hf3e9acd_1       bioconda
ncurses           6.1            hf484d3e_1002    conda-forge
openmp            8.0.0          0                conda-forge
openssl           1.1.1b         h14c3975_1       conda-forge
pigz              2.3.4          0                conda-forge
plass             2.c7e35        h21aa3a5_1       bioconda
samtools          1.9            h8571acd_11      bioconda
seqtk             1.3            h84994c4_1       bioconda
tk                8.6.9          h84994c4_1001    conda-forge
xz                5.2.4          h14c3975_1001    conda-forge
zlib              1.2.11         h14c3975_1004    conda-forge
```

conda info

```
     active environment : /ebio/abt3_projects/software/dev/llmgag/.snakemake/conda/6345f887
    active env location : /ebio/abt3_projects/software/dev/llmgag/.snakemake/conda/6345f887
            shell level : 2
       user config file : /ebio/abt3/nyoungblut/.condarc
 populated config files : /ebio/abt3_projects/software/dev/miniconda3_dev/.condarc
                          /ebio/abt3/nyoungblut/.condarc
          conda version : 4.6.11
    conda-build version : 3.11.0
         python version : 3.6.7.final.0
       base environment : /ebio/abt3_projects/software/dev/miniconda3_dev (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/linux-64
                          https://conda.anaconda.org/bioconda/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://conda.anaconda.org/leylabmpi/linux-64
                          https://conda.anaconda.org/leylabmpi/noarch
                          https://conda.anaconda.org/r/linux-64
                          https://conda.anaconda.org/r/noarch
                          https://conda.anaconda.org/qiime2/linux-64
                          https://conda.anaconda.org/qiime2/noarch
          package cache : /ebio/abt3_projects/software/dev/miniconda3_dev/pkgs
                          /ebio/abt3/nyoungblut/.conda/pkgs
       envs directories : /ebio/abt3_projects/software/dev/miniconda3_dev/envs
                          /ebio/abt3/nyoungblut/.conda/envs
               platform : linux-64
             user-agent : conda/4.6.11 requests/2.18.4 CPython/3.6.7 Linux/4.9.127 ubuntu/18.04.1 glibc/2.27
                UID:GID : 6354:350
             netrc file : None
           offline mode : False
```

nick-youngblut commented 5 years ago

To be clear, as far as I can tell, mmseqs taxonomy is completely unusable due to this bug. I'm surprised others have not commented on this earlier. I've reproduced this error multiple times, so it's not stochastic.

martin-steinegger commented 5 years ago

@nick-youngblut I have added a taxonomy regression test to our test suite. I could not reproduce your error, but we found a critical error, caused by multithreading, in one of the modules involved in the 2bLCA search. This issue should be fixed in the main branch. Could you try to run the regression?

 git clone https://bitbucket.org/martin_steinegger/mmseqs-benchmark
 cd mmseqs-benchmark
 ./run_regression.sh mmseqs resultFolder

nick-youngblut commented 5 years ago

@martin-steinegger sorry for the delay. I ran the regression (using mmseqs2 8.fac81 hf3e9acd_1 bioconda), and it appears that some tests failed. The end of the test output:

Tmp resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/tmp folder does not exist or is not a directory.
Created dir resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/tmp
Program call:
extractorfs resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/targetannotation_nucl resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/tmp/4434917762398107271/orfs --min-length 30 --max-length 98202 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --use-all-table-starts 0 --id-offset 0 --threads 80 --compressed 0 -v 3

No datafile could be found for resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/targetannotation_nucl_h!
Error: extractorfs died
Command exited with non-zero status 1
40.25user 1.33system 0:02.64elapsed 1570%CPU (0avgtext+0avgdata 178744maxresident)k
154744inputs+244552outputs (605major+33470minor)pagefaults 0swaps

LINSEARCH_NUCLNUCL_TARNS_SEARCH
TEST FAILED (NO REPORT)

DBPROFILE_INDEX
TEST FAILED (NO REPORT)

NUCLPROTTAX_SEARCH
TEST FAILED (NO REPORT)

PROTNUCL_SEARCH
TEST FAILED (NO REPORT)

EASY_LINCLUST
TEST SUCCESS
GOOD
Expected:  26523
Actual:  26523

LINCLUST
TEST SUCCESS
GOOD
Expected:  26523
Actual:  26523

EASY_CLUSTER
TEST SUCCESS
GOOD
Expected:  15682
Actual:  15682

CLUSTER
TEST SUCCESS
GOOD
Expected:  15682
Actual:  15682

NUCLNUCL_TRANS_SEARCH
TEST FAILED (NO REPORT)

NUCLNUCL_SEARCH
TEST FAILED (NO REPORT)

NUCLPROT_SEARCH
TEST FAILED (NO REPORT)

DBPROFILE
TEST SUCCESS
GOOD
Expected:  0.142
Actual:  0.182019

SLICEPROFILE
TEST SUCCESS
GOOD
Expected:  0.140
Actual:  0.147729

EASY_PROFILE
TEST SUCCESS
GOOD
Expected:  0.334
Actual:  0.338768

PROFILE
TEST FAILED
BAD
Expected:  0.367
Actual:  0.324652

EASY_SEARCH
TEST SUCCESS
GOOD
Expected:  0.235
Actual:  0.238355

SEARCH
TEST SUCCESS
GOOD
Expected:  0.235
Actual:  0.238355
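
For long reports like the one above, the failing test names can be pulled out of a saved log with a short filter. This is only a sketch: the function name `failed_tests` and the file name `regression.log` are hypothetical, and it assumes that each test name sits on the line directly before its `TEST FAILED` line, as in the output shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every failed test in a saved
# regression log. Assumes the test name is the line immediately before
# the "TEST FAILED" line.
failed_tests() {
    awk '/^TEST FAILED/ { print prev } { prev = $0 }' "$1"
}

# Example, assuming the output was saved to a file first:
# ./run_regression.sh mmseqs resultFolder > regression.log 2>&1
# failed_tests regression.log
```

Comparing the resulting lists from two runs makes it easier to see which tests changed status between versions.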

martin-steinegger commented 5 years ago

Ah yes, the bioconda version has some known issues. We have added quite a lot of testing in recent days and fixed many issues. Could you please try the most recent version? We will make a new release soon.

nick-youngblut commented 5 years ago

OK, I cloned from the master branch (MMseqs2 Version: d990a0fb4bba9193b8aadc699a614303a57792f2) and re-ran the tests. During the testing, the following warning/error kept appearing: No datafile could be found for resultFolder/NUCLPROTTAX_SEARCH/query_nucl_h!. Here's the tail of the output:

No datafile could be found for resultFolder/LINSEARCH_NUCLNUCL_TARNS_SEARCH/targetannotation_nucl_h!
Error: extractorfs died
Command exited with non-zero status 1
37.62user 1.04system 0:02.30elapsed 1676%CPU (0avgtext+0avgdata 57204maxresident)k
156904inputs+244464outputs (603major+36363minor)pagefaults 0swaps

LINSEARCH_NUCLNUCL_TARNS_SEARCH
TEST FAILED (NO REPORT)

DBPROFILE_INDEX
TEST SUCCESS
GOOD
Expected:  0.142
Actual:  0.197554

NUCLPROTTAX_SEARCH
TEST FAILED (NO REPORT)

PROTNUCL_SEARCH
TEST FAILED (NO REPORT)

EASY_LINCLUST
TEST SUCCESS
GOOD
Expected:  26523
Actual:  26523

LINCLUST
TEST SUCCESS
GOOD
Expected:  26523
Actual:  26523

EASY_CLUSTER
TEST FAILED
BAD
Expected:  15682
Actual:  15675

CLUSTER
TEST FAILED
BAD
Expected:  15682
Actual:  15675

NUCLNUCL_TRANS_SEARCH
TEST FAILED (NO REPORT)

NUCLNUCL_SEARCH
TEST FAILED (NO REPORT)

NUCLPROT_SEARCH
TEST FAILED (NO REPORT)

DBPROFILE
TEST SUCCESS
GOOD
Expected:  0.142
Actual:  0.182019

SLICEPROFILE
TEST SUCCESS
GOOD
Expected:  0.140
Actual:  0.147729

EASY_PROFILE
TEST SUCCESS
GOOD
Expected:  0.334
Actual:  0.338757

PROFILE
TEST SUCCESS
GOOD
Expected:  0.367
Actual:  0.367423

EASY_SEARCH
TEST SUCCESS
GOOD
Expected:  0.235
Actual:  0.238355

SEARCH
TEST SUCCESS
GOOD
Expected:  0.235
Actual:  0.238355

martin-steinegger commented 5 years ago

@nick-youngblut do you still encounter these self-directed symlinks?

nick-youngblut commented 5 years ago

@martin-steinegger I haven't encountered the problem recently, but I also haven't used mmseqs2 much recently. I am planning to use it more soon, so I can let you know. Is mmseqs2 updated on bioconda?