Clustering error due to invalid database read

adriaaula commented 5 years ago

Expected Behavior

After a plass assembly, the clustering of the various coassemblies.

mmseqs linclust -c 0.9 --min-seq-id 0.95 --cluster-mode 0 \
                                      ${output}/plass_assembly_noclust \
                                      ${output}/plass_ass_clust \
                                      $scratch

Current Behavior

Raises an error stating the following:

Invalid database read for database data file=data/02_geneclust/plass_assembly_noclust, database index=data/02_geneclust/plass_assembly_noclust.index
Size of data: 132820104774
Requested offset: 6192668722089269
Invalid database read for database data file=data/02_geneclust/plass_assembly_noclust, database index=data/02_geneclust/plass_assembly_noclust.index
Size of data: 132820104774
Requested offset: 13511065749901201
Invalid database read for database data file=data/02_geneclust/plass_assembly_noclust, database index=data/02_geneclust/plass_assembly_noclust.index

Steps to Reproduce (for bugs)

Initial creation of the database with createdb:

mmseqs createdb data/plass_assembly/whole_assembly.faa \
                ${output}/plass_assembly_noclust

MMseqs Output (for bugs)

Error raised: https://gist.github.com/adriaaula/92582186d55e5a8575420c6e8f8ce7e1

General output: https://gist.github.com/adriaaula/ed7062dffd5f7ca2f498bed2fa3217db

Context

Clustering of a big fasta with 4 independent coassemblies of ~8 samples each. 250 Gb of RAM and 48 cpus.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used: d36dea
Which MMseqs version was used: statically compiled (Statically-compiled, self-compiled, Homebrew, etc.): statically compiled
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):AVX2 supported
Operating system and version:

milot-mirdita commented 5 years ago

Could you run something like this to check if the sequence database is in someway unusual:

awk 'BEGIN { min = 2^32; } $3 < min { min = $3 }; $3 > max { max = $3 } { sum = sum + $3; n = n + 1; } END { print sum/n,min,max;  }' ${output}/plass_assembly_noclust.index

We might have an issue with our split logic for databases larger than the available system memory.

Edit: Do the same for the offset column too please:

awk 'BEGIN { min = 2^32; } $2 < min { min = $2 }; $2 > max { max = $2 } { sum = sum + $2; n = n + 1; } END { print sum/n,min,max;  }' ${output}/plass_assembly_noclust.index

milot-mirdita commented 5 years ago

I found the issue, our split logic is indeed faulty and corrupting unrelated memory. We'll discuss the issue and try to resolve it.

adriaaula commented 5 years ago

Thank you for the fast response! With the last version of mmseqs2 (7-4e23d) I was able to process this same dataset with the only difference being --cluster-mode 2 instead of --cluster-mode 0.

Maybe this helps solving the problem. Have a good weekend! I am Sending the awk output monday

milot-mirdita commented 5 years ago

Thanks, I don't need the output of the awk call anymore, since I tracked down whats going wrong. It just a bit tricky to fix correctly and not waste a lot of RAM in doing so.

martin-steinegger commented 5 years ago

I have pushed one fix that might already fix this error. But I will further improve the code.

adriaaula commented 5 years ago

Thanks a lot! Tomorrow I try it out 👍

adriaaula commented 5 years ago

So far so good. The program is running smoothly and the kmermatcher step is ended without errors.

kmermatcher data/02_geneclust/plass_assembly_noclust /mnt/lustre/scratch/aauladell/9518450204626139512/pref --sub-mat blosum62.out --alph-size 13 --min-seq-id 0.95 --kmer-per-seq 21 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-m
ode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --skip-n-repeat-kmer 0 --threads 48 --compressed 0 -v 3

Database size: 823952135 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)

Estimated memory consumption 263893 MB
Generate k-mers list for 1 split
[=================================================================] 823.95M 37m 20s 931ms
Sort kmer 0h 11m 29s 726ms
Sort by rep. sequence 0h 2m 46s 392ms
Time for fill: 0h 3m 48s 1ms
Time for merging files: 0h 4m 55s 350ms
Time for processing: 1h 13m 27s 843ms

Thanks!

martin-steinegger commented 5 years ago

Thank you for check it again :)

nick-youngblut commented 5 years ago

I seem to be getting the same error when running linclust, but the error seems to occur during the rescorediagonal step. The log from my run:

rescorediagonal /tmp/inclust/genes_db /tmp/genes_db /tmp/linclust_tmp/12839115596035141496/pref /tmp/linclust_tmp/12839115596035141496/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --global-alignment 0 --db-load-mode 0 --threads 24 --compressed 0 -v 3

[Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Error: Rescore with hamming distance step died

My linclust job is the following:

mmseqs linclust   --threads 24 --min-seq-id 0.90 -c 0.9 --kmer-per-seq 50           /tmp/linclust/genes_db /tmp/linclust/clusters_db /tmp/linclust_tmp/

I'm using mmseqs2 9.d36de h76f5088_0 bioconda

martin-steinegger commented 5 years ago

@nick-youngblut could you please try to run your example on the newest release?

nick-youngblut commented 5 years ago

Yep, that fixed the problem. Sorry to bug you about old versions of mmseq2!

nick-youngblut commented 5 years ago

At least for one dataset that I'm working on, I seem to still be getting this error even though I'm using MMseqs2 Version: 10.6d92c:

rescorediagonal /tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db /tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db /tmp/global/nyoungblut/LLMGAG_274879977417/linclust_tmp//7142606106325398954/pref /tmp/global/nyoungblut/LLMGAG_274879977417/linclust_tmp//7142606106325398954/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 24 --compressed 0 -v 3

[Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Segmentation fault
Error: Rescore with hamming distance step died

For other metagenome datasets, this version of mmseqs2 seems to be working fine.

milot-mirdita commented 5 years ago

Could you post the full log please? This error invalid read error usually means some step before went wrong.

nick-youngblut commented 5 years ago

That is the full log. The command that I ran was:

mmseqs linclust   --threads 24 --min-seq-id 0.90 -c 0.9 --kmer-per-seq 50           /tmp/global/nyoungblut/genes_db /tmp/global/nyoungblut/clusters_db /tmp/global/nyoungblut/linclust_tmp/        2>> linclust.log 1>&2

Note: I shorted the paths to make the command easier to read.

My conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
bzip2                     1.0.8                h516909a_1    conda-forge
ca-certificates           2019.9.11            hecc5488_0    conda-forge
curl                      7.65.3               hf8cf82a_0    conda-forge
gawk                      5.0.1                h516909a_0    conda-forge
htslib                    1.9                  h4da6232_3    bioconda
krb5                      1.16.3            h05b26f9_1001    conda-forge
libcurl                   7.65.3               hda55be3_0    conda-forge
libdeflate                1.2                  h516909a_1    bioconda
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libgcc-ng                 9.1.0                hdf63c60_0
libssh2                   1.8.2                h22169c7_2    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
mmseqs2                   10.6d92c             h2d02072_0    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1c               h516909a_0    conda-forge
pigz                      2.3.4                         0    conda-forge
plass                     2.c7e35              h21aa3a5_1    bioconda
samtools                  1.9                 h10a08f8_12    bioconda
seqtk                     1.3                  hed695b0_2    bioconda
tk                        8.6.9             hed695b0_1003    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

martin-steinegger commented 5 years ago

@nick-youngblut I can not see the log. Maybe it did not upload?

nick-youngblut commented 5 years ago

I guess that 2> linclust.log 1>&2 didn't actually capture all of the output. I rerun the command. There's the log: mmseqs_linclust.log

milot-mirdita commented 5 years ago

Something went wrong in the first kmermatcher step that produced an invalid database. Now every time you rerun the job it will take the previous result in the tmp folder and try to run it through rescorediagonal to compute the alignment.

Can you clear out the tmp folder, rerun the job and post the log? Can't learn much from the rescorediagonal call sadly.

soedinglab / MMseqs2