Closed adriaaula closed 5 years ago
Could you run something like this to check if the sequence database is in someway unusual:
awk 'BEGIN { min = 2^32; } $3 < min { min = $3 }; $3 > max { max = $3 } { sum = sum + $3; n = n + 1; } END { print sum/n,min,max; }' ${output}/plass_assembly_noclust.index
We might have an issue with our split logic for databases larger than the available system memory.
Edit: Do the same for the offset column too please:
awk 'BEGIN { min = 2^32; } $2 < min { min = $2 }; $2 > max { max = $2 } { sum = sum + $2; n = n + 1; } END { print sum/n,min,max; }' ${output}/plass_assembly_noclust.index
I found the issue, our split logic is indeed faulty and corrupting unrelated memory. We'll discuss the issue and try to resolve it.
Thank you for the fast response!
With the last version of mmseqs2 (7-4e23d) I was able to process this same dataset with the only difference being --cluster-mode 2
instead of --cluster-mode 0
.
Maybe this helps solving the problem.
Have a good weekend! I am Sending the awk
output monday
Thanks, I don't need the output of the awk call anymore, since I tracked down whats going wrong. It just a bit tricky to fix correctly and not waste a lot of RAM in doing so.
I have pushed one fix that might already fix this error. But I will further improve the code.
Thanks a lot! Tomorrow I try it out 👍
So far so good. The program is running smoothly and the kmermatcher step is ended without errors.
kmermatcher data/02_geneclust/plass_assembly_noclust /mnt/lustre/scratch/aauladell/9518450204626139512/pref --sub-mat blosum62.out --alph-size 13 --min-seq-id 0.95 --kmer-per-seq 21 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-m
ode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendable 0 --skip-n-repeat-kmer 0 --threads 48 --compressed 0 -v 3
Database size: 823952135 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X)
Estimated memory consumption 263893 MB
Generate k-mers list for 1 split
[=================================================================] 823.95M 37m 20s 931ms
Sort kmer 0h 11m 29s 726ms
Sort by rep. sequence 0h 2m 46s 392ms
Time for fill: 0h 3m 48s 1ms
Time for merging files: 0h 4m 55s 350ms
Time for processing: 1h 13m 27s 843ms
Thanks!
Thank you for check it again :)
I seem to be getting the same error when running linclust
, but the error seems to occur during the rescorediagonal
step. The log from my run:
rescorediagonal /tmp/inclust/genes_db /tmp/genes_db /tmp/linclust_tmp/12839115596035141496/pref /tmp/linclust_tmp/12839115596035141496/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --global-alignment 0 --db-load-mode 0 --threads 24 --compressed 0 -v 3
[Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Invalid database read for database data file=/tmp/linclust/genes_db, database index=/tmp/linclust/genes_db.index
getData: local id (4294967295) >= db size (1084240)
Error: Rescore with hamming distance step died
My linclust job is the following:
mmseqs linclust --threads 24 --min-seq-id 0.90 -c 0.9 --kmer-per-seq 50 /tmp/linclust/genes_db /tmp/linclust/clusters_db /tmp/linclust_tmp/
I'm using mmseqs2 9.d36de h76f5088_0 bioconda
@nick-youngblut could you please try to run your example on the newest release?
Yep, that fixed the problem. Sorry to bug you about old versions of mmseq2!
At least for one dataset that I'm working on, I seem to still be getting this error even though I'm using MMseqs2 Version: 10.6d92c
:
rescorediagonal /tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db /tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db /tmp/global/nyoungblut/LLMGAG_274879977417/linclust_tmp//7142606106325398954/pref /tmp/global/nyoungblut/LLMGAG_274879977417/linclust_tmp//7142606106325398954/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 24 --compressed 0 -v 3
[Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Invalid database read for database data file=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db, database index=/tmp/global/nyoungblut/LLMGAG_274879977417/linclust/genes_db.index
getData: local id (4294967295) >= db size (184470)
Segmentation fault
Error: Rescore with hamming distance step died
For other metagenome datasets, this version of mmseqs2 seems to be working fine.
Could you post the full log please? This error invalid read error usually means some step before went wrong.
That is the full log. The command that I ran was:
mmseqs linclust --threads 24 --min-seq-id 0.90 -c 0.9 --kmer-per-seq 50 /tmp/global/nyoungblut/genes_db /tmp/global/nyoungblut/clusters_db /tmp/global/nyoungblut/linclust_tmp/ 2>> linclust.log 1>&2
Note: I shorted the paths to make the command easier to read.
My conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 main
bzip2 1.0.8 h516909a_1 conda-forge
ca-certificates 2019.9.11 hecc5488_0 conda-forge
curl 7.65.3 hf8cf82a_0 conda-forge
gawk 5.0.1 h516909a_0 conda-forge
htslib 1.9 h4da6232_3 bioconda
krb5 1.16.3 h05b26f9_1001 conda-forge
libcurl 7.65.3 hda55be3_0 conda-forge
libdeflate 1.2 h516909a_1 bioconda
libedit 3.1.20170329 hf8c457e_1001 conda-forge
libgcc-ng 9.1.0 hdf63c60_0
libssh2 1.8.2 h22169c7_2 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
llvm-openmp 8.0.1 hc9558a2_0 conda-forge
mmseqs2 10.6d92c h2d02072_0 bioconda
ncurses 6.1 hf484d3e_1002 conda-forge
openmp 8.0.1 0 conda-forge
openssl 1.1.1c h516909a_0 conda-forge
pigz 2.3.4 0 conda-forge
plass 2.c7e35 h21aa3a5_1 bioconda
samtools 1.9 h10a08f8_12 bioconda
seqtk 1.3 hed695b0_2 bioconda
tk 8.6.9 hed695b0_1003 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
@nick-youngblut I can not see the log. Maybe it did not upload?
I guess that 2> linclust.log 1>&2
didn't actually capture all of the output. I rerun the command. There's the log: mmseqs_linclust.log
Something went wrong in the first kmermatcher
step that produced an invalid database. Now every time you rerun the job it will take the previous result in the tmp
folder and try to run it through rescorediagonal
to compute the alignment.
Can you clear out the tmp
folder, rerun the job and post the log? Can't learn much from the rescorediagonal
call sadly.
Expected Behavior
After a plass assembly, the clustering of the various coassemblies.
Current Behavior
Raises an error stating the following:
Steps to Reproduce (for bugs)
Initial creation of the database with
createdb
:MMseqs Output (for bugs)
Error raised: https://gist.github.com/adriaaula/92582186d55e5a8575420c6e8f8ce7e1
General output: https://gist.github.com/adriaaula/ed7062dffd5f7ca2f498bed2fa3217db
Context
Clustering of a big fasta with 4 independent coassemblies of ~8 samples each. 250 Gb of RAM and 48 cpus.
Your Environment
Include as many relevant details about the environment you experienced the bug in.