steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

conterminator frozen at stage rescorediagonal #17

Open tougai opened 3 years ago

tougai commented 3 years ago

Hi, I am trying to test conterminator on a very simple file to start with, but it freezes at the rescorediagonal stage. When I use the example files dna.fas and dna.mapping, everything works fine.

Here is my FASTA file, toto.fa:

>chr1
TCATGGCTATTTTCATAAAAAATGGGGGTTGTGTGGCCATTTATCATCGACTAGAGGCTC
ATAAACCTCACCCCACATATGTTTCCTTGCCATAGATTACATTCTTGGATTTCTGGTGGA
AACCATTTCTTGCTTAAAAACTCGTACGTGTTAGCCTTCGGTATTATTGAAAATGGTCAT
TCATGGCTATTTTTCGGCAAAATGGGGGTTGTGTGGCCATTGATCGTCGACCAGAGGCTC

My mapping file, toto.fa.taxidmapping:

chr1 4577
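A quick input sanity check, sketched in Python: it confirms that every FASTA header has a taxon ID in the mapping file. It assumes the mapping is whitespace-separated "identifier taxid" (as in the one-line file above) and that the identifier is the first word of the header line. Both are assumptions about the input layout, and the function names are illustrative only, not part of conterminator.

```python
import io

def fasta_ids(handle):
    """Yield the first whitespace-delimited word of every '>' header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]

def read_mapping(handle):
    """Parse 'identifier taxid' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(f"unexpected mapping line: {line!r}")
        mapping[fields[0]] = int(fields[1])
    return mapping

def unmapped_ids(fasta_handle, mapping_handle):
    """Return FASTA identifiers that have no entry in the mapping."""
    mapping = read_mapping(mapping_handle)
    return [seq_id for seq_id in fasta_ids(fasta_handle) if seq_id not in mapping]

# In-memory stand-ins for toto.fa and toto.fa.taxidmapping
fasta = io.StringIO(">chr1\nTCATGGCTATTTTCATAAAAAATGG\n")
mapping = io.StringIO("chr1 4577\n")
missing = unmapped_ids(fasta, mapping)
print(missing)  # an empty list means every header is mapped
```

For the two files shown here the list is empty, so a missing mapping entry does not look like the cause of the hang.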

My command line:

conterminator dna toto.fa toto.fa.taxidmapping out tmp

And the log:

Tmp tmp folder does not exist or is not a directory.
Create dir tmp
dna toto.fa toto.fa.taxidmapping out tmp

MMseqs Version:                         570993be7f5f31ee357183c9118bf3aa75575870
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           true
Alignment mode                          3
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0.9
Min. alignment length                   100
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     1000
Compositional bias                      0
Realign hits                            false
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Gap open cost                           5
Gap extension cost                      2
Threads                                 24
Compressed                              0
Verbosity                               3
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             5.7
K-mer size                              15
K-score                                 2147483647
Alphabet size                           21
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        false
Exact k-mer matching                    1
Mask residues                           0
Mask lower case residues                0
Minimum diagonal score                  25
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            2
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile e-value threshold               0.001
Use global sequence weighting           false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Omit consensus                          false
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1
Reverse frames                          1
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Number search iterations                1
Start sensitivity                       4
Search steps                            1
Run a seq-profile search in slice mode  false
Strand selection                        2
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  true
Database type                           0
Shuffle input database                  true
Createdb mode                           0
NCBI tax dump directory
Taxonomical mapping file
Blacklisted taxa                        10239,12908,28384,81077,11632,340016,61964,48479,48510
Compare across kingdoms                 (2||2157),4751,33208,33090,(2759&&!4751&&!33208&&!33090)

createdb toto.fa tmp/6246057436143434068/sequencedb

Converting sequences

Time for merging to sequencedb_h: 0h 0m 0s 116ms
Time for merging to sequencedb: 0h 0m 0s 115ms
Database type: Nucleotide
Time for merging to sequencedb.lookup: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 438ms
Tmp tmp/6246057436143434068/createtaxdb folder does not exist or is not a directory.
Create dir tmp/6246057436143434068/createtaxdb
createtaxdb tmp/6246057436143434068/sequencedb tmp/6246057436143434068/createtaxdb --tax-mapping-file toto.fa.taxidmapping -v 3

Download taxdump.tar.gz
2021-06-17 17:21:59 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [55403423/55403423] -> "-" [1]
Database created
Remove temporary files
tmp/6246057436143434068/createtaxdb/createindex.sh: line 58: [: : integer expression expected
splitsequence tmp/6246057436143434068/sequencedb tmp/6246057436143434068/db_rev_split --max-seq-len 1000 --sequence-overlap 0 --sequence-split-mode 1 --create-lookup 0 --threads 24 --compressed 1 -v 3

Time for processing: 0h 0m 0s 37ms
kmermatcher tmp/6246057436143434068/db_rev_split tmp/6246057436143434068/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 24 --compressed 0 -v 3

kmermatcher tmp/6246057436143434068/db_rev_split tmp/6246057436143434068/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 24 --compressed 0 -v 3

Database size: 1 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 100.00% 1 eta -

Adjusted k-mer length 24
Sort kmer 0h 0m 0s 0ms
Sort by rep. sequence 0h 0m 0s 0ms
Time for fill: 0h 0m 0s 1ms
Time for merging to pref: 0h 0m 0s 3ms
Time for processing: 0h 0m 0s 27ms
tmp/6246057436143434068/pref exists and will be overwritten.
crosstaxonfilterorf tmp/6246057436143434068/sequencedb tmp/6246057436143434068/db_rev_split_h tmp/6246057436143434068/pref tmp/6246057436143434068/pref_cross --blacklist 10239,12908,28384,81077,11632,340016,61964,48479,48510 --kingdoms (2||2157),4751,33208,33090,(2759&&!4751&&!33208&&!33090) --threads 24 -v 3

Loading NCBI taxonomy
Loading nodes file ... Done, got 2337439 nodes
Loading merged file ... Done, added 63224 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done
[=================================================================] 100.00% 1 eta -
Time for merging to pref_cross: 0h 0m 0s 41ms
Time for processing: 0h 0m 6s 156ms
rescorediagonal tmp/6246057436143434068/db_rev_split tmp/6246057436143434068/db_rev_split tmp/6246057436143434068/pref_cross tmp/6246057436143434068/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0 -a 1 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 100 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 24 --compressed 0 -v 3
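
For reference, the "Compare across kingdoms" setting in the log above lists five groups by NCBI taxon ID. The sketch below is one reading of that expression (|| as "or", && as "and", ! as "not", each number meaning "this taxon ID appears in the sequence's lineage"). It is not conterminator's implementation; the group labels are names chosen here, and the lineage for 4577 (Zea mays) is abbreviated by hand.

```python
# One rule per comma-separated group in:
# (2||2157),4751,33208,33090,(2759&&!4751&&!33208&&!33090)
KINGDOM_RULES = {
    "bacteria_or_archaea": lambda lin: 2 in lin or 2157 in lin,
    "fungi": lambda lin: 4751 in lin,
    "metazoa": lambda lin: 33208 in lin,
    "viridiplantae": lambda lin: 33090 in lin,
    "other_eukaryota": lambda lin: (
        2759 in lin and 4751 not in lin and 33208 not in lin and 33090 not in lin
    ),
}

def kingdoms_for(lineage):
    """Return the labels of all groups whose rule matches the lineage."""
    lin = set(lineage)
    return [name for name, rule in KINGDOM_RULES.items() if rule(lin)]

# Hand-abbreviated lineage: cellular organisms, Eukaryota, Viridiplantae, Zea mays
zea_mays_lineage = [131567, 2759, 33090, 4577]
print(kingdoms_for(zea_mays_lineage))  # ['viridiplantae']
```

Under this reading, the single sequence in toto.fa falls into exactly one group.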

frcamacho commented 3 years ago

I'm having the same issue as described by @tougai. Are there any updates or a potential workaround? Thank you!

bwgoudey commented 2 years ago

I am also facing this issue. I've attached example input files that I derived from the examples that come with conterminator: I appended the "Human-real1" sequence to the "Virus" sequence. This input also does not make it past the rescorediagonal command. It doesn't freeze as such; there is significant CPU activity at this point, so it seems to be stuck processing.

Attachments: example.fas.txt, example.mapping.txt

tararickman commented 2 years ago

I am having a similar issue. The command I ran was:

nohup conterminator dna Ino_0_SCF9.fasta dna.mapping results tmp --threads 30 &

The content of nohup.out is:

Tmp tmp folder does not exist or is not a directory.
createdb Ino_0_SCF9.fasta tmp/17425683134074217680/sequencedb

Converting sequences
[
Time for merging to sequencedb_h: 0h 0m 0s 14ms
Time for merging to sequencedb: 0h 0m 0s 8ms
Database type: Nucleotide
Time for merging to sequencedb.lookup: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 99ms
Tmp tmp/17425683134074217680/createtaxdb folder does not exist or is not a directory.
Download taxdump.tar.gz
2022-03-04 16:48:14 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [57450018/57450018] -> "-" [1]
Database created
Remove temporary files
tmp/17425683134074217680/createtaxdb/createindex.sh: line 58: [: : integer expression expected
splitsequence tmp/17425683134074217680/sequencedb tmp/17425683134074217680/db_rev_split --max-seq-len 1000 --sequence-overlap 0 --sequence-split-mode 1 --create-lookup 0 --threads 30 --compressed 1 -v 3

Sequence split mode (--sequence-split-mode 0) and compressed (--compressed 1) can not be combined.
Turn compressed to 0[=================================================================] 1 0s 1ms
Time for merging to db_rev_split_h: 0h 0m 0s 2ms
Time for merging to db_rev_split: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 13ms
kmermatcher tmp/17425683134074217680/db_rev_split tmp/17425683134074217680/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 30 --compressed 0 -v 3

kmermatcher tmp/17425683134074217680/db_rev_split tmp/17425683134074217680/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 30 --compressed 0 -v 3

Database size: 17160 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 17.16K 0s 47ms

Adjusted k-mer length 24
Sort kmer 0h 0m 0s 54ms
Sort by rep. sequence 0h 0m 0s 31ms
Time for fill: 0h 0m 0s 21ms
Time for merging to pref: 0h 0m 0s 2ms
Time for processing: 0h 0m 0s 189ms
tmp/17425683134074217680/pref exists and will be overwritten.
crosstaxonfilterorf tmp/17425683134074217680/sequencedb tmp/17425683134074217680/db_rev_split_h tmp/17425683134074217680/pref tmp/17425683134074217680/pref_cross --blacklist 10239,12908,28384,81077,11632,340016,61964,48479,48510 --kingdoms (2||2157),4751,33208,33090,(2759&&!4751&&!33208&&!33090) --threads 30 -v 3

Loading NCBI taxonomy
Loading nodes file ... Done, got 2404460 nodes
Loading merged file ... Done, added 66368 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done
[=================================================================] 17.16K 0s 16ms
Time for merging to pref_cross: 0h 0m 0s 37ms
Time for processing: 0h 0m 4s 86ms

The job has been running on 30 threads for about 60 hours now. When I check the process ID:

trickman  960071 2984  0.0 4144048 52420 ?       Rl   Mar04 119592:58 conterminator rescorediagonal tmp/17425683134074217680/db_rev_split tmp/17425683134074217680/db_rev_split tmp/17425683134074217680/pref_cross tmp/17425683134074217680/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0 -a 1 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 100 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 30 --compressed 0 -v 3

I can see that the step that hangs is rescorediagonal. Is there any solution or advice to avoid this?
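
The ps line above already hints that the step is computing rather than sitting idle. A rough check, assuming the line comes from `ps aux`, so that the third column is %CPU and the tenth is accumulated CPU time in minutes:seconds (the helper name is illustrative):

```python
def cpu_hours(bsd_time):
    """Convert a ps 'MMM:SS' cumulative CPU time string to hours."""
    minutes, seconds = bsd_time.split(":")
    return (int(minutes) * 60 + int(seconds)) / 3600

threads = 30
total = cpu_hours("119592:58")   # TIME column from the ps line
per_thread = total / threads
print(f"{total:.1f} CPU-hours, about {per_thread:.1f} hours per thread")
# 1993.2 CPU-hours, about 66.4 hours per thread
```

Roughly 66 hours per thread is in line with the 60 or so hours of wall time and with the 2984% CPU figure (close to 100% on each of 30 threads). That pattern suggests all threads are spinning or doing very slow work, rather than waiting on a lock or on I/O.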

Chenglin20170390 commented 2 years ago

same issue....

lucapandolfini commented 1 year ago

same

martin-steinegger commented 1 year ago

This should be fixed now. I updated conterminator to the newest version of MMseqs2, which should resolve the issue.

bwgoudey commented 1 year ago

Thanks, Martin, for looking into this. Unfortunately, when I run both my own example and the one that @tougai gave, I get the error "Error: rescorediagonal step died". I've attached the stdout and stderr logs: err_log.txt, out_log.txt

aebaci commented 6 months ago

Hi. I seem to have the same issue. I'm running conterminator Version: 1.c74b5 on a few eukaryote assemblies (one per run), and all of them are stuck in rescorediagonal. I couldn't find a log file, but the part of the output I can see (it's running in a screen session, which annoyingly only shows me a bit) is the following. This run in particular has been going for more than a week. Is there anything I can do? If I stop it now, can I restart it from the same step (or from the following step, if staying in rescorediagonal is a glitch)? Thanks.

[=================================================================] 100.00% 1.76K 0s 26ms 19 eta 0s
Time for merging to db_rev_split_h: 0h 0m 0s 93ms
Time for merging to db_rev_split: 0h 0m 0s 97ms
Time for processing: 0h 0m 0s 464ms
kmermatcher tmp/6530182093867110841/db_rev_split tmp/6530182093867110841/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 128 --compressed 0 -v 3

kmermatcher tmp/6530182093867110841/db_rev_split tmp/6530182093867110841/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 128 --compressed 0 -v 3

Database size: 543723 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 100.00% 543.72K 0s 683ms

Adjusted k-mer length 24
Sort kmer 0h 0m 1s 61ms
Sort by rep. sequence 0h 0m 0s 232ms
Time for fill: 0h 0m 0s 198ms
Time for merging to pref: 0h 0m 0s 165ms
Time for processing: 0h 0m 3s 69ms
tmp/6530182093867110841/pref exists and will be overwritten.
crosstaxonfilterorf tmp/6530182093867110841/sequencedb tmp/6530182093867110841/db_rev_split_h tmp/6530182093867110841/pref tmp/6530182093867110841/pref_cross --blacklist 10239,12908,28384,81077,11632,340016,61964,48479,48510 --kingdoms (2||2157),4751,33208,33090,(2759&&!4751&&!33208&&!33090) --threads 128 -v 3

Loading NCBI taxonomy
Loading nodes file ... Done, got 2550769 nodes
Loading merged file ... Done, added 75874 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done
[=================================================================] 100.00% 543.72K 0s 404ms
Time for merging to pref_cross: 0h 0m 0s 31ms
Time for processing: 0h 0m 4s 846ms
rescorediagonal tmp/6530182093867110841/db_rev_split tmp/6530182093867110841/db_rev_split tmp/6530182093867110841/pref_cross tmp/6530182093867110841/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0 -a 1 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 100 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 128 --compressed 0 -v 3
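
Until the underlying problem is resolved, one way to keep a stuck run from occupying a machine for days is to give it a wall-clock limit. Below is a minimal sketch using Python's subprocess module. It does not fix the hang and does not resume a run; it only stops one that overruns. The helper name and the limits are arbitrary choices, and the demo uses a sleeping child process as a stand-in so that it runs without conterminator installed.

```python
import subprocess
import sys

def run_with_limit(cmd, limit_seconds):
    """Run cmd; return its exit code, or None if it exceeded the time limit."""
    try:
        completed = subprocess.run(cmd, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising TimeoutExpired
        return None
    return completed.returncode

# Stand-in for a long-running step: a child Python process that sleeps.
# For a real run, cmd would follow the invocation used in this thread, e.g.
# ["conterminator", "dna", "toto.fa", "toto.fa.taxidmapping", "out", "tmp"].
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_limit(slow, limit_seconds=1))    # None: limit exceeded

quick = [sys.executable, "-c", "pass"]
print(run_with_limit(quick, limit_seconds=30))  # 0: finished in time
```

Only the direct child is killed on timeout. If the tool launches its steps as separate processes, those may need to be cleaned up separately.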