steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

Error: crosstaxonfilterorf step died #8

Closed vtrinca closed 4 years ago

vtrinca commented 4 years ago

The command

conterminator dna Trinity.fasta nt.fna.taxidmapping trinity.results tmp

dies at the crosstaxonfilterorf step:

tmp/2908263996980697262/conterminatordna.sh: line 59: 10751 Segmentation fault (core dumped) $RUNNER "$MMSEQS" crosstaxonfilterorf "$TMP_PATH/sequencedb" "$TMP_PATH/db_rev_split_h" "$TMP_PATH/pref" "$TMP_PATH/pref_cross" ${CROSSTAXONFILTERORF_PAR}
Error: crosstaxonfilterorf step died

The files mentioned in the output are nevertheless present in the tmp/ folder.

martin-steinegger commented 4 years ago

@vtrinca thank you for reporting this issue. Could you please attach the whole log? Is it possible to share the input with me?

vtrinca commented 4 years ago

Thanks for replying. Here is the log:

Tmp tmp folder does not exist or is not a directory.
Create dir tmp
dna Trinity.fasta nt.fna.taxidmapping trinity.results tmp --threads 8 

MMseqs Version:                         3eabfaff83bb77eac5ef342e8905cc4f7d378cb7
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           true
Alignment mode                          3
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0.9
Min. alignment length                   100
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     1000
Compositional bias                      0
Realign hits                            false
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Gap open cost                           5
Gap extension cost                      2
Threads                                 8
Compressed                              0
Verbosity                               3
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             5.7
K-mer size                              15
K-score                                 2147483647
Alphabet size                           21
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        false
Exact k-mer matching                    1
Mask residues                           0
Mask lower case residues                0
Minimum diagonal score                  25
Spaced k-mers                           1
Spaced k-mer pattern                    
Local temporary path                    
Rescore mode                            2
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile e-value threshold               0.001
Use global sequence weighting           false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Omit consensus                          false
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1
Reverse frames                          1
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Number search iterations                1
Start sensitivity                       4
Search steps                            1
Run a seq-profile search in slice mode  false
Strand selection                        2
Disk space limit                        0
MPI runner                              
Force restart with latest tmp           false
Remove temporary files                  true
Database type                           0
Shuffle input database                  true
Createdb mode                           0
NCBI tax dump directory                 
Taxonomical mapping file                
Blacklisted taxa                        10239,12908,28384,81077,11632,340016,61964,48479,48510
Compare across kingdoms                 (2||2157),4751,33208,33090,(2759&&!4751&
createdb Trinity.fasta tmp/2908263996980697262/sequencedb 

Converting sequences
[=============
Time for merging to sequencedb_h: 0h 0m 0s 92ms
Time for merging to sequencedb: 0h 0m 0s 174ms
Database type: Nucleotide
Time for merging to sequencedb.lookup: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 450ms
Tmp tmp/2908263996980697262/createtaxdb folder does not exist or is not a directory.
Download taxdump.tar.gz
2020-05-14 09:29:01 URL:https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [52082793/52082793] -> "-" [1]
Database created
Remove temporary files
tmp/2908263996980697262/createtaxdb/createindex.sh: line 58: [: : integer expression expected
splitsequence tmp/2908263996980697262/sequencedb tmp/2908263996980697262/db_rev_split --max-seq-len 1000 --sequence-overlap 0 --sequence-split-mode 1 --create-lookup 0 --threads 8 --compressed 1 -v 3 

Sequence split mode (--sequence-split-mode 0) and compressed (--compressed 1) can not be combined.
Turn compressed to 0[=================================================================] 131.69K 0s 34ms
Time for merging to db_rev_split_h: 0h 0m 0s 56ms
Time for merging to db_rev_split: 0h 0m 0s 54ms
Time for processing: 0h 0m 0s 251ms
kmermatcher tmp/2908263996980697262/db_rev_split tmp/2908263996980697262/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 8 --compressed 0 -v 3 

kmermatcher tmp/2908263996980697262/db_rev_split tmp/2908263996980697262/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 21 --min-seq-id 0.9 --kmer-per-seq 100 --spaced-kmer-mode 1 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 24 -c 0 --max-seq-len 1000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 8 --compressed 0 -v 3 

Database size: 189836 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 189.84K 1s 389ms

Adjusted k-mer length 24
Sort kmer 0h 0m 1s 148ms
Sort by rep. sequence 0h 0m 0s 691ms
Time for fill: 0h 0m 0s 140ms
Time for merging to pref: 0h 0m 0s 51ms
Time for processing: 0h 0m 3s 917ms
tmp/2908263996980697262/pref exists and will be overwritten.
tmp/2908263996980697262/conterminatordna.sh: line 59: 32478 Segmentation fault      (core dumped) $RUNNER "$MMSEQS" crosstaxonfilterorf "$TMP_PATH/sequencedb" "$TMP_PATH/db_rev_split_h" "$TMP_PATH/pref" "$TMP_PATH/pref_cross" ${CROSSTAXONFILTERORF_PAR}
Error: crosstaxonfilterorf step died

log_output.txt trinity_head20.txt

martin-steinegger commented 4 years ago

@vtrinca Thank you! Could you please check whether a subset of the input also causes this error? If so, could you please attach the FASTA and mapping files?
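
One way to cut such a subset is with a few lines of awk. This is only a sketch: the function name `fasta_head` and the demo file names are invented for illustration, and it assumes every record starts with a line beginning with `>`.

```shell
#!/bin/sh
# Sketch: print the first N records of a FASTA file.
# Assumption: a record is a ">" header line plus the sequence lines after it.
fasta_head() {
    n="$1"       # number of records to keep
    fasta="$2"   # input FASTA file
    awk -v n="$n" '
        /^>/       { count++ }   # each header starts a new record
        count > n  { exit }      # stop once record n+1 begins
        count >= 1 { print }     # skip anything before the first header
    ' "$fasta"
}

# Demo on a tiny made-up file (hypothetical data).
subset_dir=$(mktemp -d)
printf '>a\nAC\nGT\n>b\nGG\n>c\nTT\n' > "$subset_dir/all.fasta"
fasta_head 2 "$subset_dir/all.fasta" > "$subset_dir/subset.fasta"
cat "$subset_dir/subset.fasta"
```

The matching mapping file still has to cover every identifier that remains in the subset.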

vtrinca commented 4 years ago

Hi Martin, same error on a subset! Trinity.txt output.txt

For the mapping file, I used the same command as in the README:

blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping

The file is too big to attach here, so I am sending only the first 200 lines: nt.txt

martin-steinegger commented 4 years ago

Ah, I think the issue is that you need to taxonomically label your Trinity identifiers. Using the mapping from the nt database will not work; it was just an example to demonstrate how to compare the nt database against itself. The following command extracts the accession and taxonomic identifier of each nt entry:

blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping

So you need to create your own mapping file that assigns each entry to an NCBI taxonomic identifier. Each line of the mapping file should contain the FASTA header followed by the taxonomic identifier. Example:

TRINITY_DN20629_c0_g1_i1    9606
TRINITY_DN20629_c0_g2_i1    562
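
If all sequences in the FASTA file come from one organism, a mapping file of this shape can be generated rather than written by hand. The following is a minimal sketch under stated assumptions, not part of conterminator: the function name `make_taxid_mapping` and the demo files are hypothetical, the identifier is taken to be the header text up to the first whitespace, and the two columns are separated by a single space to mirror the `%a %T` format of the blastdbcmd command above.

```shell
#!/bin/sh
# Sketch: write "<identifier> <taxid>" for every sequence in a FASTA file.
# Assumptions (not from conterminator itself): every sequence gets the same
# taxid, and the identifier is the header up to the first whitespace.
make_taxid_mapping() {
    fasta="$1"   # input FASTA, e.g. Trinity.fasta
    taxid="$2"   # NCBI taxonomy ID assigned to every sequence
    awk -v taxid="$taxid" '
        /^>/ {
            id = substr($1, 2)   # drop the leading ">"
            print id " " taxid
        }
    ' "$fasta"
}

# Demo on a tiny made-up file; 9606 is only a placeholder taxid.
demo_dir=$(mktemp -d)
printf '>TRINITY_DN20629_c0_g1_i1 len=312\nACGT\n>TRINITY_DN20629_c0_g2_i1 len=250\nGGCC\n' > "$demo_dir/demo.fasta"
make_taxid_mapping "$demo_dir/demo.fasta" 9606 > "$demo_dir/demo.taxidmapping"
cat "$demo_dir/demo.taxidmapping"
```

Sequences from different organisms would need their own taxid per line, which this sketch does not attempt.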

vtrinca commented 4 years ago

Now that I have made my own mapping file, conterminator is working. Thank you for your attention.

aaronphillips7493 commented 4 years ago

Hello, I have the same error because I did the same thing as vtrinca. I am so happy to hear there is an answer to the question! However, I am unsure how to create my own mapping file. Can you please shed some light on how this is achieved?

Thanks, Aaron :)

martin-steinegger commented 4 years ago

@aaronphillips7493 could you please explain your use case?

aaronphillips7493 commented 4 years ago

I am trying to detect contamination (bacteria, arthropods, fungi) in a plant genome assembly that I have recently finished. Do you need more info?

gbdias commented 4 years ago

Hi @aaronphillips7493,

cat mygenome.fa nt.fna > mydb.fa
cat mygenome.fa.taxidmapping nt.fna.taxidmapping > mydb.fa.taxidmapping
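
Since the crash in this thread came from sequence identifiers that had no taxonomic label, it may help to confirm that every header in the combined FASTA has a line in the combined mapping before running conterminator. The check below is a sketch only: `missing_taxids` and the demo files are hypothetical names, and it assumes the identifier is the header text up to the first whitespace and that the first column of the mapping holds that identifier.

```shell
#!/bin/sh
# Sketch: list FASTA identifiers that have no entry in the taxid mapping.
# Assumptions: identifier = header up to the first whitespace; the mapping
# file has the identifier in column 1 and the taxid in column 2.
missing_taxids() {
    fasta="$1"
    mapping="$2"
    awk '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # mapping: remember IDs
        /^>/ {
            id = substr($1, 2)                       # drop the leading ">"
            if (!(id in seen)) print id              # header without a taxid
        }
    ' "$mapping" "$fasta"
}

# Demo on made-up files; the taxids are placeholder values.
check_dir=$(mktemp -d)
printf '>contig1 len=10\nACGT\n>contig2\nGGCC\n>contig3\nTTAA\n' > "$check_dir/mydb.fa"
printf 'contig1 3702\ncontig3 562\n' > "$check_dir/mydb.fa.taxidmapping"
missing_taxids "$check_dir/mydb.fa" "$check_dir/mydb.fa.taxidmapping" > "$check_dir/missing.txt"
cat "$check_dir/missing.txt"
```

In the demo, `contig2` is reported because it has no line in the mapping. An empty result only shows that every header is labeled; it does not rule out other causes of the error.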

keishaboateng97 commented 9 months ago

Hey, I am also running into the same error. I made the necessary input files myself, and I have checked that they work with NCBI. The command does not work on a subset either. What can be done about this error?