soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

Segfault with predictexons step (running easy-predict) #6

Closed by domenico-simone 3 years ago

domenico-simone commented 4 years ago

Expected Behavior

The whole easy-predict pipeline runs to completion.

Current Behavior

A segmentation fault occurs during the predictexons step. It happens with the pre-compiled binaries, with a self-compiled build, and with the Singularity container.

Steps to Reproduce (for bugs)

mkdir -p /tmp/dK

metaeuk easy-predict assembly/megahit/P4_peat/euk.fa data/annotations_ref/Pfam-A.full.gz annotations/contigs/metaeuk/P4_peat.fas /tmp/dK --slice-search

MetaEuk Output (for bugs)

$ metaeuk easy-predict assembly/megahit/P4_peat/euk.fa data/annotations_ref/Pfam-A.full.gz annotations/contigs/metaeuk/P4_peat.fas /tmp/dK --slice-search
easy-predict assembly/megahit/P4_peat/euk.fa data/annotations_ref/Pfam-A.full.gz annotations/contigs/metaeuk/P4_peat.fas /tmp/dK --slice-search

MMseqs Version:                                                 4a7064d728f7a31df7221102078e206920c7d822-MPI
Substitution matrix                                             nucl:nucleotide.out,aa:blosum62.out
Add backtrace                                                   false
Alignment mode                                                  2
Allow wrapped scoring                                           false
E-value threshold                                               100
Seq. id. threshold                                              0
Min. alignment length                                           0
Seq. id. mode                                                   0
Alternative alignments                                          0
Coverage threshold                                              0
Coverage mode                                                   0
Max sequence length                                             65535
Compositional bias                                              1
Realign hits                                                    false
Max reject                                                      2147483647
Max accept                                                      2147483647
Include identical seq. id.                                      false
Preload mode                                                    0
Pseudo count a                                                  1
Pseudo count b                                                  1.5
Score bias                                                      0
Gap open cost                                                   11
Gap extension cost                                              1
Threads                                                         20
Compressed                                                      0
Verbosity                                                       3
Seed substitution matrix                                        nucl:nucleotide.out,aa:VTML80.out
Sensitivity                                                     4
K-mer size                                                      0
K-score                                                         2147483647
Alphabet size                                                   21
Max results per query                                           300
Split database                                                  0
Split mode                                                      2
Split memory limit                                              0
Diagonal scoring                                                true
Exact k-mer matching                                            0
Mask residues                                                   1
Mask lower case residues                                        0
Minimum diagonal score                                          15
Spaced k-mers                                                   1
Spaced k-mer pattern
Local temporary path
Rescore mode                                                    0
Remove hits by seq. id. and coverage                            false
Sort results                                                    0
Mask profile                                                    1
Profile e-value threshold                                       0.001
Use global sequence weighting                                   false
Allow deletions                                                 false
Filter MSA                                                      1
Maximum seq. id. threshold                                      0.9
Minimum seq. id.                                                0
Minimum score per column                                        -20
Minimum coverage                                                0
Select N most diverse seqs                                      1000
Omit consensus                                                  false
Min codons in orf                                               15
Max codons in length                                            32734
Max orf gaps                                                    2147483647
Contig start mode                                               2
Contig end mode                                                 2
Orf start mode                                                  1
Forward frames                                                  1,2,3
Reverse frames                                                  1,2,3
Translation table                                               1
Translate orf                                                   0
Use all table starts                                            false
Offset of numeric ids                                           0
Create lookup                                                   0
Add orf stop                                                    false
Chain overlapping alignments                                    0
Merge query                                                     1
Search type                                                     0
Number search iterations                                        1
Start sensitivity                                               4
Search steps                                                    1
Run a seq-profile search in slice mode                          true
Strand selection                                                1
Disk space limit                                                0
MPI runner
Force restart with latest tmp                                   true
Remove temporary files                                          false
maximal combined evalue of an optimal set                       0.001
minimal length ratio between combined optimal set and target    0.5
Maximal intron length                                           10000
Minimal intron length                                           15
Minimal exon length aa                                          11
Maximal overlap of exons                                        10
Gap open penalty                                                -1
Gap extend penalty                                              -1
allow same-strand overlaps                                      0
translate codons to AAs                                         0
write target key instead of accession                           0
Reverse AA Fragments                                            0

/tmp/dK/14029426496971440479/contigs exists and will be overwritten.
createdb assembly/megahit/P4_peat/euk.fa /tmp/dK/14029426496971440479/contigs --dbtype 2 --compressed 0 -v 3

Converting sequences
[203] 0s 8ms
Time for merging to contigs_h: 0h 0m 0s 2ms
Time for merging to contigs: 0h 0m 0s 4ms
Database type: Nucleotide
Time for merging to contigs.lookup: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 22ms
/tmp/dK/14029426496971440479/targets exists and will be overwritten.
createdb data/annotations_ref/Pfam-A.full.gz /tmp/dK/14029426496971440479/targets --dbtype 1 --compressed 0 -v 3

Converting sequences

Time for merging to targets_h: 0h 0m 0s 1ms
Time for merging to targets: 0h 2m 14s 656ms
Database type: Aminoacid
Time for merging to targets.lookup: 0h 0m 0s 0ms
Time for processing: 0h 6m 3s 717ms
predictexons /tmp/dK/14029426496971440479/contigs /tmp/dK/14029426496971440479/targets /tmp/dK/14029426496971440479/MetaEuk_calls /tmp/dK/14029426496971440479/tmp_predict --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 100 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 20 --compressed 0 -v 3 --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --omit-consensus 0 --min-length 15 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --chain-alignments 0 --merge-query 1 --search-type 0 --num-iterations 1 --start-sens 4 --sens-steps 1 --slice-search 1 --strand 1 --disk-space-limit 0 --force-reuse 1 --remove-tmp-files 0 --metaeuk-eval 0.001 --metaeuk-tcov 0.5 --max-intron 10000 --min-intron 15 --min-exon-aa 11 --max-overlap 10 --set-gap-open -1 --set-gap-extend -1 --reverse-fragments 0

search /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/aa_6f /tmp/dK/14029426496971440479/targets /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/search_res /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/tmp_search --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 100 --min-seq-id 0 --min-aln-len 11 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 20 --compressed 0 -v 3 --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --omit-consensus 0 --min-length 15 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --chain-alignments 0 --merge-query 1 --search-type 0 --num-iterations 1 --start-sens 4 --sens-steps 1 --slice-search 1 --strand 1 --disk-space-limit 0 --force-reuse 1 --remove-tmp-files 0

search /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/aa_6f /tmp/dK/14029426496971440479/targets /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/search_res /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/tmp_search --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 100 --min-seq-id 0 --min-aln-len 11 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --threads 20 --compressed 0 -v 3 --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --max-seq-id 0.9 --qid 0 --qsc -20 --cov 0 --diff 1000 --omit-consensus 0 --min-length 15 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --chain-alignments 0 --merge-query 1 --search-type 0 --num-iterations 1 --start-sens 4 --sens-steps 1 --slice-search 1 --strand 1 --disk-space-limit 0 --force-reuse 1 --remove-tmp-files 0

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              r48
  Local adapter:           mlx4_0
  Local port:              2

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   r48
  Local device: mlx4_0
--------------------------------------------------------------------------
MPI Init
Rank: 0 Size: 1
prefilter /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/tmp_search/11779388951925794484/profileDB /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/aa_6f /tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/tmp_search/11779388951925794484/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4 -k 0 --k-score 2147483647 --alph-size 21 --max-seq-len 65535 --max-seqs 2147483647 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 20 --compressed 0 -v 3

Query database size: 1 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 39917 type: Aminoacid
Index table k-mer threshold: 127 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 39.92K 0s 94ms
Index table: Masked residues: 17446
Index table: fill
[=================================================================] 100.00% 39.92K 0s 62ms
Index statistics
Entries:          1571631
DB size:          497 MB
Avg k-mer size:   0.024557
Top 10 k-mers
    RRRRRR      13
    RRRRPR      9
    RRARRR      9
    RLRRRR      9
    ASRRRR      9
    RSPSRR      8
    SSRSRS      8
    RRRRSS      8
    RSLRRR      7
    PRPRRR      7
Time for index table init: 0h 0m 0s 817ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 127
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 39917
[=================================================================] 100.00% 1 eta -
[r48:31851] *** Process received signal ***
[r48:31851] Signal: Segmentation fault (11)
[r48:31851] Signal code: Address not mapped (1)
[r48:31851] Failing at address: 0xfeb000
[r48:31851] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2b7d18fdd5f0]
[r48:31851] [ 1] metaeuk(_ZN18SubstitutionMatrix25calcLocalAaBiasCorrectionEPK10BaseMatrixPKhiPf+0x101)[0x4e9221]
[r48:31851] [ 2] metaeuk(_ZN12QueryMatcher10matchQueryEP8Sequencej+0x2bf)[0x549d4f]
[r48:31851] [ 3] metaeuk[0x535ed5]
[r48:31851] [ 4] metaeuk(_ZN12Prefiltering8runSplitERKSsS1_mb+0x679)[0x539029]
[r48:31851] [ 5] metaeuk(_ZN12Prefiltering9runSplitsERKSsS1_mmb+0x9a6)[0x53a246]
[r48:31851] [ 6] metaeuk(_ZN12Prefiltering12runMpiSplitsERKSsS1_S1_+0x326)[0x53ab76]
[r48:31851] [ 7] metaeuk(_Z9prefilteriPPKcRK7Command+0x340)[0x531850]
[r48:31851] [ 8] metaeuk(_Z10runCommandP7CommandiPPKc+0x39)[0x492c59]
[r48:31851] [ 9] metaeuk(main+0x570)[0x443650]
[r48:31851] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b7d1920c505]
[r48:31851] [11] metaeuk[0x4533ff]
[r48:31851] *** End of error message ***
/tmp/dK/14029426496971440479/tmp_predict/1327612393427599754/tmp_search/11779388951925794484/searchslicedtargetprofile.sh: line 173: 31851 Segmentation fault      (core dumped) ${RUNNER} "$MMSEQS" prefilter "${PROFILEDB}" "${INPUT}" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: prefilter died
Error: search step died
Error: predictexons step died

Thanks,

Domenico

milot-mirdita commented 4 years ago

The Pfam profiles need to be in the correct MetaEuk (MMseqs2) format.

We just updated MetaEuk to use the latest version of the MMseqs2 framework. This update exposes the databases module, which lets you easily download and prepare some common databases.

The following should work:

metaeuk databases Pfam-A.full pfam tmp
metaeuk easy-predict contigs.fa pfam out.fa tmp
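
As a rough pre-check before handing a reference file to easy-predict, it can help to confirm what format the file is actually in. Pfam-A.full.gz is distributed as a Stockholm multiple-alignment file rather than FASTA, which is presumably why passing it directly as the target did not give a database in the expected format. The sketch below is only an illustration: the helper name is_stockholm is hypothetical and not part of MetaEuk or MMseqs2, and it assumes a gzip that passes uncompressed input through unchanged when given -cdf (as GNU gzip does).

```shell
# Hypothetical helper, not part of MetaEuk or MMseqs2.
# Succeeds (exit status 0) if the file, gzip-compressed or plain,
# starts with the Stockholm header line "# STOCKHOLM ...".
is_stockholm() {
    file="$1"
    # -c writes to stdout, -d decompresses, -f lets plain files pass through
    # unchanged (assumed GNU gzip behaviour).
    first_line=$(gzip -cdf -- "$file" 2>/dev/null | head -n 1)
    case "$first_line" in
        "# STOCKHOLM"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

If is_stockholm data/annotations_ref/Pfam-A.full.gz succeeds, the file is an alignment collection and not a plain sequence file, and preparing the profiles with the databases module as shown above is the safer route.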