soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

Very long chromosomes not supported? #74

Closed: zhangrengang closed this issue 1 year ago

zhangrengang commented 1 year ago

Expected Behavior

metaeuk runs normally with other genomes but crashes with a large pine genome (Pinus tabuliformis, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA784915). Does it not support very long chromosomes? The index of the input genome looks like this:

$ head busco_3011229/genome.fasta.fai
chr1    2364278061      6       80      81
chr10   1752849333      2393831550      80      81
chr11   1650012615      4168591507      80      81
chr12   1392452741      5839229287      80      81
chr2    2317450362      7249087694      80      81
chr3    2291775479      9595506192      80      81
chr4    2192534405      11915928871     80      81
chr5    2148190925      14135869963     80      81
chr6    2107674557      16310913281     80      81
chr7    2082167746      18444933776     80      81
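
As a quick sanity check on the index above, the sketch below flags sequences longer than 2^31 - 1 (2,147,483,647) bases. That threshold is only an assumption, chosen because it is the largest signed 32-bit integer; nothing in this thread confirms that MetaEuk or MMseqs2 uses it as a limit. The function name `long_sequences` is hypothetical, and the input is the `.fai` excerpt shown above (column 1 is the sequence name, column 2 its length).

```python
# Assumed threshold: largest signed 32-bit integer. This is NOT a confirmed
# MetaEuk/MMseqs2 limit, just a plausible value to check lengths against.
INT32_MAX = 2**31 - 1


def long_sequences(fai_lines, limit=INT32_MAX):
    """Return (name, length) pairs from .fai lines whose length exceeds limit."""
    flagged = []
    for line in fai_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length > limit:
            flagged.append((name, length))
    return flagged


# The first ten records of genome.fasta.fai, copied from the listing above.
fai_head = """\
chr1    2364278061      6       80      81
chr10   1752849333      2393831550      80      81
chr11   1650012615      4168591507      80      81
chr12   1392452741      5839229287      80      81
chr2    2317450362      7249087694      80      81
chr3    2291775479      9595506192      80      81
chr4    2192534405      11915928871     80      81
chr5    2148190925      14135869963     80      81
chr6    2107674557      16310913281     80      81
chr7    2082167746      18444933776     80      81
"""

flagged = long_sequences(fai_head.splitlines())
for name, length in flagged:
    print(f"{name}\t{length}\texceeds {INT32_MAX}")
```

On this excerpt, chr1 through chr5 are above the assumed threshold, while the other five listed chromosomes are below it. To check a whole index, pass the lines of the `.fai` file instead of the embedded excerpt.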

MetaEuk Output (for bugs)

$ metaeuk easy-predict busco_3011229/genome.fasta pep.faa tmp tmpDir --max-intron 500000 --threads 16
Create directory tmpDir
easy-predict busco_3011229/genome.fasta pep.faa tmp tmpDir --max-intron 500000 --threads 16

MMseqs Version:                                                 f9c166910e2ae85e1e77eaf3e22291505402c1a7
Substitution matrix                                             nucl:nucleotide.out,aa:blosum62.out
Add backtrace                                                   false
Alignment mode                                                  2
Alignment mode                                                  0
Allow wrapped scoring                                           false
E-value threshold                                               100
Seq. id. threshold                                              0
Min alignment length                                            0
Seq. id. mode                                                   0
Alternative alignments                                          0
Coverage threshold                                              0
Coverage mode                                                   0
Max sequence length                                             65535
Compositional bias                                              1
Max reject                                                      2147483647
Max accept                                                      2147483647
Include identical seq. id.                                      false
Preload mode                                                    0
Pseudo count a                                                  1
Pseudo count b                                                  1.5
Score bias                                                      0
Realign hits                                                    false
Realign score bias                                              -0.2
Realign max seqs                                                2147483647
Gap open cost                                                   nucl:5,aa:11
Gap extension cost                                              nucl:2,aa:1
Zdrop                                                           40
Threads                                                         16
Compressed                                                      0
Verbosity                                                       3
Seed substitution matrix                                        nucl:nucleotide.out,aa:VTML80.out
Sensitivity                                                     4
k-mer length                                                    0
k-score                                                         2147483647
Alphabet size                                                   nucl:5,aa:21
Max results per query                                           300
Split database                                                  0
Split mode                                                      2
Split memory limit                                              0
Diagonal scoring                                                true
Exact k-mer matching                                            0
Mask residues                                                   1
Mask lower case residues                                        0
Minimum diagonal score                                          15
Spaced k-mers                                                   1
Spaced k-mer pattern
Local temporary path
Rescore mode                                                    0
Remove hits by seq. id. and coverage                            false
Sort results                                                    0
Mask profile                                                    1
Profile E-value threshold                                       0.001
Global sequence weighting                                       false
Allow deletions                                                 false
Filter MSA                                                      1
Maximum seq. id. threshold                                      0.9
Minimum seq. id.                                                0
Minimum score per column                                        -20
Minimum coverage                                                0
Select N most diverse seqs                                      1000
Min codons in orf                                               15
Max codons in length                                            32734
Max orf gaps                                                    2147483647
Contig start mode                                               2
Contig end mode                                                 2
Orf start mode                                                  1
Forward frames                                                  1,2,3
Reverse frames                                                  1,2,3
Translation table                                               1
Translate orf                                                   0
Use all table starts                                            false
Offset of numeric ids                                           0
Create lookup                                                   0
Add orf stop                                                    false
Overlap between sequences                                       0
Sequence split mode                                             1
Header split mode                                               0
Chain overlapping alignments                                    0
Merge query                                                     1
Search type                                                     0
Search iterations                                               1
Start sensitivity                                               4
Search steps                                                    1
Exhaustive search mode                                          false
Filter results during exhaustive search                         0
Strand selection                                                1
LCA search mode                                                 false
Disk space limit                                                0
MPI runner
Force restart with latest tmp                                   false
Remove temporary files                                          false
maximal combined evalue of an optimal set                       0.001
minimal length ratio between combined optimal set and target    0.5
Maximal intron length                                           500000
Minimal intron length                                           15
Minimal exon length aa                                          11
Maximal overlap of exons                                        10
Gap open penalty                                                -1
Gap extend penalty                                              -1
allow same-strand overlaps                                      0
translate codons to AAs                                         0
write target key instead of accession                           0
Reverse AA Fragments                                            0

createdb busco_3011229/genome.fasta tmpDir/15420076123933152342/contigs --dbtype 2 --compressed 0 -v 3

Converting sequences

Time for merging to contigs_h: 0h 0m 0s 32ms
Time for merging to contigs: 0h 0m 0s 0ms
Database type: Nucleotide
The input files have no entry:  - busco_3011229/genome.fasta
Please check your input files. Only files in fasta/fastq[.gz|bz2] are supported
Error: contigs createdb died

elileka commented 1 year ago

Please see my comment on https://github.com/soedinglab/metaeuk/issues/77