soedinglab / metaeuk

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
GNU General Public License v3.0

easy-predict output #23

Closed nixoIas closed 3 years ago

nixoIas commented 3 years ago

Expected Behavior

Apologies if I'm misreading the issue here. I'm using the easy-predict command from MetaEuk and expected a predsResults.fas file containing the sequences of all predicted genes. My command is: metaeuk easy-predict sequence.fasta pfamseq predsResults tempFolder. Here, sequence.fasta is the whole-genome shotgun sequence of the fungus Encephalitozoon cuniculi, and pfamseq is the pfamseq.gz file from Pfam, which contains the amino acid sequences of the protein database.

Current Behavior

I'm currently getting a list of files named targets.1, targets.index.1, targets.2, targets.index.2... all the way up to 31. The targets.1 files each seem to contain one long amino acid sequence without a file header. The targets.index.1 files are empty. There are also a few targets_h.index.1 files that contain three columns of numbers. Is there a way to compile the sequence of all the predicted genes into one file?

MetaEuk Output (for bugs)

The metaeuk command output is:

Create directory tempFolder
easy-predict sequence.fasta pfamseq predsResults tempFolder

MMseqs Version: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 100
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Threads 8
Compressed 0
Verbosity 3
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 4
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 15
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
maximal combined evalue of an optimal set 0.001
minimal length ratio between combined optimal set and target 0.5
Maximal intron length 10000
Minimal intron length 15
Minimal exon length aa 11
Maximal overlap of exons 10
Gap open penalty -1
Gap extend penalty -1
allow same-strand overlaps 0
translate codons to AAs 0
write target key instead of accession 0
write fragment contig coords 0
Reverse AA Fragments 0

createdb sequence.fasta tempFolder/13880924210747699746/contigs --dbtype 2 --compressed 0 -v 3

Converting sequences

Time for merging to contigs_h: 0h 0m 0s 4ms
Time for merging to contigs: 0h 0m 0s 3ms
Database type: Nucleotide
Time for processing: 0h 0m 0s 22ms

createdb pfamseq tempFolder/13880924210747699746/targets --dbtype 1 --compressed 0 -v 3

Converting sequences
Can not write to data file tempFolder/13880924210747699746/targets.12
Error: targets createdb died

milot-mirdita commented 3 years ago

Could you check whether you have enough disk space left? The pfamseq.gz file seems to be quite large, and MetaEuk needs the sequences uncompressed (it also needs to store a temporary copy of the uncompressed sequences before creating the final database).
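
For anyone hitting the same "Can not write to data file" error, here is a minimal sketch of a pre-flight disk-space check. It is not part of MetaEuk. The helper names `uncompressed_size` and `has_room` are hypothetical, and the factor of two (one temporary copy plus the final database) is an assumption read off the comment above, not a documented requirement. It streams through the gzip archive to measure its decompressed size and compares that with the free space on the file system holding the temporary folder.

```python
import gzip
import shutil


def uncompressed_size(gz_path, chunk_size=1 << 20):
    """Return the decompressed size in bytes by streaming through the archive."""
    total = 0
    with gzip.open(gz_path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def has_room(gz_path, work_dir, copies=2):
    """Check whether work_dir has space for `copies` decompressed copies.

    copies=2 is an assumption: one temporary uncompressed copy plus the
    final database. Index and header files are not accounted for.
    Returns (enough_space, bytes_needed, bytes_free).
    """
    needed = copies * uncompressed_size(gz_path)
    free = shutil.disk_usage(work_dir).free
    return free >= needed, needed, free


# Example call (paths taken from the command in this issue):
# ok, needed, free = has_room("pfamseq.gz", "tempFolder")
# print(f"needed ~{needed / 1e9:.1f} GB, free {free / 1e9:.1f} GB, ok={ok}")
```

Streaming through a large archive takes a while, but it avoids relying on the size field in the gzip trailer, which wraps around for files larger than 4 GiB.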

nixoIas commented 3 years ago

That was it. Thank you!

ys117vt commented 3 years ago

Hey @milot-mirdita, does this also apply to other reference databases, i.e. does MetaEuk need the uncompressed version of those as well? Thanks!