Using CDS
When I use the GFF files generated by Prokka, createsetdb reports "Not enough columns in GFF file":
./spacedust createsetdb *fna setDB tmpFolder --gff-dir gff.txt --gff-type CDS
An error then occurs when running the next command:
./spacedust clustersearch setDB setDB result.tsv tmpFolder
Using faa
Building the database raises no error, but an error also occurs when running:
./spacedust clustersearch setDB setDB result.tsv tmpFolder
A puzzling point
With the example data provided in the current repository, CDS mode still reports "Not enough columns in GFF file", while faa mode finishes within a few minutes.
My GFF and faa files were generated with Prokka. The size of my genomes is about 4.5M. Even with the same commands, my own data does not run properly.
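To narrow down the "Not enough columns in GFF file" message, the GFF files can be checked independently of spacedust. The sketch below is only a diagnostic aid, not a known fix: it assumes the standard GFF3 layout of nine tab-separated fields per feature line, and the function name `find_short_gff_lines` is made up for illustration. Prokka writes the contig sequences into the same .gff file after a `##FASTA` line, so the check stops there. Whether spacedust itself handles that section is not something this report establishes.

```python
from typing import Iterable, List, Tuple

GFF_COLUMNS = 9  # GFF3 feature lines have nine tab-separated fields


def find_short_gff_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for feature lines with fewer than nine fields.

    Hypothetical helper for illustration; it is not part of spacedust or Prokka.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section: no feature lines follow
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) < GFF_COLUMNS:
            problems.append((number, len(fields)))
    return problems
```

Passing an open file handle for one Prokka .gff file to this function returns the line numbers that a strict nine-column parser would reject, for example lines where tabs were replaced by spaces. An empty list means the feature lines themselves look well-formed, which would point at something else, such as the contents of the file given to --gff-dir.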
Your Environment
I ran the same commands separately on Ubuntu and on CentOS. On both systems example_data runs, but my own data fails.
spacedust Output (for bugs)
The output of the command ./spacedust clustersearch setDB setDB result.tsv tmpFolder:
clustersearch setDB setDB result.tsv tmpFolder
MMseqs Version: 16b020301be952232d6eb2eaa2cd2ad0933d68b0
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace true
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 10
Seq. id. threshold 0
Min alignment length 30
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.8
Coverage mode 2
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 256
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 5.7
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Gap pseudo count 10
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
Use simple best hit true
Include sub-optimal hits with factor 0
Alpha 1
Aggregation mode 0
Filter self match false
Multihit P-value cutoff 0.01
Clustering and Ordering P-value cutoff 0.01
Maximum gene gaps 3
Minimal cluster size 2
Cluster weighting factor false
Database output true
Cluster search against profiles false
Cluster Search Mode 0
Create directory tmpFolder/3152204347500479419/search
search setDB setDB tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/search --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3 --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --gap-pc 10 --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --chain-alignments 0 --merge-query 1 --search-type 0 --start-sens 4 --sens-steps 1 --exhaustive-search 0 --exhaustive-search-filter 0 --strand 1 --lca-search 0 --disk-space-limit 0 --force-reuse 0 --remove-tmp-files 0
prefilter setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 256 --compressed 0 -v 3 -s 5.7
Query database size: 12719 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 12719 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 12.72K 0s 65ms
Index table: Masked residues: 15234
Index table: fill
[=================================================================] 100.00% 12.72K 0s 39ms
Index statistics
Entries: 3785086
DB size: 509 MB
Avg k-mer size: 0.059142
Top 10 k-mers
GPGGTL 64
GQQVAR 39
SQQSER 30
GLGNGK 24
SGGSLR 24
QLGQRV 24
LPDEFY 23
GQQIAR 21
GEQVAR 21
LGNAST 20
Time for index table init: 0h 0m 0s 583ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12719
Target db start 1 to 12719
[=================================================================] 100.00% 12.72K 3s 22ms
301.207794 k-mers per position
6149 DB matches per sequence
0 overflows
55 sequences passed prefiltering per query sequence
45 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 14ms
Time for processing: 0h 0m 4s 194ms
align setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 tmpFolder/3152204347500479419/result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 12719 type: Aminoacid
Target database size: 12719 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 12.72K 0s 547ms
Time for merging to result: 0h 0m 0s 15ms
459801 alignments calculated
78951 sequence pairs passed the thresholds (0.171707 of overall calculated)
6.207328 hits per query sequence
Time for processing: 0h 0m 0s 775ms
prefixid tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/result_prefixed --threads 256 -v 3
[=================================================================] 100.00% 12.72K 0s 62ms
Time for merging to result_prefixed: 0h 0m 0s 9ms
Time for processing: 0h 0m 0s 264ms
besthitbyset setDB setDB tmpFolder/3152204347500479419/result_prefixed tmpFolder/3152204347500479419/aggregate --simple-best-hit 1 --suboptimal-hits 0 --threads 256 --compressed 0 -v 3
[=================================================================] 100.00% 12.72K 0s 81ms
Time for merging to aggregate: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 316ms
mergeresultsbyset setDB_set_to_member tmpFolder/3152204347500479419/aggregate tmpFolder/3152204347500479419/aggregate_merged --threads 256 -v 3
Time for merging to aggregate_merged: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 254ms
combinehits setDB setDB tmpFolder/3152204347500479419/aggregate_merged tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419 --alpha 1 --aggregation-mode 0 --filter-self-match 0 --threads 256 --compressed 0 -v 3
[=================================================================] 100.00% 3 0s 53ms
Time for merging to matches_h: 0h 0m 0s 9ms
Time for merging to matches: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 407ms
clusterhits setDB setDB tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419/clusters --multihit-pval 0.01 --cluster-pval 0.01 --max-gene-gap 3 --cluster-size 2 --cluster-use-weight 0 --db-output 1 --alpha 1 --threads 256 --compressed 0 -v 3
Invalid query lookup record ] 0.00% 1 eta -
Error: clusterhits failed
Expected Behavior
Test and obtain the expected gene cluster.