soedinglab / spacedust

Discovery of conserved gene clusters in multiple genomes
GNU General Public License v3.0

Unable to run spacedust normally #3

Open Dx-wmc opened 1 year ago

Dx-wmc commented 1 year ago

Expected Behavior

Run the test and obtain the expected gene clusters.

Current Behavior

Using CDS

When I use the GFF file generated by Prokka, the following command reports "Not enough columns in GFF file":

./spacedust createsetdb *fna setDB tmpFolder --gff-dir gff.txt --gff-type CDS

When I then run the next command, ./spacedust clustersearch setDB setDB result.tsv tmpFolder, an error occurs.

Using faa

There is no error when building the database, but an error also occurs when running ./spacedust clustersearch setDB setDB result.tsv tmpFolder.

A puzzling point

When I use the example data provided in the current repository, CDS still reports "Not enough columns in GFF file", while faa runs to completion within a few minutes.

My gff and faa files were generated using Prokka. The size of my genomes is about 4.5M. Despite using the same commands, my own data does not work properly.
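One way to narrow down the "Not enough columns in GFF file" message is to check the GFF files independently of spacedust. The sketch below is only an illustration, not spacedust's own parser: it assumes the standard GFF3 layout of nine tab-separated columns per feature line, and it also reports whether the file contains a ##FASTA section, which Prokka appends to its .gff output. Whether spacedust actually stumbles over that section is an unverified assumption. The function name inspect_gff is hypothetical.

```python
import sys


def inspect_gff(path, expected_columns=9):
    """Scan a GFF file and report lines that look malformed.

    Returns a dict with:
      short_lines        -- list of (line_number, column_count) for feature
                            lines with fewer than expected_columns columns
      fasta_section_line -- line number of a '##FASTA' directive, or None
    Assumption: features use the standard 9-column, tab-separated GFF3 layout.
    """
    short_lines = []
    fasta_section_line = None
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if line.startswith("##FASTA"):
                # Everything after this directive is sequence data, not features.
                fasta_section_line = line_number
                break
            if not line or line.startswith("#"):
                continue  # skip blank lines, comments and directives
            column_count = len(line.split("\t"))
            if column_count < expected_columns:
                short_lines.append((line_number, column_count))
    return {"short_lines": short_lines, "fasta_section_line": fasta_section_line}


if __name__ == "__main__":
    for gff_path in sys.argv[1:]:
        report = inspect_gff(gff_path)
        print(gff_path, report)
```

Saved under a name of your choice (for example check_gff.py, a hypothetical filename), it can be run as python check_gff.py *.gff. Any entry in short_lines, or a ##FASTA section in the file, would be a reasonable first thing to look at.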

Your Environment

I ran the same commands separately on Ubuntu and CentOS. The example_data runs successfully, but my own data fails.

spacedust Output (for bugs)

The output of the command ./spacedust clustersearch setDB setDB result.tsv tmpFolder:

clustersearch setDB setDB result.tsv tmpFolder

MMseqs Version:                         16b020301be952232d6eb2eaa2cd2ad0933d68b0
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           true
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       10
Seq. id. threshold                      0
Min alignment length                    30
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0.8
Coverage mode                           2
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 256
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa                           
Spaced k-mers                           1
Spaced k-mer pattern                    
Local temporary path                    
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner                              
Force restart with latest tmp           false
Remove temporary files                  false
Use simple best hit                     true
Include sub-optimal hits with factor    0
Alpha                                   1
Aggregation mode                        0
Filter self match                       false
Multihit P-value cutoff                 0.01
Clustering and Ordering P-value cutoff  0.01
Maximum gene gaps                       3
Minimal cluster size                    2
Cluster weighting factor                false
Database output                         true
Cluster search against profiles         false
Cluster Search Mode                     0

Create directory tmpFolder/3152204347500479419/search
search setDB setDB tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/search --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3 --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --spaced-kmer-mode 1 --rescore-mode 0 --filter-hits 0 --sort-results 0 --mask-profile 1 --e-profile 0.001 --wg 0 --allow-deletion 0 --filter-msa 1 --filter-min-enable 0 --max-seq-id 0.9 --qid '0.0' --qsc -20 --cov 0 --diff 1000 --pseudo-cnt-mode 0 --gap-pc 10 --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 0 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --add-orf-stop 0 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --chain-alignments 0 --merge-query 1 --search-type 0 --start-sens 4 --sens-steps 1 --exhaustive-search 0 --exhaustive-search-filter 0 --strand 1 --lca-search 0 --disk-space-limit 0 --force-reuse 0 --remove-tmp-files 0

prefilter setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 256 --compressed 0 -v 3 -s 5.7

Query database size: 12719 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 12719 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 12.72K 0s 65ms
Index table: Masked residues: 15234
Index table: fill
[=================================================================] 100.00% 12.72K 0s 39ms
Index statistics
Entries:          3785086
DB size:          509 MB
Avg k-mer size:   0.059142
Top 10 k-mers
    GPGGTL  64
    GQQVAR  39
    SQQSER  30
    GLGNGK  24
    SGGSLR  24
    QLGQRV  24
    LPDEFY  23
    GQQIAR  21
    GEQVAR  21
    LGNAST  20
Time for index table init: 0h 0m 0s 583ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12719
Target db start 1 to 12719
[=================================================================] 100.00% 12.72K 3s 22ms

301.207794 k-mers per position
6149 DB matches per sequence
0 overflows
55 sequences passed prefiltering per query sequence
45 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 14ms
Time for processing: 0h 0m 4s 194ms
align setDB setDB tmpFolder/3152204347500479419/search/2069484046060416119/pref_0 tmpFolder/3152204347500479419/result --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 1 --alignment-mode 2 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 30 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 2 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 256 --compressed 0 -v 3

Compute score, coverage and sequence identity
Query database size: 12719 type: Aminoacid
Target database size: 12719 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 12.72K 0s 547ms
Time for merging to result: 0h 0m 0s 15ms
459801 alignments calculated
78951 sequence pairs passed the thresholds (0.171707 of overall calculated)
6.207328 hits per query sequence
Time for processing: 0h 0m 0s 775ms
prefixid tmpFolder/3152204347500479419/result tmpFolder/3152204347500479419/result_prefixed --threads 256 -v 3

[=================================================================] 100.00% 12.72K 0s 62ms
Time for merging to result_prefixed: 0h 0m 0s 9ms
Time for processing: 0h 0m 0s 264ms
besthitbyset setDB setDB tmpFolder/3152204347500479419/result_prefixed tmpFolder/3152204347500479419/aggregate --simple-best-hit 1 --suboptimal-hits 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 12.72K 0s 81ms
Time for merging to aggregate: 0h 0m 0s 11ms
Time for processing: 0h 0m 0s 316ms
mergeresultsbyset setDB_set_to_member tmpFolder/3152204347500479419/aggregate tmpFolder/3152204347500479419/aggregate_merged --threads 256 -v 3

Time for merging to aggregate_merged: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 254ms
combinehits setDB setDB tmpFolder/3152204347500479419/aggregate_merged tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419 --alpha 1 --aggregation-mode 0 --filter-self-match 0 --threads 256 --compressed 0 -v 3

[=================================================================] 100.00% 3 0s 53ms
Time for merging to matches_h: 0h 0m 0s 9ms
Time for merging to matches: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 407ms
clusterhits setDB setDB tmpFolder/3152204347500479419/matches tmpFolder/3152204347500479419/clusters --multihit-pval 0.01 --cluster-pval 0.01 --max-gene-gap 3 --cluster-size 2 --cluster-use-weight 0 --db-output 1 --alpha 1 --threads 256 --compressed 0 -v 3

Invalid query lookup record                                       ] 0.00% 1 eta -
Error: clusterhits failed

Keepingle commented 8 months ago

I have run into the same problem.