soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

no format options in easy-search #139

Closed: gaboentropy closed this issue 5 years ago

gaboentropy commented 5 years ago

--format-mode 2
--format-output query,target

Expected Behavior

Both options should produce output in a tabular format.
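For illustration, here is a minimal sketch of how such a table could be consumed once it is produced. It assumes the result file is tab-separated with the ten columns from the reproduction command in this report, in that order; `filter_hits` and the 0.8 cutoff are hypothetical and not part of MMseqs2.

```shell
# Assumed column order: query target evalue bits qstart qend qlen tstart tend tlen
# Hypothetical helper: keep hits whose alignment covers at least the given
# fraction of the query, and print query, target, evalue and that coverage.
filter_hits() {
    awk -F'\t' -v OFS='\t' -v min_cov="$1" '
        { cov = ($6 - $5 + 1) / $7 }            # aligned query span / query length
        cov >= min_cov { print $1, $2, $3, cov }
    '
}

# Usage (file name taken from the report):
#   filter_hits 0.8 < GCF_000005845.pfam-a.mmseqs
```

The helper reads from standard input, so it can be tried on a hand-written line before pointing it at a real result file.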

Current Behavior

Neither works. The first dies with:

Unrecognized parameter --format-mode
Did you mean "--cov-mode"?
Error: Search died

The second dies with:

Unrecognized parameter --format-output
Did you mean "--max-accept"?
Error: Search died
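One way to catch this kind of failure before a long run is to check whether the installed build lists the flag in its help text. This is a minimal sketch, assuming `mmseqs easy-search -h` prints the option list; `supports_flag` is a hypothetical helper, not part of MMseqs2.

```shell
# Hypothetical helper: succeeds if the help text on stdin mentions the flag.
supports_flag() {
    # -F matches the flag literally; -e stops a leading "--" from being
    # parsed as a grep option.
    grep -qF -e "$1"
}

# Usage (assumes -h prints the option list of the workflow):
#   mmseqs easy-search -h 2>&1 | supports_flag --format-output \
#       || echo "this build does not list --format-output"
```

The match is a plain substring test, so it only shows that the flag is mentioned somewhere in the help output.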

Steps to Reproduce (for bugs)

mkdir /tmp/testMMS
mmseqs easy-search GCF_000005845.faa.gz Pfam-A GCF_000005845.pfam-a.mmseqs /tmp/testMMS --comp-bias-corr 0 --alt-ali 5 --threads 1 --format-output query,target,evalue,bits,qstart,qend,qlen,tstart,tend,tlen

MMseqs Output (for bugs)

Program call: easy-search GCF_000005845.faa.gz Pfam-A GCF_000005845.pfam-a.mmseqs /tmp/testMMS --comp-bias-corr 0 --alt-ali 5 --threads 1 --format-output query,target,evalue,bits,qstart,qend,qlen,tstart,tend,tlen

MMseqs Version: 4e23d5f1d13a435c7b6c9406137ed68ce297e0fc
Sub Matrix blosum62.out
Add backtrace false
Alignment mode 3
E-value threshold 0.001
Seq. Id Threshold 0
Seq. Id. Mode 0
Alternative alignments 5
Coverage threshold 0
Coverage Mode 0
Max. sequence length 65535
Max. results per query 300
Compositional bias 0
Realign hit false
Max Reject 2147483647
Max Accept 2147483647
Include identical Seq. Id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Gap open cost 11
Gap extension cost 1
Threads 1
Verbosity 3
Sensitivity 5.7
K-mer size 0
K-score 2147483647
Alphabet size 21
Offset result 0
Split DB 0
Split mode 2
Split Memory Limit 0
Diagonal Scoring 1
Exact k-mer matching 0
Mask Residues 1
Minimum Diagonal score 15
Spaced Kmer 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq.id. and coverage false
Sort results 0
In substitution scoring mode, performs global alignment along the diagonal false
Mask profile 1
Profile e-value threshold 0.001
Use global sequence weighting false
Filter MSA 1
Maximum sequence identity threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select n most diverse seqs 1000
Omit Consensus false
Min codons in orf 1
Max codons in length 2147483647
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 0
Forward Frames 1,2,3
Reverse Frames 1,2,3
Translation Table 1
Use all table starts false
Offset of numeric ids 0
Add Orf Stop false
Number search iterations 1
Start sensitivity 4
Search steps 1
Run a seq-profile search in slice mode false
Strand selection 1
Disk space limit 0
Sets the MPI runner
Remove Temporary Files true
Alignment Format 0
Format alignment output query,target,evalue,bits,qstart,qend,qlen,tstart,tend,tlen
Database Output false
Overlap 0
Split Seq. by len true
Do not shuffle input database true
Greedy best hits false

Program call: createdb GCF_000005845.faa.gz /tmp/testMMS/1537235642484915501/query --max-seq-len 65535 --dont-split-seq-by-len 1 --dont-shuffle 1 --id-offset 0 -v 3

MMseqs Version: 4e23d5f1d13a435c7b6c9406137ed68ce297e0fc
Max. sequence length 65535
Split Seq. by len true
Do not shuffle input database true
Offset of numeric ids 0
Verbosity 3

Time for merging files: 0h 0m 0s 0ms
Time for merging files: 0h 0m 0s 0ms
Touch data file /tmp/testMMS/1537235642484915501/query ... Done.
Time for merging files: 0h 0m 0s 0ms
Touch data file /tmp/testMMS/1537235642484915501/query_h ... Done.
Time for merging files: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 38ms

mmseqs search: Searches with the sequences or profiles query DB through the target sequence DB by running the prefilter tool and the align tool for Smith-Waterman alignment. For each query a results file with sequence matches is written as entry into a database of search results (alignmentDB). In iterative profile search mode, the detected sequences satisfying user-specified criteria are aligned to the query MSA, and the resulting query profile is used for the next search iteration. Iterative profile searches are usually much more sensitive than (and at least as sensitive as) searches with single query sequences.

Please cite: Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

© Martin Steinegger martin.steinegger@mpibpc.mpg.de

Usage: [options]

prefilter options default description [value range]
--comp-bias-corr 0 correct for locally biased amino acid composition [0,1]
--add-self-matches false artificially add entries of queries with themselves (for clustering)
-s 5.700 sensitivity: 1.0 faster; 4.0 fast default; 7.5 sensitive [1.0,7.5]
-k 0 k-mer size in the range (0: set automatically to optimum)
--k-score 2147483647 k-mer threshold for generating similar-k-mer lists
--alph-size 21 alphabet size [2,21]
--offset-result 0 Offset result list
--split 0 Splits input sets into N equally distributed chunks. The default value sets the best split automatically. createindex can only be used with split 1.
--split-mode 2 0: split target db; 1: split query db; 2: auto, depending on main memory
--split-memory-limit 0 Maximum system memory in megabyte that one split may use. Defaults (0) to all available system memory.
--diag-score 1 use diagonal score for sorting the prefilter results [0,1]
--exact-kmer-matching 0 only exact k-mer matching [0,1]
--mask 1 0: w/o low complexity masking, 1: with low complexity masking
--min-ungapped-score 15 accept only matches with ungapped alignment score above this threshold
--spaced-kmer-mode 1 0: use consecutive positions a k-mers; 1: use spaced k-mers
--spaced-kmer-pattern User-specified spaced k-mer pattern
--local-tmp Path where some of the temporary files will be created
--disk-space-limit 0 Set the maximum disk space (in Mb) to use for reverse profile searches. Defaults (0) to all available disk space in the temp folder.

align options default description [value range]
-a false add backtrace string (convert to alignments with mmseqs convertalis utility)
--alignment-mode 2 How to compute the alignment: 0: automatic; 1: only score and end_pos; 2: also start_pos and cov; 3: also seq.id; 4: only ungapped alignment
-e 0.001 list matches below this E-value [0.0, inf]
--min-seq-id 0.000 list matches above this sequence identity (for clustering) [0.0,1.0]
--seq-id-mode 0 0: alignment length 1: shorter, 2: longer sequence
--alt-ali 5 Show up to this many alternative alignments
-c 0.000 list matches above this fraction of aligned (covered) residues (see --cov-mode)
--cov-mode 0 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length, 4: query seq. length needs be at least x% of target length
--realign false compute more conservative, shorter alignments (scores and E-values not changed)
--max-rejected 2147483647 maximum rejected alignments before alignment calculation for a query is aborted
--max-accept 2147483647 maximum accepted alignments before alignment calculation for a query is stopped
--score-bias 0.000 Score bias when computing the SW alignment (in bits)
--gap-open 11 Gap open cost
--gap-extend 1 Gap extension cost

profile options default description [value range]
--pca 1.000 pseudo count admixture strength
--pcb 1.500 pseudo counts: Neff at half of maximum admixture (0.0,infinity)
--mask-profile 1 mask query sequence of profile using tantan [0,1]
--e-profile 0.100 includes sequences matches with < e-value thr. into the profile [>=0.0]
--wg false use global sequence weighting for profile calculation
--filter-msa 1 filter msa: 0: do not filter, 1: filter
--max-seq-id 0.900 reduce redundancy of output MSA using max. pairwise sequence identity [0.0,1.0]
--qid 0.000 reduce diversity of output MSAs using min.seq. identity with query sequences [0.0,1.0]
--qsc -20.000 reduce diversity of output MSAs using min. score per aligned residue with query sequences [-50.0,100.0]
--cov 0.000 filter output MSAs using min. fraction of query residues covered by matched sequences [0.0,1.0]
--diff 1000 filter MSAs by selecting most diverse set of sequences, keeping at least this many seqs in each MSA block of length 50
--num-iterations 1 Search iterations

misc options default description [value range]
--db-load-mode 0 Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch
--rescore-mode 0 Rescore diagonal with: 0: Hamming distance, 1: local alignment (score only) or 2: local alignment
--min-length 30 minimum codon number in open reading frames
--max-length 32734 maximum codon number in open reading frames
--max-gaps 2147483647 maximum number of codons with gaps or unknown residues before an open reading frame is rejected
--contig-start-mode 2 Contig start can be 0: incomplete, 1: complete, 2: both
--contig-end-mode 2 Contig end can be 0: incomplete, 1: complete, 2: both
--orf-start-mode 1 Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle)
--forward-frames 1,2,3 comma-seperated list of ORF frames on the forward strand to be extracted
--reverse-frames 1,2,3 comma-seperated list of ORF frames on the reverse strand to be extracted
--translation-table 1 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE, 9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL, 15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL, 23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA, 29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA
--use-all-table-starts false use all alteratives for a start codon in the genetic table, if false - only ATG (AUG)
--id-offset 0 numeric ids in index file are offset by this value
--add-orf-stop false add * at complete start and end
--start-sens 4.000 start sensitivity
--sens-steps 1 Search steps performed from --start-sense and -s.

common options default description [value range]
--sub-mat blosum62.out amino acid substitution matrix file
--max-seq-len 65535 Maximum sequence length [1,32768]
--max-seqs 300 maximum result sequences per query (this parameter affects the sensitivity)
--threads 1 number of cores used for the computation (uses all cores by default)
-v 3 verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info

Unrecognized parameter --format-output
Did you mean "--max-accept"?
Error: Search died

Context

I am trying to run a search against Pfam-A. The same command was working a couple of months ago.

Your Environment

mmseqs, compiled by myself on macOS Mojave.

martin-steinegger commented 5 years ago

@gaboentropy thanks for reporting the issue. It should be fixed now.

gaboentropy commented 5 years ago

It works all right now. Thanks.

martin-steinegger commented 5 years ago

Could you please send me the commands? Did you want to turn off comp-bias correction?

gaboentropy commented 5 years ago

Sorry. My mistake.

Now I see that the database index has to be built according to the kind of correction that's going to be used with easy-search. That wasn't the case before. I was building the Pfam database as suggested in your wiki, which defaults to --comp-bias-corr 1, but running easy-search with --comp-bias-corr 0. Since it was working all right before, I had not noticed that createindex was making a bias correction too.

gaboentropy commented 5 years ago

The database commands were:

mmseqs msa2profile Pfam-A.msa Pfam-A --match-mode 1
mmseqs createindex Pfam-A $tempFolder -k 5 -s 7

The easy-search was:

mmseqs easy-search inputFile.faa Pfam-A resultfile.mmseqs $tempFolder --comp-bias-corr 0 --alt-ali 5 --format-output [my format]

That used to work. Now it fails, telling me to rebuild the index with --comp-bias-corr 0. I did that and it works. Now I know I have to stick to either correction or no correction, starting from the database index.
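The takeaway can be sketched as a small wrapper that reads the correction setting from one variable and passes it to both steps. This is a dry run that only prints the commands. `COMP_BIAS`, `index_cmd` and `search_cmd` are hypothetical names, the file names are the placeholders used in this thread, and the flags are the ones quoted in this thread. The `--format-output` argument is left out because its value was not given.

```shell
# Single place to choose compositional bias correction (0 = off, 1 = on).
COMP_BIAS=0

# Print the createindex command with the shared setting (nothing is executed).
# Arguments: target DB, temp folder.
index_cmd() {
    printf 'mmseqs createindex %s %s -k 5 -s 7 --comp-bias-corr %s\n' \
        "$1" "$2" "$COMP_BIAS"
}

# Print the matching easy-search command.
# Arguments: query file, target DB, result file, temp folder.
search_cmd() {
    printf 'mmseqs easy-search %s %s %s %s --comp-bias-corr %s --alt-ali 5\n' \
        "$1" "$2" "$3" "$4" "$COMP_BIAS"
}

index_cmd Pfam-A tmp
search_cmd inputFile.faa Pfam-A resultfile.mmseqs tmp
```

If `COMP_BIAS` is changed later, the index has to be rebuilt before searching again, which is the situation described in this thread.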