Seqtool is a fast and flexible command line program for dealing with
large amounts of biological sequences.
It provides different subcommands for converting, inspecting
and modifying sequences.
The standalone binary (~6 MB) is simply named st
to save some typing.
Note: this page describes the development version 0.4-beta.
The older stable version (v0.3.0) is documented here.
Detailed documentation
See this page
Downloads
📥 download stable release (v0.3.0)
âš Note: there are a few unfixed bugs in v0.3.0 (currently)
when reading GZIP files or searching/replacing;
see CHANGELOG for v0.4.0-beta.
Alternatively, consider using v0.4.0-beta.
📥 download beta release (v0.4.0-beta)
Should be pretty safe to use despite considerable refactoring.
Approximate matching (https://markschl.github.io/seqtool-docs/[find](find) command) is yet to be fully tested.
Feature overview
File formats
Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed
with GZIP, BZIP2,
or the faster and more modern Zstandard or LZ4
formats
Example: compressed FASTQ to FASTA
Combine multiple compressed FASTQ files, converting them to FASTA, using [pass](https://markschl.github.io/seqtool-docs/pass).
```bash
st pass file1.fastq.gz file2.fastq.gz -o output.fasta
```
> **Note**: almost every command can read multiple input files and convert between formats,
> but *pass* does nothing other than reading and writing while other command perform certain actions.
Example: FASTA to tab-separated list
Aside from ID and sequence, any [variable/function](https://markschl.github.io/seqtool-docs/variables) such as
the sequence length (`seqlen`) can be written to delimited text.
```bash
st pass input.fasta --to-tsv id,seq,seqlen
```
```
id1 ACG 3
id1 ACGTACGT 7
id1 ACGTA 5
```
Highly versatile thanks to variables/functions
See also variables/functions for more details.
Example: count sequences in a large set of FASTQ files
```bash
st count -k path data/*.fastq.gz
```
```
data/sample1.fastq.gz 30601
data/sample2.fastq.gz 15702
data/sample3.fastq.gz 264965
data/sample4.fastq.gz 1120
data/sample5.fastq.gz 7021
(...)
```
> In [count](https://markschl.github.io/seqtool-docs/count), one or several categorical [variables/functions](https://markschl.github.io/seqtool-docs/variables)
> can be specified with `-k/--key`.
Example: summarize the GC content in 10% intervals
The function `bin(variable, interval)` groups continuous numeric values
into intervals
```bash
st count -k 'bin(gc_percent, 10)' sequences.fasta
```
```
(10, 20] 57
(20, 30] 2113
(30, 40] 11076
(40, 50] 7184
(50, 60] 12
```
Example: Assign new sequence IDs
```bash
st set -i 'seq_{num}' seqs.fasta > renamed.fasta
```
```
>seq_1
SEQUENCE
>seq_2
SEQUENCE
>seq_3
SEQUENCE
(...)
```
Example: De-replicate by description and sequence
`seqs.fasta` with a 'group' annotation in the header:
```
>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id3 group1
SEQUENCE2
>id4 group2
SEQUENCE1
>id5 group2
SEQUENCE1
```
```bash
st unique 'desc,seq' seqs.fasta > grouped_uniques.fasta
```
```
>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id4 group2
SEQUENCE1
```
Expressions
From simple math to complicated filter expressions, the tiny integrated JavaScript engine
(QuickJS) offers countless possibilities for customized
sequence processing.
Example: filter FASTQ sequences by quality and length
This [filter](https://markschl.github.io/seqtool-docs/filter) command removes sequencing reads with more than one expected
sequencing error (like [USEARCH](https://www.drive5.com/usearch/manual/exp_errs.html) can do)
or sequence length of <100 bp.
```bash
st filter 'exp_err < 1 && seqlen >= 100' reads.fastq > filtered.fastq
```
Header attributes for metadata storage
key=value
header attributes allow storing and passing on
all kinds of information
Example: De-replicate by sequence (seq variable) and/or other properties
The [unique](https://markschl.github.io/seqtool-docs/unique) command returns all unique sequences and annotates
the number of records with the same sequence in the header:
```bash
st unique seq -a abund={n_duplicates} input.fasta > uniques.fasta
```
```
>id1 abund=3
TCTTTAATAACCTGATTAG
>id3 abund=1
GGAGGATCCGAGCG
(...)
```
It is also possible to de-replicate by multiple keys, e.g. by sequence,
but grouped by a `sample` attribute in the header:
```bash
st unique 'seq,attr(sample)' input.fasta > uniques.fasta
```
```
>id1 sample=1
SEQUENCE1
>id3 sample=2
SEQUENCE2
>id10 sample=1
SEQUENCE3
>id11 sample=3
SEQUENCE4
(...)
```
Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon)
These steps could be part of an amplicon pipeline that de-multiplexes
multi-marker amplicons.
[find](https://markschl.github.io/seqtool-docs/find) searches for a set of primers, which are removed by [trim](https://markschl.github.io/seqtool-docs/trim),
and finally [split](https://markschl.github.io/seqtool-docs/split) distributes the sequences into different files named
by the forward primer.
**primers.fasta**
```
>prA
PRIMER
>prB
PRIMER
```
**Command for searching/trimming**
```bash
st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
st trim -e '{attr(end)}..' |
st split -o '{attr(primer)}'
```
prA.fasta | prB.fasta | undefined.fasta |
```
>id1 primer=prA end=22
SEQUENCE
>id4 primer=prA end=21
SEQUENCE
(...)
```
|
```
>id2 primer=prB end=20
SEQUENCE
>id3 primer=prB end=22
SEQUENCE
(...)
```
|
```
>id5 primer=undefined end=undefined
UNTRIMMEDSEQUENCE
(...)
```
*Note:* no primer, sequence **not** trimmed since `end=undefined` (https://markschl.github.io/seqtool-docs/see [ranges](ranges)).
|
Integration of external metadata
Integration of sequence metadata sources in the form of delimited text
Example: Add Genus names from a separate tab-separated list
input.fasta | genus.tsv |
```
>id1
SEQUENCE
>id2
SEQUENCE
(...)
```
|
```
id genus
seq1 Actinomyces
seq2 Amycolatopsis
(...)
```
|
Using `-m/--meta` to include `genus.tsv` as metadata source:
```bash
st set -m genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta
```
with_genus.fasta |
```
>seq1 Actinomyces
SEQUENCE
>seq2 Amycolatopsis
SEQUENCE
(...)
```
|
Example: Choose specific sequences given a separate file with an ID list
input.fasta | id_list.txt |
```
>id1
SEQUENCE
>id2
SEQUENCE
>id3
SEQUENCE
>id4
SEQUENCE
```
|
```
id1
id4
```
|
```bash
st filter -m id_list.txt 'has_meta()' input.fasta > subset.fasta
```
subset.fasta |
```
>id1
SEQUENCE
>id4
SEQUENCE
```
|
Commands
Basic conversion/editing
- pass: Directly pass input to output without any processing, useful for converting and
attribute setting
Information about sequences
- view: View biological sequences, colored by base / amino acid, or by sequence quality
- count: Count all records in the input (total or categorized by variables/functions)
- stat: Return per-sequence statistics as tab delimited list
Subset/shuffle
- sort: Sort records by sequence or any other criterion
- unique: De-replicate by sequence and/or other properties, returning only unique records
- filter: Keep/exclude sequences based on different properties with a mathematical
(JavaScript) expression
- split: Distribute sequences into multiple files based on a variable/function or
advanced expression
- sample: Get a random subset of sequences; either a fixed number or an approximate
fraction of the input
- slice: Return a range of sequence records from the input
- head: Return the first N sequences
- tail: Return the last N sequences
- interleave: Interleave records of all files in the input
Search and replace
- find: Search for pattern(s) in sequences or sequene headers for record filtering,
pattern replacement or passing hits to next command
- replace: Fast and simple pattern replacement in sequences or headers
Modifying commands
- del: Delete header ID/description and/or attributes
- set: Replace the header, header attributes or sequence with new content
- trim: Trim sequences on the left and/or right (single range) or extract and
concatenate several ranges
- mask: Soft or hard mask sequence ranges
- upper: Convert sequences to uppercase
- lower: Convert sequences to lowercase
- revcomp: Reverse complements DNA or RNA sequences
- concat: Concatenates sequences/alignments from different files
Comparison with other tools
There are other tools with a similar focus such as Seqtk,
SeqKit, the FASTX-Toolkit,
as well as the more specialized USEARCH and
VSEARCH offering some of the functions
as well.
Seqtool performs well compared to these tools on a selection of diverse tasks:
Comparison of tools