markschl / seqtool

Fast and flexible tool for reading, modifying and writing biological sequences
MIT License
17 stars 1 forks source link

Seqtool is a fast and flexible command line program for dealing with large amounts of biological sequences. It provides different subcommands for converting, inspecting and modifying sequences. The standalone binary (~6 MB) is simply named st to save some typing.

CI

Note: this page describes the development version 0.4-beta. The older stable version (v0.3.0) is documented here.

Detailed documentation

See this page

Downloads

📥 download stable release (v0.3.0)

âš  Note: there are a few unfixed bugs in v0.3.0 (currently) when reading GZIP files or searching/replacing; see CHANGELOG for v0.4.0-beta. Alternatively, consider using v0.4.0-beta.

📥 download beta release (v0.4.0-beta)

Should be pretty safe to use despite considerable refactoring. Approximate matching (https://markschl.github.io/seqtool-docs/[find](find) command) is yet to be fully tested.

Feature overview

File formats

Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed with GZIP, BZIP2, or the faster and more modern Zstandard or LZ4 formats

Example: compressed FASTQ to FASTA Combine multiple compressed FASTQ files, converting them to FASTA, using [pass](https://markschl.github.io/seqtool-docs/pass). ```bash st pass file1.fastq.gz file2.fastq.gz -o output.fasta ``` > **Note**: almost every command can read multiple input files and convert between formats, > but *pass* does nothing other than reading and writing while other command perform certain actions.
Example: FASTA to tab-separated list Aside from ID and sequence, any [variable/function](https://markschl.github.io/seqtool-docs/variables) such as the sequence length (`seqlen`) can be written to delimited text. ```bash st pass input.fasta --to-tsv id,seq,seqlen ``` ``` id1 ACG 3 id1 ACGTACGT 7 id1 ACGTA 5 ```

Highly versatile thanks to variables/functions

See also variables/functions for more details.

Example: count sequences in a large set of FASTQ files ```bash st count -k path data/*.fastq.gz ``` ``` data/sample1.fastq.gz 30601 data/sample2.fastq.gz 15702 data/sample3.fastq.gz 264965 data/sample4.fastq.gz 1120 data/sample5.fastq.gz 7021 (...) ``` > In [count](https://markschl.github.io/seqtool-docs/count), one or several categorical [variables/functions](https://markschl.github.io/seqtool-docs/variables) > can be specified with `-k/--key`.
Example: summarize the GC content in 10% intervals The function `bin(variable, interval)` groups continuous numeric values into intervals ```bash st count -k 'bin(gc_percent, 10)' sequences.fasta ``` ``` (10, 20] 57 (20, 30] 2113 (30, 40] 11076 (40, 50] 7184 (50, 60] 12 ```
Example: Assign new sequence IDs ```bash st set -i 'seq_{num}' seqs.fasta > renamed.fasta ``` ``` >seq_1 SEQUENCE >seq_2 SEQUENCE >seq_3 SEQUENCE (...) ```
Example: De-replicate by description and sequence `seqs.fasta` with a 'group' annotation in the header: ``` >id1 group1 SEQUENCE1 >id2 group1 SEQUENCE2 >id3 group1 SEQUENCE2 >id4 group2 SEQUENCE1 >id5 group2 SEQUENCE1 ``` ```bash st unique 'desc,seq' seqs.fasta > grouped_uniques.fasta ``` ``` >id1 group1 SEQUENCE1 >id2 group1 SEQUENCE2 >id4 group2 SEQUENCE1 ```

Expressions

From simple math to complicated filter expressions, the tiny integrated JavaScript engine (QuickJS) offers countless possibilities for customized sequence processing.

Example: filter FASTQ sequences by quality and length This [filter](https://markschl.github.io/seqtool-docs/filter) command removes sequencing reads with more than one expected sequencing error (like [USEARCH](https://www.drive5.com/usearch/manual/exp_errs.html) can do) or sequence length of <100 bp. ```bash st filter 'exp_err < 1 && seqlen >= 100' reads.fastq > filtered.fastq ```

Header attributes for metadata storage

key=value header attributes allow storing and passing on all kinds of information

Example: De-replicate by sequence (seq variable) and/or other properties The [unique](https://markschl.github.io/seqtool-docs/unique) command returns all unique sequences and annotates the number of records with the same sequence in the header: ```bash st unique seq -a abund={n_duplicates} input.fasta > uniques.fasta ``` ``` >id1 abund=3 TCTTTAATAACCTGATTAG >id3 abund=1 GGAGGATCCGAGCG (...) ``` It is also possible to de-replicate by multiple keys, e.g. by sequence, but grouped by a `sample` attribute in the header: ```bash st unique 'seq,attr(sample)' input.fasta > uniques.fasta ``` ``` >id1 sample=1 SEQUENCE1 >id3 sample=2 SEQUENCE2 >id10 sample=1 SEQUENCE3 >id11 sample=3 SEQUENCE4 (...) ```
Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon) These steps could be part of an amplicon pipeline that de-multiplexes multi-marker amplicons. [find](https://markschl.github.io/seqtool-docs/find) searches for a set of primers, which are removed by [trim](https://markschl.github.io/seqtool-docs/trim), and finally [split](https://markschl.github.io/seqtool-docs/split) distributes the sequences into different files named by the forward primer. **primers.fasta** ``` >prA PRIMER >prB PRIMER ``` **Command for searching/trimming** ```bash st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta | st trim -e '{attr(end)}..' | st split -o '{attr(primer)}' ```
prA.fasta prB.fastaundefined.fasta
``` >id1 primer=prA end=22 SEQUENCE >id4 primer=prA end=21 SEQUENCE (...) ``` ``` >id2 primer=prB end=20 SEQUENCE >id3 primer=prB end=22 SEQUENCE (...) ``` ``` >id5 primer=undefined end=undefined UNTRIMMEDSEQUENCE (...) ``` *Note:* no primer, sequence **not** trimmed since `end=undefined` (https://markschl.github.io/seqtool-docs/see [ranges](ranges)).

Integration of external metadata

Integration of sequence metadata sources in the form of delimited text

Example: Add Genus names from a separate tab-separated list
input.fastagenus.tsv
``` >id1 SEQUENCE >id2 SEQUENCE (...) ``` ``` id genus seq1 Actinomyces seq2 Amycolatopsis (...) ```
Using `-m/--meta` to include `genus.tsv` as metadata source: ```bash st set -m genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta ```
with_genus.fasta
``` >seq1 Actinomyces SEQUENCE >seq2 Amycolatopsis SEQUENCE (...) ```
Example: Choose specific sequences given a separate file with an ID list
input.fastaid_list.txt
``` >id1 SEQUENCE >id2 SEQUENCE >id3 SEQUENCE >id4 SEQUENCE ``` ``` id1 id4 ```
```bash st filter -m id_list.txt 'has_meta()' input.fasta > subset.fasta ```
subset.fasta
``` >id1 SEQUENCE >id4 SEQUENCE ```

Commands

Basic conversion/editing

Information about sequences

Subset/shuffle

Search and replace

Modifying commands

Comparison with other tools

There are other tools with a similar focus such as Seqtk, SeqKit, the FASTX-Toolkit, as well as the more specialized USEARCH and VSEARCH offering some of the functions as well.

Seqtool performs well compared to these tools on a selection of diverse tasks:

Comparison of tools