markschl / seq_io

FASTA and FASTQ parsing in Rust
MIT License

FASTA and FASTQ parsing and writing in Rust.

This library is an(other) attempt at parsing and writing the sequence formats FASTA and FASTQ.

Features:

The FASTA parser can read and write multi-line files and allows iterating over the sequence lines without any allocation or copying. The FASTQ parser does not support records whose sequence or quality spans multiple lines.
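To illustrate what iterating over sequence lines without allocation or copying means, here is a minimal standard-library sketch. It is not the seq_io implementation: the helpers `sequence_lines` and `sequence_len` are hypothetical names, and the sketch assumes the whole record is already in memory and starts with a `>` header line.

```rust
/// Hypothetical helper (not part of seq_io): yields the sequence lines of a
/// single FASTA record as slices borrowed from `record`, without allocating.
/// Assumes `record` contains exactly one record starting with a '>' header.
fn sequence_lines<'a>(record: &'a [u8]) -> impl Iterator<Item = &'a [u8]> + 'a {
    record
        .split(|&b| b == b'\n')
        .skip(1) // skip the header line
        .map(|line| line.strip_suffix(b"\r").unwrap_or(line)) // tolerate CRLF
        .filter(|line| !line.is_empty())
}

/// Sums the lengths of the borrowed lines, so the full sequence is never
/// assembled into a separate buffer.
fn sequence_len(record: &[u8]) -> usize {
    sequence_lines(record).map(|line| line.len()).sum()
}
```

Each yielded slice points into the original buffer, so computing a length or scanning for a motif does not need a contiguous copy of the sequence.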

Documentation

Documentation for the stable version (0.3.x)

The v0.4 branch contains code for a new version, which includes a FASTX reader. Although it works and has been tested to some extent, there will be further large changes, which are not quite ready yet.

Documentation for development version (0.4.0-alpha.x)

Example

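The following example reads FASTA records from STDIN and writes each one to STDOUT, wrapped at 80 characters per line, if its sequence is longer than 100 bases. Otherwise it prints a message to STDERR. This should be very fast because seq_lines() iterates over the sequence lines without allocating the sequence.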

use seq_io::fasta::{Reader,Record};
use std::io;

let mut reader = Reader::new(io::stdin());
let mut stdout = io::stdout();

while let Some(result) = reader.next() {
    let record = result.unwrap();
    // determine sequence length
    let seqlen = record.seq_lines()
                       .fold(0, |l, seq| l + seq.len());
    if seqlen > 100 {
        record.write_wrap(&mut stdout, 80).unwrap();
    } else {
        eprintln!("{} is only {} long", record.id().unwrap(), seqlen);
    }
}

Records borrow their data directly from the internal buffered reader, so they cannot be produced by a regular Iterator; this is why the while let loop is required. By default, the buffer grows automatically if a record is too large to fit. How it grows can be configured, and a size limit can be set. Iterators over owned records are also provided.
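The reason for the while let loop can be shown with a small standard-library sketch. The `LineReader` type below is a hypothetical stand-in, not the seq_io `Reader`: its `next` method returns a slice that borrows from the reader's own buffer, a signature that the standard `Iterator` trait cannot express.

```rust
/// Hypothetical stand-in for a buffered parser (not the seq_io `Reader`):
/// every returned line borrows from the single internal buffer.
struct LineReader {
    buf: Vec<u8>,
    pos: usize,
}

impl LineReader {
    fn new(data: &[u8]) -> Self {
        LineReader { buf: data.to_vec(), pos: 0 }
    }

    /// The returned slice borrows `self`, so the next call cannot start
    /// until the slice is no longer used. `Iterator::next` cannot tie its
    /// item to the borrow of the iterator, hence no `for` loop.
    fn next(&mut self) -> Option<&[u8]> {
        if self.pos >= self.buf.len() {
            return None;
        }
        let start = self.pos;
        let len = self.buf[start..]
            .iter()
            .position(|&b| b == b'\n')
            .unwrap_or(self.buf.len() - start);
        self.pos = start + len + 1; // step past the line and its newline
        Some(&self.buf[start..start + len])
    }
}

/// Consumes all lines with `while let`, the same loop shape as in the example.
fn total_len(data: &[u8]) -> usize {
    let mut reader = LineReader::new(data);
    let mut total = 0;
    while let Some(line) = reader.next() {
        total += line.len();
    }
    total
}
```

An iterator over owned records avoids this restriction by copying each record out of the buffer, at the cost of one allocation per record.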

Note: Make sure to add lto = true to the release profile in Cargo.toml for full performance. Calls to functions of the underlying buffered reader (buffer_redux) are not inlined otherwise.
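The setting from the note corresponds to the following fragment of Cargo.toml. Only the `lto` line comes from the note; any other profile settings are left at their defaults here.

```toml
[profile.release]
lto = true
```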

Multi-threaded processing

The parallel module contains functions for sending FASTQ/FASTA records to a thread pool where expensive calculations are done. Sequences are processed in batches (RecordSet) because sending across channels has a performance impact. FASTA/FASTQ records can be accessed in both the 'worker' function and (after processing) a function running in the main thread.
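The idea of sending batches rather than single records can be sketched with the standard library alone. This is not the API of the parallel module: `gc_counts_batched` is a hypothetical function, it uses a single worker thread instead of a thread pool, and counting G and C bases stands in for the expensive calculation.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch (not the seq_io `parallel` API): groups sequences into
/// batches, sends each batch to one worker thread over a channel, and
/// collects the per-sequence results in the calling thread.
fn gc_counts_batched(seqs: Vec<Vec<u8>>, batch_size: usize) -> Vec<usize> {
    assert!(batch_size > 0, "batch_size must be positive");
    let (batch_tx, batch_rx) = mpsc::channel::<Vec<Vec<u8>>>();
    let (result_tx, result_rx) = mpsc::channel::<Vec<usize>>();

    // Worker: the "expensive" per-record work (here: counting G and C).
    let worker = thread::spawn(move || {
        for batch in batch_rx {
            let counts: Vec<usize> = batch
                .iter()
                .map(|s| s.iter().filter(|&&b| b == b'G' || b == b'C').count())
                .collect();
            result_tx.send(counts).unwrap();
        }
    });

    // Caller: one channel send per batch instead of one per record.
    for batch in seqs.chunks(batch_size) {
        batch_tx.send(batch.to_vec()).unwrap();
    }
    drop(batch_tx); // closing the channel lets the worker loop finish

    // With a single worker, results arrive in the order the batches were sent.
    let results: Vec<usize> = result_rx.iter().flatten().collect();
    worker.join().unwrap();
    results
}
```

A larger `batch_size` means fewer channel operations per record, which is the trade-off that motivates batching in the first place.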

Similar projects in Rust

Performance comparisons

The FASTQ reader from this crate performs similarly to the fastq-rs reader. The rust-bio readers are slower due to allocations, copying, and UTF-8 validity checks.

All comparisons were run on a set of 100,000 auto-generated, synthetic sequences with lengths normally distributed around 500 bp, loaded into memory. The parsers from this crate (seq_io) are compared with fastq-rs (fastq_rs) and Rust-Bio (bio). The bars represent the throughput in GB/s (+/- standard error of the mean). The benchmarks were run on a Thinkpad X1 Carbon (i7-5500U) with a fixed frequency of 2.3 GHz using Rust 1.31 nightly.

[Figure: benchmark results]

Explanation of labels: