qiita-spots / qiita-files

File formats defined for Qiita and Qiita plugins internal use
BSD 3-Clause "New" or "Revised" License

HDF5-based Fastq container #8

Open tanaes opened 7 years ago

tanaes commented 7 years ago

Problem:

Shotgun metagenomic analyses create a large number of partially redundant Fastq files that occupy a lot of disk space.

As an example, a typical workflow looks something like this, per sample:

So for a typical workflow, the initial data is likely to be ~triplicated on disk. Any additional comparison of parameter values, for example different trimming operations, would require additional duplication of data.

Compounding the problem, different programs use different 'flavors' of Fastq format.

Trimmomatic outputs (up to) four files per trimming run:

Other trimming programs will output interleaved forward/reverse pairs, either discarding pairs where one mate was filtered or inserting empty sequence.

Analysis programs also sometimes require either interleaved or non-interleaved format. Additionally, some programs accept gzipped input files and some do not. In a given analysis, it might be necessary to take a single trimmed, host-filtered Fastq file and reprocess it several different ways on disk to satisfy different downstream programs.
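To make the flavor problem concrete, here is a minimal pure-Python sketch (helper names are made up, and standard four-line FASTQ records are assumed) of converting separate forward/reverse record streams into the interleaved flavor some tools require:

```python
from itertools import chain

def parse_fastq(lines):
    """Yield (header, seq, qual) records from four-line FASTQ text."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # '+' separator line, discarded
        qual = next(it)
        yield header.rstrip(), seq.rstrip(), qual.rstrip()

def interleave(fwd_records, rev_records):
    """Alternate forward and reverse records (R1, R2, R1, R2, ...)."""
    return list(chain.from_iterable(zip(fwd_records, rev_records)))

fwd = list(parse_fastq(["@r1/1", "ACGT", "+", "IIII"]))
rev = list(parse_fastq(["@r1/2", "TTGG", "+", "IIII"]))
pairs = interleave(fwd, rev)
# pairs holds the forward read followed immediately by its mate
```

The reverse conversion (splitting an interleaved stream back into fwd/rev) is just slicing the record list with strides of two, which is why a container that stores the reads once and emits either flavor on demand is attractive.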

Proposed solution:

An HDF5-based Fastq container format that can read multiple flavors of Fastq sequence file, store them in a compressed but accessible structure, and efficiently write temporary Fastq sequence files of various flavors for use by downstream tools.

Desired characteristics

  1. Compressed data structure
  2. Simple, fast command-line methods to read and write various Fastq flavors
  3. Efficient read-name indexing for selective retrieval of reads (e.g. for filtering from a BAM alignment)
  4. Compact, additive representation of trimming or filtering steps
  5. Ability to scale up or down container to include multiple additional files (e.g. multiple lanes from a sample, or multiple samples in a run)

I'll address each of these briefly.

1. Compressed data structure

We should get this for free with HDF5.

2. Simple, fast methods for writing Fastq

Because this would be a purely intermediate format, we're trading the compute time needed to generate the appropriate Fastq flavor for storage. We'll likely have to write a complete Fastq file every time we want to interact with the object, so generating a new file should be as fast as possible.

3. Efficient read-name indexing

If the unique read names can be indexed in an accessible way, it will allow selective retrieval of data subsets in cases where you only want to look at a portion of the reads.
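As a rough sketch of what that buys us, here is a toy in-memory version of the name index — a stand-in for an indexed HDF5 dataset of read names; the names and the `fetch_rows` helper are made up:

```python
# Hypothetical sketch: a name -> row mapping standing in for an
# indexed read-name dataset inside the HDF5 container.
read_names = ["M00001:1:AAA:1:1101:100:200",
              "M00001:1:AAA:1:1101:101:201",
              "M00001:1:AAA:1:1101:102:202"]

name_index = {name: row for row, name in enumerate(read_names)}

def fetch_rows(wanted, index):
    """Return dataset row offsets for a subset of read names (e.g. names
    pulled from a BAM alignment), skipping names absent from the container."""
    return sorted(index[n] for n in wanted if n in index)

rows = fetch_rows({"M00001:1:AAA:1:1101:102:202",
                   "M00001:1:AAA:1:1101:100:200",
                   "not-in-container"}, name_index)
# rows address the seq/qual arrays directly, so only those rows need reading
```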

4. Compact, additive trimming info

Rather than storing entirely new Fastq files for new trimming steps, it would be great to just store a lightweight mask of the trimmed positions -- this mask could be added to the original sequence container to produce the intermediate, trimmed Fastq for analysis.

It might also be useful if these could be additive -- for example, I might want to trim a Fastq for adapters before host filtering, but then be able to propagate the host trimming information back with reference to the original sequences.

Because trimming programs don't necessarily yield rich per-sequence information about trimming outcomes, there will need to be some simple aligner (or similar) that can calculate the trim positions by comparing the original and trimmed sequence.
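A minimal sketch of both ideas, assuming exact substring matching is good enough to stand in for that aligner (function names are hypothetical):

```python
def trim_coords(original, trimmed):
    """Locate a trimmed read within its source sequence by exact substring
    search (a stand-in for the 'simple aligner' above); returns (start, end)
    in source coordinates, or None if the trimmed read isn't found."""
    start = original.find(trimmed)
    if start == -1:
        return None
    return start, start + len(trimmed)

def compose(outer, inner):
    """Map a trim computed on an already-trimmed read back onto the
    original coordinates, keeping successive masks additive."""
    return outer[0] + inner[0], outer[0] + inner[1]

original = "AACCGGTTACGTACGT"
adapter_trimmed = "CCGGTTACGTAC"   # e.g. output of adapter trimming
host_trimmed = "GGTTACGT"          # e.g. output of a later filtering step

first = trim_coords(original, adapter_trimmed)        # coords in original
second = trim_coords(adapter_trimmed, host_trimmed)   # coords in first trim
combined = compose(first, second)                     # back in original coords
assert original[combined[0]:combined[1]] == host_trimmed
```

A real implementation would need to handle reads the trimmer discarded entirely, and ambiguous matches, but the composition rule — offsets add — is what makes the masks lightweight.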

5. Container scaling

Frequently, a single biological sample will get sequenced on multiple lanes, generating several starting Fastq files. It would be nice to be able to treat these as a single file while maintaining the run information.

Proposed HDF5 file structure

A basic representation would store a single Fastq file, with each read set coming off the instrument stored as an N x M array (number of reads by read length) for fast streaming writes. The different datasets (fwd, rev, i1 and i2) will be in the same order; that is, row 1 in each dataset belongs to the same read.

./fwd/
    <seq : N x M array of char>
    <qual : N x M array of int>
./rev/
    <seq : N x M array of char>
    <qual : N x M array of int>
./i1/
    <seq : N x M array of char>
    <qual : N x M array of int>
./i2/
    <seq : N x M array of char>
    <qual : N x M array of int>
./read names/
    /<names : groups of sorted(?) read names>/
        <idx : index of (original) position of corresponding seq or qual>
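To illustrate the N x M packing, here is a sketch using numpy (the `pack_reads` helper and the separate per-read lengths vector are assumptions, not part of the proposal above): reads are null-padded to the longest length, and quality scores fit in a byte.

```python
import numpy as np

def pack_reads(seqs, quals):
    """Pack one read set into the N x M layout sketched above: a char
    array for bases, a uint8 array for Phred scores, and a per-read
    length vector, since the 2-D arrays cannot be jagged and rows must
    be padded out to the longest read M."""
    n = len(seqs)
    m = max(len(s) for s in seqs)
    seq = np.zeros((n, m), dtype="S1")          # b'' (null) marks padding
    qual = np.zeros((n, m), dtype=np.uint8)
    lengths = np.zeros(n, dtype=np.uint32)
    for row, (s, q) in enumerate(zip(seqs, quals)):
        seq[row, :len(s)] = np.frombuffer(s.encode(), dtype="S1")
        qual[row, :len(q)] = q
        lengths[row] = len(s)
    return seq, qual, lengths

seq, qual, lengths = pack_reads(
    ["ACGT", "ACG"], [[30, 31, 32, 33], [30, 31, 32]])
# seq[1, 3] is null padding; lengths preserves the true read sizes
```

Arrays like these would then be written as the `seq` and `qual` datasets under e.g. `./fwd/`, with the HDF5 library's built-in compression filters applied at write time.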

Trimming results could be stored as a separate HDF5 referencing the original data structure:

/trim_result1/
    /<link to dataset>/
        [idx, fwd_start, fwd_end, rev_start, rev_end]

Optionally, the trimming results could also be added to the original HDF5 file as a new group to store all the information in a single file, which will simplify data sharing.

josenavas commented 7 years ago

cc @antgonza @wasade We would like to start writing some initial code to see how it behaves/performs, but before putting too much effort in we would like to get your input on the overall design and check whether there is anything we missed, or whether you think an alternative representation is better. Thanks!

wasade commented 7 years ago

What are the groups i1 and i2? Is this per-sample or dataset-wide?

@tanaes I can't find the picture we had from before. This seems reasonably in sync with what I remember, except that I thought we organized at the top level with a group per sample. This would necessitate per-sample groups in the trimmed results as well.

One beneficial tweak on the above: qual can safely be represented as uint8; a char is already a single byte.

Jagged arrays are not allowed, so M will need to be the max sequence length within a dataset, with some denotation for the end of a sequence — or, better yet, a separate dataset in index order containing sequence lengths. The issue with M is another argument, beyond the organizational one, for handling this per sample: there will be space savings, as exposure to the longest read is isolated to the sample(s) containing it.

If the fwd trims are consistent, i.e. always X nt, then storing in column order may be better. If the trim is variable, then row order makes sense. If we want to eke out a little more storage, though, storing in column order will allow for better compression, as there will be long stretches of nulls for sequences that aren't of length M.
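A toy illustration of the column- vs row-order point, using zlib as a stand-in for HDF5's compression filter (the reads and layout here are made up):

```python
import zlib

# Reads shorter than M are null-padded; column order groups the pad
# runs together, which generic compressors can exploit.
reads = ["ACGTACGTACGT", "ACGT", "ACG", "ACGTACG"]
m = max(len(r) for r in reads)
padded = [r.ljust(m, "\0") for r in reads]

row_major = "".join(padded).encode()
col_major = "".join(
    "".join(row[c] for row in padded) for c in range(m)).encode()

row_size = len(zlib.compress(row_major))
col_size = len(zlib.compress(col_major))
# Both orderings hold the same bytes; whether col_size actually wins
# depends on how the pad runs line up, so it is worth measuring on
# real data rather than assuming.
```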

tanaes commented 7 years ago

We were thinking that it would be nice to have a base structure that worked for a single fastq, but which could be combined hierarchically to represent multi-sample (or multiple sequence files per sample) structures.

Regarding array size and M, the initial files coming off the Illumina sequencer should all contain reads of exactly the same length.

Forward trims are likely not to be consistent, or at least not consistent enough to depend on.

wasade commented 7 years ago

Oh, I see. Would it be possible to get a use case description for representing multiple sequence files per sample within this structure?

Good point re: M.

Looking back at the format description, you may want to partition the trim information into individual datasets, or store them in column order, as I suspect the access pattern would be to bulk fetch indices in some defined buffer size, then bulk fetch the corresponding offsets, then bulk fetch the sequence.

ElDeveloper commented 7 years ago

I am less familiar with the use-cases of FASTQ files beyond 16S. The structure above makes sense.

From searching around, seems like PacBio maintains a format to store FASTQ data in HDF5, not sure if it's entirely applicable, definitely under-documented: https://github.com/PacificBiosciences/pbh5tools

Side note, I was reminded of this article from a consortium of image processing experts. Different problem, different domain, but some of their "lessons learned" are good! http://queue.acm.org/detail.cfm?id=1628215

tanaes commented 7 years ago

Notes from Qiita meeting:

File / sample architecture:

Daniel thinks that having per-sample files is a good idea.

Jon agrees, notes that much bioinformatics software operates on this assumption.

Jose notes that there are some advantages to storing multiple samples together.

per-lane:

Question of whether samples run on multiple lanes, which come from same molecules but end up with different originating fastq files, should exist in separate original files or

Downloading

If storing all fastqs as HDF5, we need to figure out mechanisms for downloading: either some way of serializing and gzipping on the fly at download time, or some staging step to generate the file (and wait for it) before download.

Yoshiki: how often are people downloading? Could check logs to see how frequent this is likely to come up.

Other

Perhaps we can get something bare-bones put together for next Wednesday!!