knights-lab / BURST

An ultrafast optimal aligner for mapping large NGS data to large genome databases.
GNU Affero General Public License v3.0
57 stars 8 forks source link

Support for reading from standard input? #29

Open mikemc opened 4 years ago

mikemc commented 4 years ago

My understanding is that currently BURST can only take in reads as an unzipped Fasta file. I am wondering if taking reads from standard input might be implemented at some point. Here are two use cases where I would find it helpful to be able to pass the reads in from standard input:

  1. My sequence data is stored in Gzipped Fastq files, but BURST wants unzipped Fasta files. It would be handy if I could simply pipe the output from a tool like seqtk (which can convert to unzipped Fasta format) into BURST, rather than having to save the intermediate Fasta files, which I don't need for any other purpose. In this way, allowing standard input partially gets around BURST's atypically need for reads as unzipped Fasta files.

  2. I am testing mapping pipelines on simulated reads. It would be handy sometimes to be able to pipe the simulated reads right into BURST.

GabeAl commented 4 years ago

Hi mikemc,

It is (unofficially) supported. :) You can pipe fasta format into BURST, but be sure to use "/dev/stdin" as the input filename. This feature hasn't been tested much.

Thanks, Gabe

GabeAl commented 4 years ago

One more comment -- BURST is designed to map multiple samples of shallow shotgun sequencing simultaneously, rather than one sample at a time. This is the reason for its fasta format requirement. Preprocessing is expected to take place with SHI7 (my QC toolkit), as this toolkit multiplexes multiple samples into a single "combined_seqs.fna" fasta file, much like QIIME1.

In this way, there is little distinction between amplicon and shallow shotgun sequencing -- both can be preprocessed by SHI7 in the same way, multiplexed in the same way, and aligned all together with BURST. BURST performs faster, and its capitalist algorithm performs better, when there are more queries in the same input file (scaling sublinearly with sequence depth, provided there is enough RAM to house the entire combined_seqs.fna).

It is not the expectation that users simply convert existing fastq's to fasta format and run alignment (although doing so would still work).

The expected pipeline is:

  1. SHI7 to produce the QC'd, adaptor-trimmed, stitched (potentially) or appended R1/R2 if paired-end was used, in a single combined_seqs.fna file that includes all samples with sample names delimited by underscore
  2. BURST to align the single combined_seqs.fna file in one run
  3. embalmulate on the resulting .b6 alignment file to produce the final table of counts (which it will split per sample again based on sample name).

Hope this clarifies!

mikemc commented 4 years ago

Thanks Gabe, both clarifications are very helpful. I have been BURST as a point of comparison for BBMap/BBSplit in a BBTools-based pipeline, so haven't bothered with SHI7 but will look into it for future use, especially if I were to use Capitalist. (Allpaths and Forage modes are most useful for my current needs)