gt1 / biobambam2

Tools for early stage alignment file processing
Other
93 stars 17 forks source link

biobambam2


Important notice: this is legacy code, recent development on biobambam2 can be found at https://gitlab.com/german.tischler/biobambam2 .


This package contains some tools for processing BAM files including

A short list of options is available for each program by calling it with the -h parameter, e.g.

bamsort -h

Source

The biobambam2 source code is hosted on github:

git@github.com:gt1/biobambam2.git

Release packages can be found at

https://github.com/gt1/biobambam2/releases

Please make sure to choose a package containing the word "release" in it's name if you intend to compile biobambam2 for production (i.e. non development) use.

Compilation of biobambam2

biobambam2 needs libmaus2 [https://github.com/gt1/libmaus2] . When libmaus2 is installed in ${LIBMAUSPREFIX} then biobambam2 can be compiled and installed in ${HOME}/biobambam2 using

- autoreconf -i -f
- ./configure --with-libmaus2=${LIBMAUSPREFIX} \
    --prefix=${HOME}/biobambam2
- make install

The release packages come with a configure script included (making the autoreconf call unnecessary for source obtained via one of those).

Command line arguments

Different from a lot of other command line tools most options for biobambam2 commands are passed as key=value pairs. An example for sorting a BAM file by name order is:

bamsort SO=queryname <in.bam >out.bam

Using bamsormadup

bamsormadup is a new tool in biobambam2. In has two modes of operation. If SO=coordinate (as it is by default) then it expects a name collated (all reads for one name appear consecutively) input file. It sorts this file by coordinate and marks duplicate reads and read pairs and outputs the sorted file in bam, cram or sam format. If SO=queryname then it expects an input file in any order and sorts this file by query name. In both cases the program can produce checksums at read set level (like bamseqchksum) and at file level (like md5sum). If the output file is sorted by coordinate and in bam format, then program can also produce a bam index on the fly. Most stages of bamsormadup are parallelised. The number of threads used can be set with the threads option (e.g. threads=4). If no such option is given then the number of logical CPUs (cores) of the machine is used as the number of threads. Parallel sorting of alignment files is I/O heavy, so a fast I/O system is crucial. We recommend to store all data (input, temporary and output) on solid state storage (SSD).

bamsormadup and cram output

CRAM output with bamsormadup requires libmaus to be built with support for io_lib version 1.3.11 (or newer) or a sufficiently recent svn revision of version 1.3.10. The binaries provided on github have support for CRAM writing. The program will look for reference sequences in the following places while encoding cram:

If the program cannot find a required reference sequence for encoding CRAM in any of these locations, then it fails. Note that if a reference sequence is only present in a FastA file then the REF_CACHE environment variable must be set or CRAM encoding will fail in the current version.

Using blastnxmltobam

The blastnxmltobam program can be used to transform blastn's XML output to the BAM format. An example is shown below:

makeblastdb -in ref.fasta -dbtype nucl
blastn -outfmt 5 -query query.fasta -db ref.fasta | blastnxmltobam ref.fasta query.fasta >query.bam

The compilation of blastnxmltobam requires the xerces-c (see https://xerces.apache.org/xerces-c/) library for XML parsing (see configure switch --with-xerces-c).