gt1 / daccord

d'accord is a non hybrid long read consensus program based on local de Bruijn graph assembly
Other
19 stars 1 forks source link

d'accord

d'accord is a non hybrid long read consensus program based on local de Bruijn graph assembly. The package contains the following programs

A short list of options is available for each program by calling it with the -h parameter, e.g.

daccord -h

Source

The daccord source code is hosted on github:

git@github.com:gt1/daccord.git

Release packages can be found at

https://github.com/gt1/daccord/releases

Please make sure to choose a package containing the word "release" in it's name if you intend to compile daccord for production (i.e. non development) use.

Compilation of daccord

daccord needs libmaus2 [https://github.com/gt1/libmaus2] . libmaus2 needs to be built with support for the GMP library. When libmaus2 is installed in ${LIBMAUSPREFIX} then daccord can be compiled and installed in ${HOME}/daccord using

- autoreconf -i -f
- ./configure --with-libmaus2=${LIBMAUSPREFIX} \
    --prefix=${HOME}/daccord
- make install

The release packages come with a configure script included (making the autoreconf call unnecessary for source obtained via one of those).

Citations:

The core algorithms used by daccord are described in the following two papers:

The first paper describes the algorithm used in the daccord program. The second one explains the algorithms used by the programs split_agr and split_dis.

daccord

daccord is the main consensus computation program. It requires the name of an LAS file and a Dazzler database as input arguments:

daccord reads.las reads.db

Please see DAZZ_DB (https://github.com/thegenemyers/DAZZ_DB) for creating Dazzler databases and DALIGNER (https://github.com/thegenemyers/DALIGNER) for creating LAS alignment files.

daccord has a set of optional parameters which can be set on the command line:

Please note that the parameters need to be provided before any other arguments (i.e. before the LAS file name and the name of the Dazzler DB).

daccord produces a FastA file on standard output. The read names follow the scheme

read-id/unique-id/from_to

An example for this is

2/7/500_23141

This designates that the respective sequence is the corrected version of read id 2 (in the trimmed Dazzler database) from base 500 to 23141. The unique-id field provides a unique identifier for each corrected fragment and has otherwise no specific meaning.

computeintrinsicqv

computeintrinsicqv computes intrinsic quality values and requires a numerical depth parameter -d, a Dazzler database and an LAS file to compute intrinsic quality values. An example call is:

computeintrinsicqv -d40 reads.db reads.las

The argument for the depth parameter is the sequencing depth, which is given by the number of sequenced bases divided by the size of the genome sequenced. The reads file can refer to a single block of the database, e.g.

computeintrinsicqv -d40 reads.db reads.1.las

is a valid call for the first block of reads.db . The database however needs to be complete, i.e.

computeintrinsicqv -d40 reads.1 reads.1.las

is not a valid call. When intrinsic quality values have been computed for all blocks of a database, then the quality tracks for the single blocks can be concatenated to the quality track of the complete database using the Catrack tool of DAZZ_DB.

lasdetectsimplerepeats

lasdetectsimplerepeats is a program which detects simple repeats (such which are completely contained in a read) by virtue of checking intrinsic quality values. The program takes a Dazzler database and an LAS file as input. An example call is

lasdetectsimplerepeats reads.db reads.las >reads.rep

The database given has to be complete, the LAS file can refer to a single block of the database.

lasdetectsimplerepeats has a set of optional parameters which can be set on the command line:

lasdetectsimplerepeats produces it's output on the standard output channel. The output for single blocks can be concatenated to the output for the whole database using the Unix cat program. This requires the outputs of lasdetectsimplerepeats to be concatenated in increasing order of the block id, i.e. run

cat reads.1.rep reads.2.rep ... reads.n.rep >reads.rep

if n is the number of blocks in the database.

lasfilteralignments

lasfilteralignments is a program which removes improper alignments from an LAS file. It requires intrinsic qualities to do so. The program takes a Dazzler database and an LAS file as input. An example call is

lasfilteralignments reads.db reads.las

This will create the output file reads_filtered.las .

lasfilteralignments has a set of optional parameters which can be set on the command line:

lasfilteralignmentsborderrepeats

lasfilteralignmentsborderrepeats is a program which removes (probably) repeat induced alignments involving a prefix or suffix of a read. The repeats are not detected by the program but need to be provided in the form of an output file as produced by lasdetectsimplerepeats. It expect four input arguments:

An example call is

lasfilteralignmentsborderrepeats out.las in.db in.rep in.las

lasfilteralignmentsborderrepeats has a set of optional parameters which can be set on the command line:

maftobam

maftobam converts alignments produced in the MAF format by PBsim to the BAM format. It requires one argument designating a FastA file containing the underlying reference. An example call is

maftobam ref.fasta <in.maf >out.bam

An optional second file name can be given to perform online reference id replacements. This is useful as PBsim does not output the name of the original reference sequence.

bamidrename

bamidrename processes a BAM file an replaces all read names after the scheme

L0/id/0_len

where id is the index of the alignment record (i.e. 0 for the first one, 1 for the second, etc.) and len is the length of the query sequence. This allows to identify the index of a record in the output file after the file has later been sorted to a different order.

generateperfectpiles

generateperfectpiles generates alignments contained in perfect alignment piles from a set of read to reference alignments given in a BAM file. The alignments produced are output in DALIGNER's LAS format. The input BAM file is assumed to have geen generated by first running bamidrename on it and subsequently having been sorted by coordinate order (use for instance biobambam2's bamsort utility to obtain this order).

A sample call is

generateperfectpiles out.las in.bam

generateperfectpiles has a set of optional parameters which can be set on the command line:

checklas

checklas compares an aligner produced LAS file with an LAS file containing perfect piles only and prints a statstic comparing both for each read. A sample call is

checklas in.db perfectpiles.las in.bam in.las

It requires four arguments

The file in.bam is expected to contain read names as produced by bamidrename. It has to be sorted by read name using biobambam2's bamsort utility.

Optional parameters are:

Statistic lines are printed on the standard output channel. A sample output line is

[R] 1941 missing 0 got 62 extra 0 erate 0.118001 U00096.2:409510,426938 0 1 1 1 180 regular

The columns are

The true rate values are computed as follows. For each trace block (usually of size 100) the (at most) seqdepth best (i.e. lowest number of errors) aligning reads are collected. Then the number n_t of true (contained in the ground truth) and n_f of false alignments in this list of seqdepth are counted. Based on this a fraction f = n_t / (n_t+n_f) is computed which designates how many of the (at most) seqdepth minimum error alignments are true. The program then outputs the minimum, average and maximum over all trace blocks for the read. In essence the minimum column should be 1 for any reads not stemming from repetetive regions.

checkconsensus

checkconsensus checks consensus accuracy by comparing consensus sequences to ground truth data. The program expects four arguments

The names of the sequences in ref.fasta must match the sequence lines in the header of reads.bam and the sequences need to appear in the same order in ref.fasta and the header of reads.bam. The read names in reads.bam need to follow the scheme described for bamidrename. reads.bam needs to have been sorted to query name order using biobambam2's bamsort. The read names in reads_cons.fasta need to follow the scheme used in the about of daccord as described above.

Before any other sub commands can be run the program needs to be run with teh index sub command. This computes indices for the various input files which will be used by the other sub command.

If the program is to be run on a single node, then the check sub command can be used subsequently to perform the accuracy comparison. Optional arguments are

For larger read sets the checking can be distributed over the nodes of a compute cluster. To this end the program can be run with the batchlist sub command. This will produce a shell script containing program calls which need to be called to obtain the same result as if the check sub command would have been called. An example call is

checkconsensus -p4 batchlist ref.fasta reads.bam reads_cons.fasta

This will produce a shell script similar to

checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000000
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000001
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000002
checkconsensus <opts> batchprocess ref.fasta reads.bam reads_cons.fasta reads_cons.fasta_check_000003
checkconsensus <opts> batchmerge ref.fasta reads.bam reads_cons.fasta
checkconsensus <opts> cleanup ref.fasta reads.bam reads_cons.fasta

using the sub commands batchprocess, batchmerge and cleanup (which should not be called directly). All the lines for batchprocess can be processed in parallel. After the batchprocess lines have been processed the batchmerge sub command will merge the resulting partial data and produce the output of the accuracy check. The cleanup sub command removes partial files produced by the batchprocess commands.

The batchlist sub command has the following optional parameters:

The output of checkconsensus contains several types of lines:

An example for a single read line is

6       AlignmentStatistics(matches=14732,mismatches=0,insertions=2,deletions=0,editdistance=2,erate=0.0001357405)      AlignmentStatistics(matches=112436,mismatches=0,insertions=10,deletions=0,editdistance=10,erate=0.0000889316)   [0,14731]       0:97757,112489

The columns are

An example for a coverage line is

[C]     16      13645   13642   0.99978

The columns are

An example of a global statistics line is

[G]     29460533        29447890        0.999571        AlignmentStatistics(matches=29447461,mismatches=195,insertions=2199,deletions=234,editdistance=2628,erate=0.0000892357)

The columns are

An example of an EM line is

[EM]    0.0004741958    18      0.0091047

The columns are

An example of an EP line is

[EP]    0.0000485201    669     0.338392

The columns are

sortfasta

sortfasta reads a FastA file from standard input, sorts it by read name and outputs the sorted data on standard output. Read names are compared in the same way as biobambam2's bamsort does for query name sorting.

mapconstoraw

mapconstoraw inserts corrected fragments into uncorrected reads by mapping the fragments onto the reads. It requires two arguments, both in FastA format:

Both of these input files are required to have been sorted using the sortfasta program.

The read names in the uncorrected reads file need to follow the scheme

read_{id}

where {id} is a numerical id. A valid example would be

read_5

The read names in the corrected read fragments file need to follow the scheme

read_{id}_{subid}

where {id} and {subid} are numerical values. A valid example would be

read_5_0

which would designate the first fragment (sub id 0) of read 5. The sub ids for each read need to be consecutive and start from 0.

mapconstoraw outputs a FastA file with the mapped corrected fragments inserted into the uncorrected reads. Corrected fragments are in upper case, uncorrected regions in lower case.

mapconstoraw has the following optional parameters:

fillfasta

fillfasta processes two sorted FastA files and computes their union such that the second file takes precedence (i.e. if a read name is present in both files, then the data from the second file will be kept and the one from the first one discarded). The read names contained in the second file need to be a subset of the read names found in the first file. The read names in both files need to follow the scheme

read_{id}

where {id} is a numerical id. A valid example would be

read_5

computeextrinsicqv

computeextrinsicqv produces a Dazzler database track providing the edit distance between a source read and the consensus produced for that read for each block of size tspace on the source read. It has two mandatory arguments. The mandatory arguments are a consensus file (as produced by daccord) in FastA format and the name of a Dazzler database (.db/.dam) file. A valid example would be

computeextrinsicqv data_fcgr/reads_cons.fasta data_fcgr/reads.dam

split_agr

split_agr performs agreement based read pile splitting for haplotypes and repeats. It expects four arguments

In addition the -d (average sequencing depth) parameter is mandatory.

A sample call is

split_agr -d20 out.las cons.fasta in.las in.db

The program requires an extrinsic quality value track for the input database. This track can be computed using the computeextrinsicqv program.

The following options can be used (no space between option name and parameter allowed):

split_dis

split_dis performs disagreement based read pile splitting for haplotypes and repeats. It expects four arguments

In addition the -d (average sequencing depth) parameter is mandatory.

A sample call is

split_dis -d20 out.las cons.fasta in.las in.db

The following options can be used (no space between option name and parameter allowed):

filterchains

filterchain processes an input LAS file and filters it using a consensus fragment list as produced by daccord. For each pair of reads the program chains up alignments using a greedy algorithm. Chains which overlap with the respective consensus sequences by less than a given number of bases (-l parameter) are removed from the output.

It expects three arguments

The following options can be used (no space between option name and parameter allowed):

A sample call is

filterchains -l5000 out.las cons.fasta in.las

wgsimtobam

wgsimtobam converts simulated reads produced by wgsim to a BAM file taking the positions into account and computing alignments between the reads and the designated reference region for each read.

It expects two arguments

A sample call is

wgsimtobam ref.fasta reads.fasta >reads.bam

The program treats the input file as single ended reads.