
SAFARI

Overview

vg SAFARI (Sensitive Alignments from a RYmer Index) is a modified version of the giraffe subcommand of the vg toolkit (https://github.com/vgteam/vg), based on a frozen version of vg (Solara, version 1.44). It is modified specifically to recover more alignments from ancient DNA (aDNA) samples, which show characteristic substitution patterns caused by chemical damage.

Compilation

git clone --recurse-submodules https://github.com/grenaud/SAFARI && cd SAFARI && git submodule update --init --recursive && (cd deps/sparsehash && ./configure) && ./source_me.sh && make -j [threads]

Basic Options

Input Options

Alternate Indexes

Output Options

Algorithm Presets

Computational Parameters

Damage Matrix (.prof) File Format

SAFARI requires initial estimates of the damage rates at the 5' and 3' ends of the aDNA fragments. These are difficult to obtain a priori, but there are two ways to get them:

1) Initial damage profile estimates can be obtained by first aligning to a linear reference and then running bam2prof (https://github.com/grenaud/bam2prof).

2) Use damage rates previously estimated on other samples. We have provided in the profs directory a number of such profiles from high-visibility papers for samples of various ages sequenced with various library protocols. Of course, preservation conditions will affect these profiles as well.

To illustrate the format, an example damage matrix file for the 3' end is provided below.

A>C     A>G     A>T     C>A     C>G     C>T     G>A       G>C     G>T     T>A     T>C     T>G
0       0       0       0       0       0       0.32891   0       0       0       0       0
0       0       0       0       0       0       0.223405  0       0       0       0       0
0       0       0       0       0       0       0.188599  0       0       0       0       0
0       0       0       0       0       0       0.164419  0       0       0       0       0
0       0       0       0       0       0       0.146352  0       0       0       0       0

For the remainder of the molecule, the values on the last line are reused for all remaining positions. For example, the file above has only 5 lines of rates, so at 6 bp from the 3' end, 0.146352 is used as the rate of G>A substitution.
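
The extension rule described above can be sketched with a small shell helper. This is only an illustration, not SAFARI's own code: the function name prof_rate is hypothetical, and it assumes the profile is whitespace-delimited, that the header is the first line, and that distances are counted from 1 at the fragment end.

```shell
# prof_rate FILE DISTANCE SUBSTITUTION
# Hypothetical helper (not part of SAFARI): prints the rate of SUBSTITUTION
# at DISTANCE bp from the fragment end (assumed 1-based). Distances beyond
# the last line of the profile reuse the values of the last line.
prof_rate() {
    awk -v dist="$2" -v name="$3" '
        NR == 1 {
            # Locate the column whose header matches the requested substitution.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col || dist < 1) { bad = 1; exit 1 }
            next
        }
        NF > 0 {
            n++
            last = $col                  # remember the most recent rate
            if (n == dist) { found = 1; exit }
        }
        END {
            if (bad || n == 0) exit 1    # unknown column, bad distance, or no rows
            print last                   # requested row, or the last row if past the end
        }
    ' "$1"
}

# Write the example 3' profile from this section to a temporary file.
prof_example=$(mktemp)
cat > "$prof_example" <<'EOF'
A>C     A>G     A>T     C>A     C>G     C>T     G>A       G>C     G>T     T>A     T>C     T>G
0       0       0       0       0       0       0.32891   0       0       0       0       0
0       0       0       0       0       0       0.223405  0       0       0       0       0
0       0       0       0       0       0       0.188599  0       0       0       0       0
0       0       0       0       0       0       0.164419  0       0       0       0       0
0       0       0       0       0       0       0.146352  0       0       0       0       0
EOF

prof_rate "$prof_example" 6 'G>A'    # prints 0.146352 (last line reused)
```

Calling the helper with distance 1 returns the first line's value (0.32891 for G>A), and any distance of 5 or more returns the last line's value.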

Quickstart

First, get the test files:

wget -nc -r -l1 --no-parent -nH --cut-dirs=2 -P SAFARI_graph ftp://ftp.healthtech.dtu.dk:/public/SAFARI_graph/

Then create the index files:

bin/vg minimizer -d SAFARI_graph/hominin.dist -g SAFARI_graph/hominin.gbwt -t [# threads] -p -o SAFARI_graph/hominin.min SAFARI_graph/hominin.og

bin/vg rymer -d SAFARI_graph/hominin.dist -g SAFARI_graph/hominin.gbwt -t [# threads] -p -o SAFARI_graph/hominin.ry SAFARI_graph/hominin.og

Then run:

bin/vg safari -f test/SAFARI/reads.fq.gz \
    -m SAFARI_graph/hominin.min \
    -q SAFARI_graph/hominin.ry \
    -Z SAFARI_graph/hominin.giraffe.gbz \
    -d SAFARI_graph/hominin.dist \
    --deam-3p test/SAFARI/dhigh3p.prof \
    --deam-5p test/SAFARI/dhigh3p.prof > SAFARI_test.gam

To check the GAM file, run:

bin/vg stats -a SAFARI_test.gam

How to get a bam file?

You can obtain a BAM file by running the "surject" subcommand of the vg toolkit (https://github.com/vgteam/vg) on the resulting GAM file.

Contact

For questions, contact Joshua Rubin (jdru@dtu.dk) or Gabriel Renaud (gabriel.reno@gmail.com).