
FaNDOM

Fast Nested Distance Aligner for Optical Maps

About

FaNDOM performs alignment of Bionano Saphyr optical map molecules and contigs to a reference, using a seed-based filter. FaNDOM is implemented in C++ and supports multithreading.

FaNDOM is developed by Siavash Raeisi Dehkordi and Jens Luebeck.

Installation

FaNDOM requires cmake 3.1 or higher. This should already be satisfied on most modern Unix systems. To check your cmake version, type cmake --version. FaNDOM has been tested on Ubuntu >=16.04, Mac OSX and CentOS 7.

To install FaNDOM, run:

git clone https://github.com/jluebeck/FaNDOM
cd FaNDOM
cmake CMakeLists.txt && make 

Python SV module requirements:

FaNDOM SV detection scripts require python3 and the numpy package.

Usage

FaNDOM takes as input Bionano Saphyr molecules stored in .bnx or .cmap form or assembled contigs in .cmap form, and a pre-processed reference genome. FaNDOM currently supports GRCh37 (hg19) and GRCh38, both available in the reference_genomes folder, and also non-human reference genomes. FaNDOM outputs alignments of the OM molecules in FaNDOM's .fda or .xmap file format.

Command line arguments

Required arguments
Optional arguments (basic)
Optional arguments (advanced)

To ensure you installed FaNDOM correctly, in the FaNDOM directory run the following command:

./FaNDOM -t=1 -r=test_data/reference.cmap -q=test_data/query.cmap -sname=test_data/res -outfmt=xmap
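The same check can be launched from Python by assembling the command as an argument list. This is only a sketch: the helper names build_fandom_command and run_fandom are hypothetical, and the only flags used are the ones in the command above (-t, -r, -q, -sname, -outfmt), in the same key=value form.

```python
import subprocess


def build_fandom_command(ref, query, out_prefix, threads=1, outfmt="xmap",
                         binary="./FaNDOM"):
    """Assemble the FaNDOM argument list using only the flags shown above."""
    return [
        binary,
        f"-t={threads}",
        f"-r={ref}",
        f"-q={query}",
        f"-sname={out_prefix}",
        f"-outfmt={outfmt}",
    ]


def run_fandom(cmd):
    """Run the command from the FaNDOM directory; raises if it exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_fandom_command("test_data/reference.cmap",
                               "test_data/query.cmap",
                               "test_data/res")
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(cmd))
```

Calling run_fandom(cmd) from the FaNDOM directory would then execute the test run.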

Wrapper for SV analysis of assembled contig data

To run the pipeline for detecting SVs on assembled contigs, use the python script wrapper_contigs.py in the "PythonScript" folder.

Output files from this process: the output of this pipeline is stored in the -o directory. 'SV.txt' contains the structural variant calls, 'indel.txt' contains the indel calls, and the file ending with 'final_alignment.xmap' contains the final alignments. An example command:

python PythonScript/wrapper_contigs.py -f $PWD -t 1 -r test_data/reference.cmap -q test_data/query.cmap -n res -o $PWD/test_data -c 19 -m 1

This should run the SV pipeline for simple datasets and save the results in the test_data/res directory.
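After a run, a quick way to confirm that the pipeline produced the files described above is to look for them in the results directory. The sketch below is illustrative: the function name find_contig_outputs is hypothetical, and the alignment file is matched by its suffix only, because just the ending of its name is given here.

```python
from pathlib import Path


def find_contig_outputs(result_dir):
    """Return the expected output files found in result_dir (None if missing)."""
    result_dir = Path(result_dir)
    found = {}
    for name in ("SV.txt", "indel.txt"):
        path = result_dir / name
        found[name] = path if path.is_file() else None
    # The alignment file is only described by its suffix, so match on that.
    xmaps = sorted(result_dir.glob("*final_alignment.xmap"))
    found["final_alignment.xmap"] = xmaps[0] if xmaps else None
    return found
```

Point it at whichever directory the wrapper wrote to (test_data/res in the example above); any entry that comes back as None indicates a missing output file.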

Wrapper for SV analysis of unassembled molecule data

To run the whole pipeline for detecting SVs on raw molecules, use the python script wrapper_individual.py in the "PythonScript" folder. This wrapper needs close to 100 GB of RAM (depending on the number of molecules) to call the SV_finder.

Output files from this process: the output of this pipeline is stored in the -o directory. It produces two folders named 'molecules' and 'alignments'. 'molecules' contains the split molecules and 'alignments' contains the molecule alignments and SVs. In the 'alignments' folder there is a file named 'final_alignment.xmap' containing all molecule alignments. 'SV.txt' contains the structural variant calls. An example command:

python PythonScript/wrapper_individual.py -f $PWD -t 10 -r referencehg38.cmap -q query.bnx -n test_molecules -o $PWD/output/ -c 38 -m 1

Video Tutorial

FaNDOM

Python scripts

The following scripts are used inside the SV wrapper, wrapper_contigs.py, and can be invoked separately if desired.

Preprocess_reference.py script

This script creates a processed reference genome for FaNDOM. If you want to use FaNDOM with a non-human reference genome, we highly recommend preprocessing the reference with this script. It merges labels that are close to each other.

As an example:

python preprocess.py -q H460_DLE1_EXP_REFINEFINAL1.cmap -o /Output/processed2 -m 50
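To give a feel for what merging close labels means, here is a minimal sketch. It is not the script's actual implementation: the exact merging rule and the units of -m are not stated here, so the sketch assumes that consecutive labels closer than a minimum distance are collapsed into one label at their mean position. The function name merge_close_labels is hypothetical.

```python
def merge_close_labels(positions, min_dist):
    """Collapse runs of labels whose consecutive gaps are below min_dist.

    positions must be sorted in ascending order. Returns (merged, groups),
    where groups[i] lists the original label indices merged into merged[i].
    Assumption: a merged label sits at the mean position of its run.
    """
    merged, groups = [], []
    run = []
    for idx, pos in enumerate(positions):
        # A gap of at least min_dist closes the current run of close labels.
        if run and pos - positions[run[-1]] >= min_dist:
            merged.append(sum(positions[i] for i in run) / len(run))
            groups.append(run)
            run = []
        run.append(idx)
    if run:
        merged.append(sum(positions[i] for i in run) / len(run))
        groups.append(run)
    return merged, groups
```

The returned groups record which original labels each merged label came from, which is the kind of bookkeeping needed to translate results back to the original label numbering.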

post_process.py script

This script remaps alignments back to the first (original) molecule file. To do so, it needs the file ending with 'dic' that the preprocess.py script produced.

remove_part.py script

This script is used for removing partial alignments from full alignments.

SV_detection_contigs.py script

This script is used for detecting potential integration points.

After running the Preprocess_reference.py script, you can mask out low-complexity regions with the built-in tool in RefAligner, as follows:

RefAligner -i refrence_genome.cmap -o filtered_reference_genome -simpleRepeatStandalone -simpleRepeatTolerance 0.1 -simpleRepeatMinEle 5 -simpleRepeatFilter 3

The parameters used above are:

.fda file format

In addition to producing .xmap formatted alignments, we support an alternate file format for FaNDOM output, with a more informative CIGAR string than XMAP. Each alignment entry contains four lines (as defined in the header):

#0      ref_id  mol_id  aln_direction   ref_start_pos   ref_end_pos     mol_start_pos   mol_end_pos     mol_length
#1      total_score     mean_score      is_multimapped  is_secondary    aln_seed_num
#2      alignment [aln_index]:(ref_pos, mol_pos, mol_lab, score_delta)
#3      cigar [aln_index]:(delta_ref, delta_mol, mol_label_diff, delta_difference)
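As an illustration of how the header can drive parsing, the sketch below reads the field names from the '#0' and '#1' header lines and applies them to each entry. It rests on an assumption not confirmed here: that every data line starts with its line index (0 to 3) followed by whitespace-separated fields in the same order as the matching header line. The function name read_fda_summaries is hypothetical, values are kept as strings, and lines 2 and 3 are stored unparsed.

```python
def read_fda_summaries(lines):
    """Parse lines 0 and 1 of each .fda entry into a dict keyed by header names.

    Assumption: data lines mirror the header layout without the leading '#',
    i.e. '<index> <field> <field> ...'.
    """
    names = {}      # line index ("0" or "1") -> field names from the header
    records = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        tag = fields[0]
        if tag in ("#0", "#1"):
            names[tag[1:]] = fields[1:]
        elif tag.startswith("#"):
            continue  # other header or comment lines are ignored
        elif tag == "0":
            records.append(dict(zip(names.get("0", []), fields[1:])))
        elif tag == "1" and records:
            records[-1].update(zip(names.get("1", []), fields[1:]))
        elif tag in ("2", "3") and records:
            rest = raw.split(None, 1)
            key = "alignment" if tag == "2" else "cigar"
            records[-1][key] = rest[1].strip() if len(rest) > 1 else ""
    return records
```

With an open file this would be called as read_fda_summaries(open(path)); numeric fields such as total_score can then be converted with float() as needed.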

is_secondary is set to True if the molecule is multimapped (is_multimapped = True) and another alignment of the molecule in the file has a higher total_score.
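The rule for is_secondary can be written out as a short sketch. It is not FaNDOM's own code: the function name flag_secondary is hypothetical, alignments are represented as plain dicts using the header field names, and is_multimapped is assumed to mean that the molecule has more than one alignment in the file.

```python
from collections import defaultdict


def flag_secondary(alignments):
    """Set is_multimapped and is_secondary on a list of alignment dicts in place."""
    by_mol = defaultdict(list)
    for aln in alignments:
        by_mol[aln["mol_id"]].append(aln)
    for group in by_mol.values():
        best = max(a["total_score"] for a in group)
        # Assumption: multimapped means more than one alignment for the molecule.
        multimapped = len(group) > 1
        for aln in group:
            aln["is_multimapped"] = multimapped
            # Secondary only if some other alignment scores strictly higher.
            aln["is_secondary"] = multimapped and aln["total_score"] < best
    return alignments
```

Under this reading, alignments that tie for the highest total_score are all left as non-secondary, since no other alignment of the molecule scores higher.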

The cigar field specifies a list of tuples (tagged by the number in the alignment, starting at 0), with the following definitions: