morispi / LRez

Standalone tool and library allowing to work with barcoded linked-reads
GNU Affero General Public License v3.0

LRez

LRez provides a standalone tool for working with barcoded linked-reads, such as 10X Genomics data, as well as a library that makes it easy to use in other projects.

Presently, it is directly compatible with the following linked-reads technologies, provided that the barcodes are reported using the BX:Z tag (if this is not the case, pre-processing scripts are provided in the utils/ directory):

LRez offers several functionalities: comparing pairs of regions or contig extremities to retrieve their common barcodes, extracting barcodes from given regions of a BAM file, and indexing and querying both BAM and FASTQ files to quickly retrieve reads or alignments sharing a given barcode or list of barcodes. It can thus be used in different applications, such as variant calling or scaffolding.
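
To make the BX:Z convention concrete, here is a minimal Python sketch of how a barcode can be read from the optional fields of a SAM record. It is only an illustration of the tag format, not LRez's implementation; the function name `barcode_from_sam_line` and the toy record are hypothetical.

```python
from typing import Optional


def barcode_from_sam_line(line: str) -> Optional[str]:
    """Return the value of the BX:Z tag of a SAM record, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    # The first 11 columns of a SAM record are mandatory; optional tags follow.
    for tag in fields[11:]:
        if tag.startswith("BX:Z:"):
            return tag[len("BX:Z:"):]
    return None


# Toy SAM record (hypothetical read name, position and barcode).
record = (
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF"
    "\tNM:i:0\tBX:Z:AACCGGTT-1"
)
print(barcode_from_sam_line(record))
```

A record without a BX:Z tag yields `None`, which is the case the pre-processing scripts are meant to address.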

Requirements

Installation from source

Clone the LRez repository, along with its submodules with:

  git clone --recursive https://github.com/morispi/LRez

Then run the install.sh script:

  ./install.sh

The installation script builds the dependencies, the standalone binary in the bin folder, and the library in the lib folder, so that LRez can be used in other projects.

Installation from conda

Alternatively, LRez is also distributed as a bioconda package, which can be installed with:

  conda install -c bioconda lrez

Using the toolkit

Usage

LRez [SUBCOMMAND]

where [SUBCOMMAND] can be one of the following:

  compare
  extract
  stats
  index bam
  query bam
  index fastq
  query fastq

Subcommands

A description of each subcommand as well as its options is given below.

Compare

LRez compare computes the number of common barcodes between all possible pairs of a given list of regions, or between a given contig's extremities and all other contigs' extremities.

  --bam STRING, -b STRING:      BAM file containing the alignments
  --index STRING, -i STRING:    Barcodes offsets index built with the index bam subcommand
  --region STRING, -r STRING:   File containing regions of interest in format chromosome:startPosition-endPosition
  --contig STRING, -c STRING:   Contig of interest
  --contigs STRING, -c STRING:  File containing a list of contigs of interest
  --size INT, -s INT:           Size of contigs' extremities to consider (optional, default: 1000) 
  --output STRING, -o STRING:   File where to output the results (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
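
The pairwise comparison described above can be pictured with a short Python sketch. It assumes the set of barcodes of each region has already been collected, and it is an illustration of the idea rather than LRez's actual code; `common_barcode_counts` and the toy regions and barcodes are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def common_barcode_counts(
    barcodes_by_region: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], int]:
    """Count the barcodes shared by every pair of regions."""
    counts = {}
    # combinations() yields each unordered pair of regions exactly once.
    for first, second in combinations(sorted(barcodes_by_region), 2):
        shared = barcodes_by_region[first] & barcodes_by_region[second]
        counts[(first, second)] = len(shared)
    return counts


# Toy input: region -> set of barcodes observed in that region.
regions = {
    "chr1:1000-2000": {"AAAC", "AAAG", "CCGT"},
    "chr1:5000-6000": {"AAAG", "CCGT", "TTTT"},
    "chr2:1000-2000": {"GGGG"},
}
for pair, count in common_barcode_counts(regions).items():
    print(pair, count)
```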

Extract

LRez extract extracts the list of barcodes in a given region of a BAM file.

  --bam STRING, -b STRING:      BAM file to extract barcodes from
  --region STRING, -r STRING:   Region of interest in format chromosome:startPosition-endPosition
  --all, -a:                    Extract all barcodes
  --output STRING, -o STRING:   File where to output the extracted barcodes (optional, default: stdout)
  --duplicates, -d:             Include duplicate barcodes (optional, default: false)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
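
As an illustration of the region format and of the effect of including duplicates, the following Python sketch parses a chromosome:startPosition-endPosition string and filters toy records. It assumes inclusive coordinates and considers only the start position of each record, which may differ from how LRez treats alignments; all names and data are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'chromosome:startPosition-endPosition' into its three parts."""
    # rpartition keeps chromosome names that themselves contain ':' intact.
    chrom, _, interval = region.rpartition(":")
    start, _, end = interval.partition("-")
    return chrom, int(start), int(end)


def extract_barcodes(
    records: Iterable[Tuple[str, int, str]],
    region: str,
    duplicates: bool = False,
) -> List[str]:
    """Return the barcodes of records starting inside the region.

    Assumption: bounds are inclusive and only start positions are tested.
    """
    chrom, start, end = parse_region(region)
    found = [bc for c, pos, bc in records if c == chrom and start <= pos <= end]
    if duplicates:
        return found
    # dict.fromkeys removes repeats while preserving first-seen order.
    return list(dict.fromkeys(found))


# Toy records: (chromosome, start position, barcode).
records = [
    ("chr1", 1200, "AAAC"),
    ("chr1", 1500, "AAAC"),
    ("chr1", 2500, "CCGT"),
    ("chr2", 1300, "GGGG"),
    ("chr1", 1900, "TTTT"),
]
print(extract_barcodes(records, "chr1:1000-2000"))
print(extract_barcodes(records, "chr1:1000-2000", duplicates=True))
```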

Stats

LRez stats retrieves general stats from a BAM file.

  --bam STRING, -b STRING:      BAM file to compute stats from
  --regions INT, -r INT:        Number of regions to consider to define stats (optional, default: 1000)
  --size INT, -s INT:           Size of the regions to consider (optional, default: 1000)
  --output STRING, -o STRING:   File where to output the stats (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)

Index BAM

LRez index bam indexes the offsets or the occurrence positions of the barcodes contained in a BAM file.

  --bam STRING, -b STRING:      BAM file to index
  --output STRING, -o STRING:   File where to store the index
  --offsets, -f:                Index the offsets of the barcodes in the BAM file
  --positions, -p:              Index the (chromosome, begPosition) occurrences positions of the barcodes
  --primary, -r:                Only index barcodes that appear in a primary alignment (optional, default: false)
  --quality INT, -q INT:        Only index barcodes that appear in an alignment of quality higher than this number (optional, default: 0)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
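
The positions index and its two filters can be sketched in Python as a mapping from each barcode to its (chromosome, begPosition) occurrences. This is a conceptual sketch on toy data, not the index format LRez writes. It assumes that "primary" means neither the secondary (0x100) nor the supplementary (0x800) SAM flag is set, and it reads "quality higher than" literally as a strict comparison; both points are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment

Alignment = Tuple[str, int, int, int, str]  # chrom, pos, flag, mapq, barcode


def index_positions(
    alignments: Iterable[Alignment],
    primary_only: bool = False,
    min_quality: int = 0,
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each barcode to the (chromosome, position) pairs where it occurs."""
    index = defaultdict(list)
    for chrom, pos, flag, mapq, barcode in alignments:
        # Assumption: a primary alignment has neither flag bit set.
        if primary_only and flag & (SECONDARY | SUPPLEMENTARY):
            continue
        # Literal reading of "quality higher than this number".
        if mapq <= min_quality:
            continue
        index[barcode].append((chrom, pos))
    return dict(index)


# Toy alignments (hypothetical values).
alignments = [
    ("chr1", 100, 0, 60, "AAAC"),
    ("chr1", 900, 256, 60, "AAAC"),
    ("chr2", 50, 0, 10, "CCGT"),
    ("chr2", 70, 16, 0, "GGGG"),
]
print(index_positions(alignments))
print(index_positions(alignments, primary_only=True, min_quality=20))
```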

Query BAM

LRez query bam queries a barcodes index and a BAM file to retrieve the alignments containing the query barcodes.

  --bam STRING, -b STRING:      BAM file to search
  --index STRING, -i STRING:    Barcodes offsets index, built with the index bam subcommand, using the -f option.
  --query STRING, -q STRING:    Query barcode to search in the BAM / index
  --list STRING, -l STRING:     File containing a list of barcodes to search in the BAM / index
  --output STRING, -o STRING:   File where to output the extracted alignments (optional, default: stdout)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
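
Since querying requires an offsets index built beforehand, the two steps are typically chained. The Python sketch below only assembles the two command lines from the options documented above and prints them; the helper names, file names and barcode are hypothetical, and it assumes the LRez binary is on the PATH if the commands are later executed.

```python
import shlex
from typing import List


def index_bam_command(bam: str, index: str, threads: int = 1) -> List[str]:
    """Build the 'LRez index bam' command for an offsets index."""
    return [
        "LRez", "index", "bam",
        "--bam", bam,
        "--output", index,
        "--offsets",
        "--threads", str(threads),
    ]


def query_bam_command(
    bam: str, index: str, barcode: str, threads: int = 1
) -> List[str]:
    """Build the 'LRez query bam' command for a single barcode."""
    return [
        "LRez", "query", "bam",
        "--bam", bam,
        "--index", index,
        "--query", barcode,
        "--threads", str(threads),
    ]


# Hypothetical file names and barcode, for illustration only.
steps = [
    index_bam_command("sample.bam", "sample.index"),
    query_bam_command("sample.bam", "sample.index", "AACCGGTT-1"),
]
for step in steps:
    print(shlex.join(step))
```

To run the steps rather than print them, each list can be passed to `subprocess.run(step, check=True)`.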

Index fastq

LRez index fastq indexes the offsets of the barcodes contained in a fastq file.

  --fastq STRING, -f STRING:    Fastq file to index
  --output STRING, -o STRING:   File where to store the index
  --gzip, -g:                   Fastq file is gzipped (optional, default: false)
  --threads INT, -t INT:        Number of threads to use (optional, default: 1)
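
The idea of an offsets index can be illustrated with a small Python sketch: record the byte offset at which each read starts, grouped by barcode, then seek directly to those offsets to retrieve the reads. This is a conceptual sketch, not LRez's index format. It assumes uncompressed, four-line FASTQ records whose header carries a BX:Z: field; the function names and toy reads are hypothetical.

```python
import io
from collections import defaultdict
from typing import BinaryIO, Dict, List


def index_fastq_offsets(handle: BinaryIO) -> Dict[str, List[int]]:
    """Map each barcode to the byte offsets of the records carrying it."""
    index = defaultdict(list)
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break
        # Skip the sequence, separator and quality lines of the record.
        for _ in range(3):
            handle.readline()
        for field in header.decode().split():
            if field.startswith("BX:Z:"):
                index[field[len("BX:Z:"):]].append(offset)
    return dict(index)


def read_record(handle: BinaryIO, offset: int) -> str:
    """Jump to a stored offset and return the four-line record found there."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4)).decode()


# Toy in-memory FASTQ file (hypothetical reads and barcodes).
fastq = io.BytesIO(
    b"@read1 BX:Z:AAAC-1\nACGT\n+\nFFFF\n"
    b"@read2 BX:Z:CCGT-1\nTTTT\n+\nFFFF\n"
    b"@read3 BX:Z:AAAC-1\nGGGG\n+\nFFFF\n"
)
index = index_fastq_offsets(fastq)
for offset in index["AAAC-1"]:
    print(read_record(fastq, offset), end="")
```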

Query fastq

LRez query fastq queries a barcodes index and a fastq file to retrieve the reads containing the query barcodes.

  --fastq STRING, -f STRING:                Fastq file to search
  --index STRING, -i STRING:                Barcodes index, built with the index fastq subcommand
  --query STRING, -q STRING:                Query barcode to search in the fastq file and the index
  --list STRING, -l STRING:                 File containing a list of barcodes to search in the fastq file and the index
  --collectionOfLists STRING, -c STRING:    File of files (FOF), i.e. a file containing the names of files that each hold a list of barcodes to search in the fastq file and the index
  --output STRING, -o STRING:               File where to output the extracted reads (optional, default: stdout)
  --gzip, -g:                               Fastq file is gzipped (optional, default: false)
  --threads INT, -t INT:                    Number of threads to use (optional, default: 1)
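
To clarify what a file of files looks like, the Python sketch below writes two small barcode lists and a FOF naming them, then reads them back. It assumes one file name per line in the FOF and one barcode per line in each list, which is an assumption about the layout rather than a documented requirement; all file names and barcodes are hypothetical.

```python
import os
import tempfile
from typing import Dict, List


def read_barcode_lists(fof_path: str) -> Dict[str, List[str]]:
    """Read a file of files and return the barcodes listed in each file."""
    lists = {}
    with open(fof_path) as fof:
        for line in fof:
            list_path = line.strip()
            if not list_path:
                continue  # ignore blank lines
            with open(list_path) as handle:
                lists[list_path] = [bc.strip() for bc in handle if bc.strip()]
    return lists


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical lists of barcodes, one barcode per line.
    list_a = os.path.join(tmp, "listA.txt")
    list_b = os.path.join(tmp, "listB.txt")
    with open(list_a, "w") as out:
        out.write("AAAC-1\nCCGT-1\n")
    with open(list_b, "w") as out:
        out.write("GGGG-1\n")
    # The FOF simply names the list files, one per line.
    fof = os.path.join(tmp, "lists.fof")
    with open(fof, "w") as out:
        out.write(list_a + "\n" + list_b + "\n")
    demo = read_barcode_lists(fof)
    demo_by_name = {os.path.basename(p): b for p, b in demo.items()}

print(demo_by_name)
```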

Using the API

Complete documentation of the different API functions is provided at https://morispi.github.io/LRez/files.html. Additional information and usage examples are provided on the Wiki page https://github.com/morispi/LRez/wiki.

Notes

LRez has been developed and tested on x86-64 GNU/Linux.
Support for any other platform has not been tested.

Authors

Pierre Morisse, Claire Lemaitre and Fabrice Legeai.

Reference

Pierre Morisse, Claire Lemaitre, Fabrice Legeai. LRez: C++ API and toolkit for analyzing and managing Linked-Reads data. Bioinformatics Advances, vbab022, https://doi.org/10.1093/bioadv/vbab022

Contact

You can report problems and bugs as issues on this repository: https://github.com/morispi/LRez/issues