open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License

pairtools


Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools processes paired-end sequence alignments and performs the following operations:

To get started:

Data formats

pairtools produces and operates on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools commands properly manage file headers and keep track of the data processing history.

Additionally, pairtools defines the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format and can be processed by any tool that operates on .pairs files.

pairtools produces a set of extra columns, which describe properties of alignments, phase, mutations, restriction, and complex walks. The full list of possible extra columns is provided in the pairtools format specification.
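Because a .pairs file names its columns in a `#columns:` header line, a reader does not have to hard-code the extra columns. The following is a minimal sketch of such a reader, not part of pairtools itself. The function name `read_pairs` and the example records are made up for illustration, and the sketch assumes that a `#columns:` line is present before the data.

```python
import io


def read_pairs(handle):
    """Split a .pairs text stream into (header_lines, column_names, records).

    Hypothetical helper for illustration; assumes a '#columns:' header line.
    """
    header, columns, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines start with '#'; one of them lists the column names.
            header.append(line)
            if line.startswith("#columns:"):
                columns = line[len("#columns:"):].split()
            continue
        if columns is None:
            raise ValueError("no '#columns:' header line before the data")
        # Data lines are TAB-separated; pair each field with its column name.
        records.append(dict(zip(columns, line.split("\t"))))
    return header, columns, records


# Made-up example with one extra column (pair_type) after the seven core ones.
example = (
    "## pairs format v1.0\n"
    "#chromsize: chrA 1000\n"
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n"
    "read1\tchrA\t100\tchrA\t500\t+\t-\tUU\n"
    "read2\tchrA\t200\tchrB\t300\t-\t+\tUU\n"
)
header, columns, records = read_pairs(io.StringIO(example))
print(columns)
print(records[1]["chrom2"])
```

Any extra column listed in the header becomes a key of each record, so the same loop works for .pairs and .pairsam files alike.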

Installation

Requirements:

For the full list of recommended versions, see the requirements in the GitHub repo.

We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel.

$ conda install -c conda-forge -c bioconda pairtools

Alternatively, install the non-Python dependencies separately, then install pairtools and its Python dependencies from PyPI using pip:

$ pip install numpy pysam cython
$ pip install pairtools
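After either installation route, one way to confirm that the package is visible to the current Python environment is to query its installed version. This is a small sketch using only the standard library (Python 3.8 or newer); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if pairtools is installed, otherwise None.
print(installed_version("pairtools"))
```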

Quick example

Set up a new test folder and download a small Hi-C dataset mapped to the sacCer3 genome:

$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
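To make the layout of a .chromsizes file concrete, here is a minimal Python sketch that reads such a table into a dictionary. It is an illustration only: the helper name `read_chromsizes` and the chromosome names and sizes in the example are made up. Python dictionaries preserve insertion order, so the order of chromosomes in the file is kept.

```python
import io


def read_chromsizes(handle):
    """Read a TAB-separated chromosome sizes table into an ordered dict."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# Made-up example: one chromosome per line, name then size in base pairs.
example = "chrA\t1000\nchrB\t2500\nchrC\t400\n"
sizes = read_chromsizes(io.StringIO(example))
print(list(sizes))
print(sizes["chrB"])
```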

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam 

Inspect the resulting table:

$ gzip -dc MATalpha_R1.pairs.gz | less
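The compressed table can also be peeked at from Python with the standard `gzip` module. The sketch below is not a pairtools API; `head_pairs` is a hypothetical helper, and the demo writes a tiny made-up file to a temporary directory so that it runs without the downloaded dataset.

```python
import gzip
import os
import tempfile
from itertools import islice


def head_pairs(path, n=5):
    """Return the first n data rows of a gzip-compressed .pairs file."""
    with gzip.open(path, "rt") as handle:
        # Skip header lines, which start with '#'.
        body = (line for line in handle if not line.startswith("#"))
        return [line.rstrip("\n").split("\t") for line in islice(body, n)]


# Self-contained demo on a made-up two-record file.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "demo.pairs.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("## pairs format v1.0\n")
        out.write("#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n")
        out.write("read1\tchrA\t10\tchrA\t90\t+\t-\n")
        out.write("read2\tchrA\t20\tchrB\t30\t-\t+\n")
    rows = head_pairs(demo_path, n=1)

print(rows)
```

To look at the file produced above, replace the demo path with `MATalpha_R1.pairs.gz`.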

Pipelines

Tools

Contributing

Pull requests are welcome.

For development, clone the repository and install it in "editable" (i.e. development) mode with the -e option. This way, changes you pull take effect without reinstalling.

$ git clone https://github.com/open2c/pairtools.git
$ cd pairtools
$ pip install -e .

Citing pairtools

Open2C, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts". bioRxiv, February 13, 2023. doi: https://doi.org/10.1101/2023.02.13.528389

License

MIT