pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.
pairtools processes paired-end sequence alignments and performs the following operations:
To get started:
pairtools produces and operates on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools commands properly manage file headers and keep track of the data processing history.
Additionally, pairtools defines the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format and can be processed by any tool that operates on .pairs files.
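To make the relationship between the two formats concrete, the sketch below builds a tiny, made-up .pairsam-like table and shows that its leading columns already form a .pairs-style record. The column layout (eight .pairs columns followed by sam1 and sam2), the read names and the placeholder SAM values are assumptions for illustration only; the format specification is the authoritative description.

```shell
# Hypothetical two-record table: 8 .pairs-style columns plus 2 placeholder SAM columns.
# The column names are an assumed layout; check the "#columns:" line of your own file.
workdir=$(mktemp -d)
printf '#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type sam1 sam2\n' > "$workdir/toy.pairsam"
printf 'read1\tchrI\t100\tchrII\t500\t+\t-\tUU\tSAM1_PLACEHOLDER\tSAM2_PLACEHOLDER\n' >> "$workdir/toy.pairsam"
printf 'read2\tchrI\t250\tchrI\t900\t-\t+\tUU\tSAM1_PLACEHOLDER\tSAM2_PLACEHOLDER\n' >> "$workdir/toy.pairsam"

# A reader that only needs the .pairs fields can take the first eight
# tab-separated columns of each data line and ignore the rest.
pairs_only=$(grep -v '^#' "$workdir/toy.pairsam" | cut -f1-8)
printf '%s\n' "$pairs_only"

# Count the fields of the first data line before and after trimming.
n_fields_before=$(grep -v '^#' "$workdir/toy.pairsam" | head -n 1 | awk -F'\t' '{ print NF }')
n_fields_after=$(printf '%s\n' "$pairs_only" | head -n 1 | awk -F'\t' '{ print NF }')
echo "fields before: $n_fields_before, after: $n_fields_after"

rm -r "$workdir"
```

This only illustrates the column relationship on toy data. For real files, the split tool is the supported way to separate .pairs and .sam content.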
pairtools can also produce a set of extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks. The full list of possible extra columns is provided in the pairtools format specification.
Requirements:
Python packages: cython, pysam, bioframe, pyyaml, numpy, scipy, pandas and click.
Command-line utilities: sort (the Unix version), bgzip (shipped with samtools) and samtools. If available, pairtools can compress outputs with pbgzip and lz4.
For the full list of recommended versions, see the requirements in the GitHub repo.
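A quick way to see which of the command-line utilities are already available is to look them up on the PATH. The sketch below assumes the executables are named exactly as listed above (sort, bgzip, samtools, pbgzip, lz4); on some systems the binaries may be named differently, so treat a "not found" as a hint rather than a verdict.

```shell
# Report which required command-line utilities are on PATH.
missing=""
for tool in sort bgzip samtools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done

# Optional compressors: their absence is not an error.
for tool in pbgzip lz4; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "optional, found:     $tool"
    else
        echo "optional, not found: $tool"
    fi
done

if [ -z "$missing" ]; then
    echo "All required utilities are available."
else
    echo "Not found on PATH:$missing"
fi
```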
We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.
With conda, you can install pairtools and all of its dependencies from the bioconda channel.
$ conda install -c conda-forge -c bioconda pairtools
Alternatively, install the non-Python dependencies separately, then install pairtools and its Python dependencies from PyPI using pip:
$ pip install numpy pysam cython
$ pip install pairtools
Set up a new test folder and download a small Hi-C dataset mapped to the sacCer3 genome:
$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam
Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:
$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
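For readers who have not seen a .chromsizes file before, the following sketch writes a toy one and reads it back. The chromosome names and lengths here are made up and are not the contents of sacCer3.reduced.chrom.sizes; only the layout (name, TAB, length, one chromosome per line, order significant) follows the description above.

```shell
# Build a toy chromsizes file: one chromosome per line, name <TAB> length in bp.
# Names and lengths are invented for illustration.
workdir=$(mktemp -d)
printf 'chrI\t1000\nchrII\t2500\nchrIII\t1800\n' > "$workdir/toy.chrom.sizes"

# The line order of the file defines the chromosome order.
chrom_order=$(cut -f1 "$workdir/toy.chrom.sizes" | tr '\n' ' ' | sed 's/ $//')

# Summing the second column gives the total length covered by the table.
total_bp=$(awk -F'\t' '{ total += $2 } END { print total }' "$workdir/toy.chrom.sizes")

echo "order: $chrom_order"
echo "total: $total_bp bp"

rm -r "$workdir"
```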
With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:
$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam
Inspect the resulting table:
$ less MATalpha_R1.pairs.gz
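When inspecting a .pairs file, it helps to separate the header (lines starting with #) from the data lines. The sketch below does this on a small made-up gzip-compressed file rather than on MATalpha_R1.pairs.gz, so it runs without any downloads. The header lines, read names and the assumption that the pair type sits in column 8 are illustrative; check the "#columns:" line of a real file before relying on column positions.

```shell
# Toy gzip-compressed .pairs file; header content and columns are illustrative assumptions.
workdir=$(mktemp -d)
{
    printf '## pairs format v1.0\n'
    printf '#chromsize: chrI 1000\n'
    printf '#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n'
    printf 'read1\tchrI\t100\tchrI\t700\t+\t-\tUU\n'
    printf 'read2\tchrI\t300\tchrI\t450\t-\t+\tUU\n'
    printf 'read3\t!\t0\t!\t0\t-\t-\tNN\n'
} | gzip -c > "$workdir/toy.pairs.gz"

# Header lines start with '#'; everything else is one pair per line.
n_header=$(gzip -dc "$workdir/toy.pairs.gz" | grep -c '^#')
n_pairs=$(gzip -dc "$workdir/toy.pairs.gz" | grep -vc '^#')

# Tally pair types from column 8 of the data lines.
type_counts=$(gzip -dc "$workdir/toy.pairs.gz" | grep -v '^#' | cut -f8 \
    | sort | uniq -c | awk '{ print $2 "=" $1 }' | tr '\n' ' ' | sed 's/ $//')

echo "header lines: $n_header"
echo "pairs:        $n_pairs"
echo "pair types:   $type_counts"

rm -r "$workdir"
```

For real datasets the stats tool reports this kind of summary directly; the pipeline above is only meant to show how the file is laid out.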
pairtools and nextflow.
parse: read .sam/.bam files produced by bwa and form Hi-C pairs
parse2: read .sam/.bam files with long paired-end or single-end reads and form Hi-C pairs from complex walks
sort: sort .pairs files (lexicographic order for chromosomes, numeric order for positions, lexicographic order for pair types)
merge: merge sorted .pairs files
select: select pairs according to specified criteria
dedup: remove PCR duplicates from a sorted triu-flipped .pairs file
markasdup: mark all pairs in a .pairsam file as Hi-C duplicates
split: split a .pairsam file into .pairs and .sam
flip: flip pairs to get an upper-triangular matrix
header: manipulate the .pairs/.pairsam header
stats: calculate various statistics of .pairs files
restrict: identify the span of the restriction fragment forming a Hi-C junction
phase: phase pairs mapped to a diploid genome
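The ordering described for sort (lexicographic for chromosomes, numeric for positions, lexicographic for pair types) can be illustrated with the plain Unix sort utility on a few header-less toy records. This is a sketch, not the sort tool itself: the key order used here (chrom1, chrom2, pos1, pos2, pair_type) and the column positions are assumptions, and the records are made up.

```shell
# Four toy data lines: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type
workdir=$(mktemp -d)
printf 'r1\tchrII\t50\tchrII\t90\t+\t-\tUU\n'      > "$workdir/toy.body"
printf 'r2\tchrI\t300\tchrII\t10\t+\t+\tUU\n'     >> "$workdir/toy.body"
printf 'r3\tchrI\t1000\tchrI\t2000\t-\t+\tUU\n'   >> "$workdir/toy.body"
printf 'r4\tchrI\t200\tchrI\t900\t-\t-\tUU\n'     >> "$workdir/toy.body"

tab=$(printf '\t')

# Chromosome columns (2 and 4) compare as byte strings, position columns
# (3 and 5) compare as numbers, and the pair type (8) breaks remaining ties.
# LC_ALL=C makes the string comparison independent of the system locale.
sorted_ids=$(LC_ALL=C sort -t "$tab" -k2,2 -k4,4 -k3,3n -k5,5n -k8,8 "$workdir/toy.body" \
    | cut -f1 | tr '\n' ' ' | sed 's/ $//')

echo "sorted read order: $sorted_ids"

rm -r "$workdir"
```

Note that r4 (pos1 = 200) sorts before r3 (pos1 = 1000) only because positions are compared numerically; as strings, "1000" would come first.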
Pull requests are welcome.
For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.
$ git clone https://github.com/open2c/pairtools.git
$ cd pairtools
$ pip install -e .
pairtools
Open2C, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts." bioRxiv, February 13, 2023. doi: https://doi.org/10.1101/2023.02.13.528389
MIT