hic_qc.py
readme.This script is intended as a simple QC method for Hi-C libraries, based on reads in a BAM file aligned to some genome/assembly. For our full recommendations on aligning and QCing Hi-C data, please see here.
The most informative Hi-C reads are the ones that are long-distance contacts, or contacts between contigs of an assembly. This tool quantifies such contacts and makes plots of contact distance distributions. The most successful Hi-C libraries have many long-distance and among-contig contacts.
Hi-C connectivity drops off in approximately a power-law with increasing linear sequence distance. Consequently, one expects Hi-C reads to follow a characteristic distribution, wherein there is a spike of many read pairs at distances close to zero, which drops off smoothly (in log space) with increasing distance. If there are odd spikes or discontinuities, or if there are few long-distance contacts, there may be a problem either with the library or the assembly.
To install using conda, run the following commands in sequence (assumes you already have conda and git installed):
git clone https://github.com/phasegenomics/hic_qc.git
cd hic_qc/
conda env create -n hic_qc --file env.yml
conda activate hic_qc
python setup.py install --user
This will create a python 3.6 environment and automatically install the wkhtmltopdf
dependency. You may need to run a conda init
command to set up your shell if you haven't previously initialized conda in your shell.
To use a different python version, run:
conda --add channels bioconda --add channels conda-forge
conda create -q -n hic_qc python=$PYTHON_VERSION pysam numpy scipy matplotlib wkhtmltopdf pdfkit markdown
conda activate hic_qc
python setup.py install --user
For installation using pip, run this statement in a terminal:
git clone https://github.com/phasegenomics/hic_qc.git
cd hic_qc
pip install --user -r requirements.txt
python setup.py install --user
We include a requirements.txt
file with dependencies, which should be installed if you use the above command. However, if you want to use the PDF report feature of this tool, you will need to install wkhtmltopdf
externally, as we cannot install this readily.
This script has been verified on MacOSX, Ubuntu Linux, and Amazon Linux.
Some dependencies such as matplotlib don't play nicely with all pythons, such that some pythons in e.g. virtualenvs may not work. In that specific case you can just deactivate the virtualenv.
In the most basic use-case, you can run the script in a terminal
python hic_qc.py -b input.bam -n num_reads_to_use -r
where input.bam
is your BAM file from aligning Hi-C reads to your reference, and num_reads_to_use
is just the number of read pairs you want to sample from the BAM file (default 1 million read pairs; assuming there are this many reads in the file).
The script will write plots in PDF format to the file Read_mate_dist.pdf
in the working directory, unless -o
or --outfile
are set as described below.
The script will also quantify some basic QC metrics and print those to the screen.
The script can also make a full-on PDF report of those metrics with the plots embedded if you set the -r
/--make_report
flag.
To set the name of the files written out, such as the PNG figures and the report PDF, set the -o /path/to/outfile
or --outfile_name /path/to/outfile
parameters.
QC is performed using a set of thresholds in JSON format. By default, the file hic_qc/collateral/thresholds.json
is used. The chosen file may be changed with the --thresholds
flag. Note that the thresholds in the default file are informed by Phase Genomics' analysis of thousands of Hi-C libraries, and reflect what we ourselves use for QC.
Different QC thresholds may be present in a thresholds file. The default file includes thresholds for genome scaffolding projects and metagenome deconvolution projects. The --sample_type
argument is used to specify which set of thresholds in the thresholds file should be used in the run, and is also noted at the top of the report. By default, the genome
sample type is used. Additional sample types may be added to a thresholds JSON file by making them keys in the file.
The report includes a judgement about the library and the assembly it was mapped to based on the observed statistics, shown at the top of the report. Libraries are given one of four classifications:
IMPORTANT NOTE: because input assembly is a significant contributor to the ability to perform a given analysis, a good library can still generate a failed result with a bad assembly. This can particularly occur with difficult-to-align-to assemblies, such as polyploids or highly heterozygous/repetitive assemblies.
Three statistics are used to determine if a set of alignments has "good" aspects to it:
The entire set of alignments is deemed to have "good" properties if ALL THREE good criteria pass the threshold.
Three statistics are used to determine if a set of alignments has "bad" aspects to it:
Thresholds for these "good" and "bad" aspects come from the specified thresholds JSON file and are shown in the report. The fields in the report are highlighted for convenience to show whether a specific metric passed or failed the threshold. These are used to generate the judgement calls:
Histogram plots should show some characteristic features:
The collateral folder includes several histograms which serve as examples of what is expected for a good Hi-C library.
Problems observed in QC may indicate an issue either with the Hi-C reads or the assembly used for alignments. If the assembly is bad or e.g. comes from a distantly related organism or set of organisms, you should expect to see artifacts in the alignment of Hi-C reads.
matlock
has utilities for doing this.polar_star
can help you infer and break misjoins using long read data.