phasegenomics / hic_qc

A (very) simple script to QC Hi-C data.
GNU Affero General Public License v3.0
25 stars 5 forks source link

Build Status

hic_qc.py readme.

This script is intended as a simple QC method for Hi-C libraries, based on reads in a BAM file aligned to some genome/assembly. For our full recommendations on aligning and QCing Hi-C data, please see here.

The most informative Hi-C reads are the ones that are long-distance contacts, or contacts between contigs of an assembly. This tool quantifies such contacts and makes plots of contact distance distributions. The most successful Hi-C libraries have many long-distance and among-contig contacts.

Hi-C connectivity drops off in approximately a power-law with increasing linear sequence distance. Consequently, one expects Hi-C reads to follow a characteristic distribution, wherein there is a spike of many read pairs at distances close to zero, which drops off smoothly (in log space) with increasing distance. If there are odd spikes or discontinuities, or if there are few long-distance contacts, there may be a problem either with the library or the assembly.

Dependencies

Installation

Conda installation

To install using conda, run the following commands in sequence (assumes you already have conda and git installed):

git clone https://github.com/phasegenomics/hic_qc.git
cd hic_qc/
conda env create -n hic_qc --file env.yml
conda activate hic_qc
python setup.py install --user

This will create a python 3.6 environment and automatically install the wkhtmltopdf dependency. You may need to run a conda init command to set up your shell if you haven't previously initialized conda in your shell.

To use a different python version, run:

conda --add channels bioconda --add channels conda-forge
conda create -q -n hic_qc python=$PYTHON_VERSION pysam numpy scipy matplotlib wkhtmltopdf pdfkit markdown
conda activate hic_qc
python setup.py install --user

Pip installation

For installation using pip, run this statement in a terminal:

git clone https://github.com/phasegenomics/hic_qc.git
cd hic_qc
pip install --user -r requirements.txt
python setup.py install --user

We include a requirements.txt file with dependencies, which should be installed if you use the above command. However, if you want to use the PDF report feature of this tool, you will need to install wkhtmltopdf externally, as we cannot install this readily.

This script has been verified on MacOSX, Ubuntu Linux, and Amazon Linux.

Some dependencies such as matplotlib don't play nicely with all pythons, such that some pythons in e.g. virtualenvs may not work. In that specific case you can just deactivate the virtualenv.

Usage

In the most basic use-case, you can run the script in a terminal

python hic_qc.py -b input.bam -n num_reads_to_use -r

where input.bam is your BAM file from aligning Hi-C reads to your reference, and num_reads_to_use is just the number of read pairs you want to sample from the BAM file (default 1 million read pairs; assuming there are this many reads in the file).

The script will write plots in PDF format to the file Read_mate_dist.pdf in the working directory, unless -o or --outfile are set as described below.

The script will also quantify some basic QC metrics and print those to the screen.

The script can also make a full-on PDF report of those metrics with the plots embedded if you set the -r/--make_report flag.

To set the name of the files written out, such as the PNG figures and the report PDF, set the -o /path/to/outfile or --outfile_name /path/to/outfile parameters.

QC is performed using a set of thresholds in JSON format. By default, the file hic_qc/collateral/thresholds.json is used. The chosen file may be changed with the --thresholds flag. Note that the thresholds in the default file are informed by Phase Genomics' analysis of thousands of Hi-C libraries, and reflect what we ourselves use for QC.

Different QC thresholds may be present in a thresholds file. The default file includes thresholds for genome scaffolding projects and metagenome deconvolution projects. The --sample_type argument is used to specify which set of thresholds in the thresholds file should be used in the run, and is also noted at the top of the report. By default, the genome sample type is used. Additional sample types may be added to a thresholds JSON file by making them keys in the file.

Library judgement and thresholds

Judgement categories

The report includes a judgement about the library and the assembly it was mapped to based on the observed statistics, shown at the top of the report. Libraries are given one of four classifications:

IMPORTANT NOTE: because input assembly is a significant contributor to the ability to perform a given analysis, a good library can still generate a failed result with a bad assembly. This can particularly occur with difficult-to-align-to assemblies, such as polyploids or highly heterozygous/repetitive assemblies.

"Good" properties

Three statistics are used to determine if a set of alignments has "good" aspects to it:

The entire set of alignments is deemed to have "good" properties if ALL THREE good criteria pass the threshold.

"Bad" properties

Three statistics are used to determine if a set of alignments has "bad" aspects to it:

Generating the final judgement

Thresholds for these "good" and "bad" aspects come from the specified thresholds JSON file and are shown in the report. The fields in the report are highlighted for convenience to show whether a specific metric passed or failed the threshold. These are used to generate the judgement calls:

Histogram plot characteristics to look for

Histogram plots should show some characteristic features:

Example histograms

The collateral folder includes several histograms which serve as examples of what is expected for a good Hi-C library.

Statistics reported

What do I do if there is a problem?

Problems observed in QC may indicate an issue either with the Hi-C reads or the assembly used for alignments. If the assembly is bad or e.g. comes from a distantly related organism or set of organisms, you should expect to see artifacts in the alignment of Hi-C reads.