mdozmorov / HiC_tools

A collection of tools for Hi-C data analysis
MIT License
3d-genome chromatin hi-c

Hi-C data analysis tools and papers


Tools are added by publication date, newest on top. Unpublished tools are listed at the end of each section. See Hi-C data notes and single-cell Hi-C notes for more. Please, contribute and get in touch! See MDnotes for other data science and genomics-related notes.

Table of contents

Pipelines

QC, quality control

Capture-C

Capture-C peaks

HiChIP

4C

Resolution improvement

  • HIFI - Command-line tool for Hi-C Interaction Frequency Inference for restriction fragment-resolution analysis of Hi-C data. Sparsity is resolved by using dependencies between neighboring restriction fragments, with Markov Random Fields performing the best. Better resolves TADs and sub-TADs, significant interactions. CTCF, RAD21, SMC3, ZNF143 are enriched around TAD boundaries. Matrices normalized for fragment-specific biases.

    Paper
    • Cameron, Christopher JF, Josée Dostie, and Mathieu Blanchette. "Estimating DNA-DNA Interaction Frequency from Hi-C Data at Restriction-Fragment Resolution" https://doi.org/10.1186/s13059-019-1913-y Genome Biology, 14 January 2020
  • hicGAN - improving resolution (saturation) of Hi-C data using Generative Adversarial Networks. The generator has five inner residual blocks to fight the vanishing gradient (each block has two convolutional layers and batch normalization) and an outer skip connection. The discriminator has three convolutional blocks. Evaluation metrics: MSE, signal-to-noise ratio, structure similarity index, chromatin loop score. Compared against HiCPlus. Python, TensorFlow implementation.

    Paper

    Liu, Qiao, Hairong Lv, and Rui Jiang. "HicGAN Infers Super Resolution Hi-C Data with Generative Adversarial Networks" https://doi.org/10.1093/bioinformatics/btz317 Bioinformatics 35, no. 14 (July 15, 2019)

  • HiCNN - a computational method for resolution enhancement. A modification of the HiCPlus approach, using a very deep (54 layers, five types of layers) convolutional neural network. A Hi-C matrix of regular resolution is transformed into a high-resolution but very sparse matrix, and HiCNN predicts the missing values. Evaluated with Pearson correlation, MSE, and the overlap of Fit-Hi-C-detected significant interactions - performs similarly to or slightly better than HiCPlus. PyTorch implementation.

    Paper

    Liu, Tong, and Zheng Wang. "HiCNN: A Very Deep Convolutional Neural Network to Better Enhance the Resolution of Hi-C Data" https://doi.org/10.1093/bioinformatics/btz251 Bioinformatics, April 9, 2019

  • Boost-HiC - infer fine-resolution contact frequencies in Hi-C data, performs well even on 0.1% of the raw data. TAD boundaries remain. Better than HiCPlus. It can be used for differential analysis (comparison) of two Hi-C maps.

    Paper

    Carron, Leopold, Jean-baptiste Morlot, Vincent Matthys, Annick Lesne, and Julien Mozziconacci. "Boost-HiC: Computational Enhancement of Long-Range Contacts in Chromosomal Contact Maps" https://doi.org/10.1101/471607 November 18, 2018.

  • mHi-C - recovering alignment of multi-mapped reads in Hi-C data. Generative model to estimate probabilities for each bin-pair originating from a given origin. Reproducibility of contact matrices (stratum-adjusted correlation), reproducibility and number of significant interactions are improved. Novel interactions. Enrichment of TAD boundaries in LINE and SINE repetitive elements. Multi-mapping is not sensitive to trimming. Read filtering strategy (Figure 1, supplementary figures are very visual).

    Paper

    Zheng, Ye, Ferhat Ay, and Sunduz Keles. "Generative Modeling of Multi-Mapping Reads with MHi-C Advances Analysis of High Throughput Genome-Wide Conformation Capture Studies" https://doi.org/10.1101/301705 October 3, 2018.

  • HiCPlus - increasing resolution of Hi-C data using a convolutional neural network, with mean squared error as the loss function. Basically, smoothing parts of the Hi-C image, then binning into smaller parts. Performs better than bilinear/bicubic smoothing.

    Paper

    Zhang, Yan, Lin An, Ming Hu, Jijun Tang, and Feng Yue. "HiCPlus: Resolution Enhancement of Hi-C Interaction Heatmap" https://doi.org/10.1038/s41467-018-03113-2 March 1, 2017.

  • Simulation

    Normalization

  • HiCcompare - R/Bioconductor package for joint normalization of two Hi-C datasets using loess regression through an MD plot (minus-distance). Data-driven normalization accounting for the between-dataset biases. Per-distance permutation testing of significant interactions.

    Paper

    Stansfield, John C., Kellen G. Cresswell, Vladimir I. Vladimirov, and Mikhail G. Dozmorov. "HiCcompare: An R-Package for Joint Normalization and Comparison of HI-C Datasets" https://doi.org/10.1186/s12859-018-2288-x BMC Bioinformatics 19, no. 1 (December 2018).

  • HiFive - handling and normalization of pre-aligned Hi-C and 5C data.

    Paper

    Sauria, Michael EG, Jennifer E. Phillips-Cremins, Victor G. Corces, and James Taylor. "HiFive: A Tool Suite for Easy and Efficient HiC and 5C Data Analysis" https://doi.org/10.1186/s13059-015-0806-y Genome Biology 16, no. 1 (December 2015) - post-processing of aligned Hi-C and 5C data, three normalization approaches: "Binning" - model-based Yaffe & Tanay's method, "Express" - matrix-balancing approach, "Probability" - multiplicative probability model. Judging normalization quality by the correlation between matrices.

  • HiCNorm - removing known biases in Hi-C data (GC content, mappability, fragment length) via Poisson regression.

    Paper

    Hu, Ming, Ke Deng, Siddarth Selvaraj, Zhaohui Qin, Bing Ren, and Jun S. Liu. "HiCNorm: Removing Biases in Hi-C Data via Poisson Regression" https://doi.org/10.1093/bioinformatics/bts570 Bioinformatics (Oxford, England) 28, no. 23 (December 1, 2012) - Poisson normalization. Also tested negative binomial.
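The MD-plot idea described for HiCcompare above can be illustrated with a minimal sketch. It is not the package's implementation: the function name `md_normalize` and the input layout are hypothetical, the loess fit is replaced by a plain per-distance mean of M (a deliberately simpler stand-in), and splitting the correction evenly between the two datasets is an assumption of this sketch.

```python
import math
from collections import defaultdict


def md_normalize(contacts):
    """Jointly normalize two Hi-C datasets on an MD plot (toy version).

    contacts: list of (bin_i, bin_j, if1, if2) with strictly positive
    interaction frequencies from dataset 1 and dataset 2.
    Returns a list of (bin_i, bin_j, if1_adjusted, if2_adjusted).
    """
    # M = log2(IF2 / IF1), D = distance between the two bins (in bins)
    m_by_distance = defaultdict(list)
    for i, j, if1, if2 in contacts:
        m_by_distance[abs(i - j)].append(math.log2(if2 / if1))

    # Stand-in for the loess curve f(D): the mean of M at each distance.
    # (Assumption: a real analysis would fit a smooth curve instead.)
    trend = {d: sum(ms) / len(ms) for d, ms in m_by_distance.items()}

    adjusted = []
    for i, j, if1, if2 in contacts:
        half = trend[abs(i - j)] / 2
        # Move both datasets toward each other by half of the fitted bias
        adjusted.append((i, j, if1 * 2 ** half, if2 / 2 ** half))
    return adjusted


toy = [(0, 1, 10.0, 20.0), (1, 2, 8.0, 16.0), (0, 2, 3.0, 12.0), (1, 3, 5.0, 20.0)]
for row in md_normalize(toy):
    print(row)
```

After the adjustment, the mean of M at every distance is zero, which is the property the MD-based normalization aims for; differences that remain after a smooth fit are the candidates for differential interactions.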

  • CNV-aware normalization

    Reproducibility

  • QuASAR - Hi-C quality and reproducibility measure using spatial consistency between local and regional signals. Finds the maximum useful resolution by comparing quality and replicate scores of replicates. Part of the HiFive pipeline.

    Paper

    Sauria, Michael EG, and James Taylor. "QuASAR: Quality Assessment of Spatial Arrangement Reproducibility in Hi-C Data" https://doi.org/10.1101/204438 BioRxiv, November 14, 2017.

  • HiCRep - Similarity assessment using the generalized Cochran-Mantel-Haenszel statistic M2. Plain Spearman/Pearson correlation doesn't work. Two-step procedure: smooth the matrix, then compute the CMH statistic. In essence, the data are split into distance strata, Pearson correlation is computed within each stratum, and the results are summarized. Simple and well-thought-out statistics. Methods: Hi-C datasets with replicates, including 11 ENCODE datasets. R package and Python implementation.

    Paper

    Yang, Tao, Feipeng Zhang, Galip Gurkan Yardimci, Ross C Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. "HiCRep: Assessing the Reproducibility of Hi-C Data Using a Stratum-Adjusted Correlation Coefficient" https://genome.cshlp.org/content/27/11/1939.long Genome Research, August 30, 2017

  • HiC-Spector - reproducibility metric to quantify the similarity between contact maps using spectral decomposition. Decomposes the Laplacian matrices and sums the Euclidean distances between their eigenvectors.

    Paper

    Yan, Koon-Kiu, Galip Gürkan Yardimci, Chengfei Yan, William S. Noble, and Mark Gerstein. "HiC-Spector: A Matrix Library for Spectral and Reproducibility Analysis of Hi-C Contact Maps" https://doi.org/10.1093/bioinformatics/btx152 Bioinformatics (Oxford, England) 33, no. 14 (July 15, 2017)
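The "split by distance, correlate each chunk, summarize" idea behind HiCRep can be sketched in a few lines. This is a simplified illustration rather than the HiCRep code: the smoothing step is omitted, the main diagonal is skipped, the function name is hypothetical, and the stratum weights (number of pairs times the geometric mean of the two raw variances) are an assumption of this sketch.

```python
import math


def stratum_adjusted_correlation(a, b, max_distance):
    """Weighted average of per-distance Pearson correlations (toy version).

    a, b: square contact matrices (lists of lists) of equal size.
    max_distance: largest bin distance (diagonal offset) to include.
    """
    n = len(a)
    numerator = 0.0
    denominator = 0.0
    for d in range(1, max_distance + 1):
        # One stratum = all bin pairs separated by exactly d bins
        x = [a[i][i + d] for i in range(n - d)]
        y = [b[i][i + d] for i in range(n - d)]
        if len(x) < 2:
            continue
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        var_x = sum((v - mean_x) ** 2 for v in x) / len(x)
        var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
        if var_x == 0 or var_y == 0:
            continue  # correlation is undefined for a constant stratum
        cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / len(x)
        spread = math.sqrt(var_x * var_y)
        rho = cov / spread
        weight = len(x) * spread  # assumed weighting, see the note above
        numerator += weight * rho
        denominator += weight
    if denominator == 0:
        raise ValueError("no stratum with non-constant values")
    return numerator / denominator


toy = [
    [0, 5, 3, 1, 1],
    [5, 0, 6, 2, 1],
    [3, 6, 0, 4, 3],
    [1, 2, 4, 0, 7],
    [1, 1, 3, 7, 0],
]
print(stratum_adjusted_correlation(toy, toy, max_distance=3))
```

Correlating within strata removes the dominant distance-decay trend that makes a single genome-wide Pearson or Spearman coefficient look high even for unrelated samples.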

  • AB compartments

  • Calder - multi-scale compartment and sub-compartment detection, improvement over dichotomous AB compartment detection. Clustering contact similarities (Fisher's z-transformed correlations) into high intra and low inter-region similarities, followed by a divisive hierarchical clustering within each domain. The likelihood of nested sub-domains can be estimated using a mixture log-normal distribution. Detailed methods, complex. Eight subcompartments, 4 within the A and 4 within the B compartment, balanced set, in contrast to SNIPER. Expected associations with active/inactive genomic annotations. Nested compartments may be associated with TADs/loops. Analysis of domain repositioning across 114 cell lines. 40kb resolution. R package, named after Alexander Calder, an American sculptor. Supplementary Data 1 - IDs and links to Hi-C, ChIP-seq, and RNA-seq datasets; Data 2 - hg19 BED files of Complete domain hierarchies inferred by CALDER from 127 Hi-C contact maps; Data 7 - coordinates of Repositioned compartment domains between normal and cancer cell lines derived from breast, prostate, and pancreatic tissue samples.

    Paper
    • Liu, Yuanlong, et al. "Systematic Inference and Comparison of Multi-Scale Chromatin Sub-Compartments Connects Spatial Organization to Cell Phenotypes" https://doi.org/10.1038/s41467-021-22666-3 Nature Communications, 10 May 2021
  • dcHiC - differential A/B compartment analysis of Hi-C data. Uses Multiple Factor Analysis (MFA), an extension of PCA which combines Hi-C maps before performing generalized PCA. Analogous to weighted PCA in which every dataset is normalized for its biases (Methods). Multivariate distance measure to estimate statistical significance of compartment differences. Applied to mouse neuronal differentiation, mouse hematopoietic system, human cell Hi-C data. Gene enrichment analysis shows biologically relevant signal. Input - sparse matrix, hic, cool files.

    Paper

    Wang, Jeffrey, Abhijit Chakraborty, and Ferhat Ay. "DcHiC: Differential Compartment Analysis of Hi-C Datasets" https://doi.org/10.1101/2021.02.02.429297 BioRxiv, January 1, 2021

  • SNIPER - 3D subcompartment (A1, A2, B1, B2, B3) identification from low-coverage Hi-C datasets. A neural network based on a denoising autoencoder (9 layers) and a multi-layer perceptron. Sigmoidal activation of inputs, ReLU, softmax on outputs. Dropout, binary cross-entropy. exp(-1/C) transformation of Hi-C matrices. Applied to Gm12878 and 8 additional cell types to compare subcompartment changes. Compared with Rao2014 annotations, outperforms Gaussian HMM and MEGABASE.

    Paper

    Xiong, Kyle, and Jian Ma. "Revealing Hi-C Subcompartments by Imputing High-Resolution Inter-Chromosomal Chromatin Interactions" https://doi.org/10.1038/s41467-019-12954-4 Nature Communications, 07 November 2019

  • CScoreTool - AB compartment detection, fast and memory-efficient C++ tool, operates on data with low sequencing depth (benchmarked against HOMER). In contrast to PCA, uses a log-likelihood function, MLE for parameter estimation. C-scores can be directly compared.

    Paper

    Zheng, Xiaobin, and Yixian Zheng. “CscoreTool: Fast Hi-C Compartment Analysis at High Resolution.” Edited by John Hancock. Bioinformatics 34, no. 9 (May 1, 2018): 1568–70. https://doi.org/10.1093/bioinformatics/btx802.

  • Eigenvector - Juicer's native tool. The eigenvector can be used to delineate compartments in Hi-C data at coarse resolution; the sign of the eigenvector typically indicates the compartment. The eigenvector is the first principal component of the Pearson's matrix.
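The eigenvector approach described above can be sketched on a toy matrix. This is an illustration, not Juicer's code: the function names are hypothetical, the leading eigenvector of the row-wise Pearson correlation matrix is found by power iteration, and the observed/expected distance normalization that real pipelines usually apply first is skipped.

```python
import math


def pearson_matrix(m):
    """Pearson correlation between every pair of rows of a square matrix."""
    unit_rows = []
    for row in m:
        mean = sum(row) / len(row)
        dev = [v - mean for v in row]
        norm = math.sqrt(sum(v * v for v in dev))
        if norm == 0:
            raise ValueError("constant row; filter empty bins first")
        unit_rows.append([v / norm for v in dev])
    n = len(m)
    return [
        [sum(p * q for p, q in zip(unit_rows[i], unit_rows[j])) for j in range(n)]
        for i in range(n)
    ]


def leading_eigenvector(c, iterations=200):
    """Power iteration on a symmetric positive semi-definite matrix."""
    n = len(c)
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the leading eigenvector")
        v = [x / norm for x in w]
    return v


def compartment_signs(m):
    """+1 / -1 per bin; which sign corresponds to A is NOT determined here."""
    return [1 if x >= 0 else -1 for x in leading_eigenvector(pearson_matrix(m))]


# Toy checkerboard: even bins contact even bins strongly, odd bins odd bins
toy = [[10 if (i - j) % 2 == 0 else 2 for j in range(6)] for i in range(6)]
print(compartment_signs(toy))
```

The sign of an eigenvector is arbitrary, so in practice it is oriented with an external track such as GC content or gene density before bins are labeled A or B.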

  • Peak/Loop callers

    Differential analysis

  • Selfish - comparative analysis of replicate Hi-C experiments via a self-similarity measure - local similarity borrowed from image comparison. Check reproducibility, detect differential interactions. Boolean representation of contact matrices for reproducibility quantification. Deconvoluting local interactions with a Gaussian filter (putting a Gaussian bell around a pixel), then comparing derivatives between contact maps for each radius. Simulated (Zhou method) and real comparison with FIND - better performance, especially on low fold-changes. Stronger enrichment of relevant epigenomic features. Matlab implementation.

    Paper

    Roayaei Ardakany, Abbas, Ferhat Ay, and Stefano Lonardi. "Selfish: Discovery of Differential Chromatin Interactions via a Self-Similarity Measure" https://doi.org/10.1093/bioinformatics/btz362 Bioinformatics, July 2019

  • multiHiCcompare - R/Bioconductor package for joint normalization of multiple Hi-C datasets using cyclic loess regression through pairs of MD plots (minus-distance). Data-driven normalization accounting for the between-dataset biases. Per-distance edgeR-based testing of significant interactions.

    Paper

    Stansfield, John C, Kellen G Cresswell, and Mikhail G Dozmorov. "MultiHiCcompare: Joint Normalization and Comparative Analysis of Complex Hi-C Experiments" https://doi.org/10.1093/bioinformatics/btz048 Bioinformatics, January 22, 2019

  • Chicdiff - differential interaction detection in Capture Hi-C data. Signal normalization based on the CHiCAGO framework, differential testing using DESeq2. Accounting for distance effect by the Independent Hypothesis Testing (IHW) method to learn p-value weights based on the distance to maximize the number of rejected null hypotheses.

    Paper

    Cairns, Jonathan, William R. Orchard, Valeriya Malysheva, and Mikhail Spivakov. "Chicdiff: A Computational Pipeline for Detecting Differential Chromosomal Interactions in Capture Hi-C Data" https://doi.org/10.1101/526269 BioRxiv, January 1, 2019

  • HiCcompare - R/Bioconductor package for joint normalization of two Hi-C datasets using loess regression through an MD plot (minus-distance). Data-driven normalization accounting for the between-dataset biases. Per-distance permutation testing of significant interactions.

    Paper

    Stansfield, John C., Kellen G. Cresswell, Vladimir I. Vladimirov, and Mikhail G. Dozmorov. "HiCcompare: An R-Package for Joint Normalization and Comparison of HI-C Datasets" https://doi.org/10.1186/s12859-018-2288-x BMC Bioinformatics 19, no. 1 (December 2018)

  • FIND - differential chromatin interaction detection comparing the local spatial dependency between interacting loci. Previous strategies - simple fold-change comparisons, binomial model (HOMER), count-based edgeR. FIND exploits a spatial Poisson process model to detect differential chromatin interactions that show a significant change in their interaction frequency and the interaction frequency of their adjacent bins. "Variogram" concept. For each point, compare densities between conditions using Fisher's test. Explored various multiple testing correction methods, used the r^th ordered p-values (rOP) method. Benchmarking against edgeR in simulated settings - FIND outperforms at shorter distances, edgeR has more false positives at longer distances. Real Hi-C data normalized using KR and MA normalizations. R package.

    Paper

    Djekidel, Mohamed Nadhir, Yang Chen, and Michael Q. Zhang. “FIND: DifFerential Chromatin INteractions Detection Using a Spatial Poisson Process.” Genome Research, February 12, 2018. https://doi.org/10.1101/gr.212241.116.

  • diffloop - Differential analysis of chromatin loops (ChIA-PET). edgeR framework.

    Paper

    Lareau, Caleb A., and Martin J. Aryee. "Diffloop: A Computational Framework for Identifying and Analyzing Differential DNA Loops from Sequencing Data" https://doi.org/10.1093/bioinformatics/btx623 Bioinformatics (Oxford, England), September 29, 2017.

  • AP - aggregation preference, a parameter to quantify TAD heterogeneity. Call significant interactions within a TAD, cluster them with DBSCAN, calculate the weighted interaction density within each cluster, then average. AP measures are reproducible. Comparison of TADs in Gm12878 and IMR90 - stable TADs change their aggregation preference, and these changes correlate with LINEs and Lamin B1 signal. Can detect structural changes (block split) in TADs.

    Paper

    Wang, X.-T., Dong, P.-F., Zhang, H.-Y., and Peng, C. (2015). "Structural heterogeneity and functional diversity of topologically associating domains in mammalian genomes." https://academic.oup.com/nar/article/43/15/7237/2414371 Nucleic Acids Research

  • diffHiC - Differential contacts using the full pipeline for Hi-C data. Explanation of the technology, binning. MA normalization, edgeR-based. Comparison with HOMER. Documentation.

    Paper

    Lun, Aaron T. L., and Gordon K. Smyth. "DiffHic: A Bioconductor Package to Detect Differential Genomic Interactions in Hi-C Data" https://doi.org/10.1186/s12859-015-0683-0 BMC Bioinformatics 16 (2015)

  • HiCCUPS Diff - differential loop analysis. Input - two .hic files and loop lists; output - lists of differential loops.

  • Meltron - a statistical framework to detect differences in chromatin contact density at genomic regions of interest

  • TAD callers

    TAD detection, benchmarking

    Architectural features

    Differential, timecourse TAD analysis

  • TADcompare - R package for differential and time-course TAD boundary analysis. Uses SpectralTAD score - spectral decomposition of Hi-C matrices - to statistically detect five types of differential TAD boundaries: merge, split, complex, shifted, strength change. In the time-course analysis, detects six types of boundary score changes: highly common, early appearing, late appearing, early disappearing, late disappearing, and dynamic TAD boundaries. Returns genomic coordinates and types of TAD boundary changes in BED format. Documentation, Bioconductor Package

    Paper

    Cresswell, Kellen G., and Mikhail G. Dozmorov. "TADCompare: An R Package for Differential and Temporal Analysis of Topologically Associated Domains" https://doi.org/10.3389/fgene.2020.00158 Frontiers in Genetics 11 (March 10, 2020)

  • Analysis of the Structural Variability of Topologically Associated Domains as Revealed by Hi-C - TAD variability among 137 Hi-C samples (69 when replicates are not counted) from 9 studies. HiCrep, Jaccard, TADsim to measure similarity. Variability does not come from genetics. Introduction to TADs. 10-70% of TAD boundaries differ between replicates. 20-80% differ between biological conditions. Much less variation across individuals than across tissue types. Lab-specific sources of variation - in situ vs. dilution ligation protocols; restriction enzymes contribute little. HiC-Pro to 100kb data, ICE normalization, Armatus for TAD calling. Table 1 - all studies and accession numbers.

    Paper

    Sauerwald, Natalie, Akshat Singhal, and Carl Kingsford. "Analysis of the Structural Variability of Topologically Associated Domains as Revealed by Hi-C" https://doi.org/10.1093/nargab/lqz008 NAR Genomics and Bioinformatics, 30 September 2019

  • BPscore - metric to compare two TAD segmentations. Formula, methods. More stable than Variation of Information (VI) and Jaccard Index (JI). Python implementation for calculating all three metrics.

    Paper

    Zaborowski, Rafał, and Bartek Wilczyński. "BPscore: An Effective Metric for Meaningful Comparisons of Structural Chromosome Segmentations" https://doi.org/10.1089/cmb.2018.0162 Journal of Computational Biology 26, no. 4 (April 2019)

  • Quantifying the Similarity of Topological Domains across Normal and Cancer Human Cell Types - Analysis of TAD similarity using variation of information (VI) metric as a local distance measure. Defining structurally similar and variable regions. Comparison with previous studies of genomic similarity. Cancer-normal comparison - regions containing pan-cancer genes are structurally conserved in normal-normal pairs, not in cancer-cancer. Kingsford-Group/localtadsim. 23 human Hi-C datasets, Hi-C Pro processed into 100kb matrices, Armatus to call TADs.

    Paper

    Sauerwald, Natalie, and Carl Kingsford. "Quantifying the Similarity of Topological Domains across Normal and Cancer Human Cell Types" https://doi.org/10.1093/bioinformatics/bty265 Bioinformatics (Oxford, England), (July 1, 2018)

  • DiffTAD - differential contact frequency in TADs between two conditions. Two tests: a permutation-based test comparing observed vs. expected median interactions, and a parametric test considering the sign of the differences within TADs. Both tests account for distance stratum.

    Paper

    Zaborowski, Rafal, and Bartek Wilczynski. "DiffTAD: Detecting Differential Contact Frequency in Topologically Associating Domains Hi-C Experiments between Conditions" https://doi.org/10.1101/093625 BioRxiv, January 1, 2016
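Two of the segmentation-comparison metrics mentioned in this section, the Jaccard index and the variation of information (VI), are simple enough to sketch. The function names and the input layout are hypothetical, the Jaccard index is computed on boundary positions (the cited papers may define it differently), TADs are assumed to tile bins 0..n_bins-1 without gaps or overlaps, and BPscore itself is not implemented here.

```python
import math
from collections import Counter


def boundaries(tads):
    """Set of boundary bins for a list of half-open (start, end) TADs."""
    return {b for start, end in tads for b in (start, end)}


def jaccard_index(tads_a, tads_b):
    """Shared boundaries divided by all boundaries seen in either set."""
    a, b = boundaries(tads_a), boundaries(tads_b)
    return len(a & b) / len(a | b)


def bin_labels(tads, n_bins):
    """Assign every bin the index of the TAD that contains it."""
    labels = [None] * n_bins
    for index, (start, end) in enumerate(tads):
        for position in range(start, end):
            labels[position] = index
    if None in labels:
        raise ValueError("TADs must cover every bin in this sketch")
    return labels


def entropy(counts, total):
    """Shannon entropy (natural log) of a partition given its block sizes."""
    return -sum(c / total * math.log(c / total) for c in counts)


def variation_of_information(tads_a, tads_b, n_bins):
    """VI = 2 * H(A, B) - H(A) - H(B); zero for identical segmentations."""
    labels_a = bin_labels(tads_a, n_bins)
    labels_b = bin_labels(tads_b, n_bins)
    h_a = entropy(Counter(labels_a).values(), n_bins)
    h_b = entropy(Counter(labels_b).values(), n_bins)
    h_joint = entropy(Counter(zip(labels_a, labels_b)).values(), n_bins)
    return 2 * h_joint - h_a - h_b


coarse = [(0, 4), (4, 10)]
fine = [(0, 4), (4, 7), (7, 10)]  # second TAD split in two
print(jaccard_index(coarse, fine))
print(variation_of_information(coarse, fine, n_bins=10))
```

The Jaccard index only looks at exact boundary matches, so a boundary shifted by one bin counts as a full mismatch, while VI compares how bins are grouped and changes more gradually.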

  • Prediction of 3D features

    SNP-oriented

    CNV and Structural variant detection

    Visualization

    De novo genome scaffolding

    3D modeling

    Deconvolution

    Haplotype phasing

    Papers

    Methodological Reviews

    General Reviews

    Technology

    Multi-omics

    Micro-C

    Multi-way interactions

    Imaging

    Normalization

    Spectral clustering

    Courses

    Labs

    Best-Labs-of-3D-Genome

    Misc