We provide PyTorch implementations of both the prediction and training procedures.
DeepHiC is a deep learning model for boosting the resolution of Hi-C data. It is based on a Generative Adversarial Network (GAN): the generator takes low-resolution data as conditional input and outputs enhanced Hi-C matrices.
We trained DeepHiC for 300 epochs on chromosomes 1-14 of the GM12878 cell line from Rao's Hi-C data and evaluated it on the remaining data, including data from two other cell lines (IMR90 and K562).
The framework of DeepHiC. With 10-kb high-resolution data as the ground truth, DeepHiC predicts enhanced outputs from low-resolution Hi-C data (1/10 to 1/100 of the reads, randomly downsampled). The structural similarity index (SSIM) between the enhanced outputs and the real 10-kb high-resolution data ranges from 0.8 (1% of reads) to 0.9 (4% of reads) on average.
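For readers unfamiliar with SSIM, the sketch below computes a single-window (global) SSIM between two matrices with NumPy. It only illustrates the standard formula: the function name `global_ssim` and the `data_range` parameter are assumptions of this example, and the scores quoted above may come from a windowed variant, so values from this sketch are not directly comparable with them.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally shaped matrices (illustrative only)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("inputs must have the same shape")

    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


# Toy matrices scaled to [0, 1]; real Hi-C matrices would need normalizing first.
rng = np.random.default_rng(0)
target = rng.random((8, 8))
same_score = global_ssim(target, target)
zero_score = global_ssim(target, np.zeros_like(target))
```

An identical pair scores 1, and the score drops as the two matrices diverge.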
DeepHiC is written in Python 3 with the PyTorch framework. It requires Python 3.6 or later.
Other Python packages used in this repo (the version numbers are recommended):
Note:
- visdom must be running in the background.
- GPU acceleration is strongly recommended.
We used Rao's Hi-C data from GEO. Take the data of the GM12878 cell line as an example.
The string variable root_dir, defined in all_parser.py, determines the directory that stores all raw and processed data. Here we unzip the Hi-C data into /data/RaoHiC/raw/GM12878. root_dir can be customized to match the folder where you store your data, but raw data must be put in $root_dir/raw/[cell_line_name].
We have also uploaded the processed data used for training our model, as well as the processed data of a GM12878 replicate, which can be used for prediction directly. Just move the data folder to $root_dir/data.
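As a quick reference, the directory layout used throughout this README can be written down with `pathlib`. This is a minimal sketch, not code from the repo: the helper name `cell_line_dirs` is hypothetical, and the value `/data/RaoHiC` for root_dir is inferred from the example path above.

```python
from pathlib import Path


def cell_line_dirs(root_dir, cell_line):
    """Return the folders this README refers to for one cell line (hypothetical helper)."""
    root = Path(root_dir)
    return {
        "raw": root / "raw" / cell_line,  # unzipped raw Hi-C data
        "mat": root / "mat" / cell_line,  # chrN_10kb.npz matrices
        "data": root / "data",            # processed data for training and prediction
    }


dirs = cell_line_dirs("/data/RaoHiC", "GM12878")
for name, path in dirs.items():
    print(f"{name}: {path}")
```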
Take the GM12878 cell line as an example. Its raw data is stored in the directory $root_dir/raw/GM12878:
python data_aread.py -c GM12878
Note: We only provide a script for parsing Rao's data. If you are using a different type of Hi-C data, please replace this script with your own.
Note: We currently also provide a script for converting HiC-Pro output to our input format in ./scripts/hicpro2deephic.py. The chrN_10kb.npz files for one cell line should be put in $root_dir/mat/[cell_line_name].
Note: In addition to the Hi-C matrices, we also store the indices of bins whose sum is greater than zero for each chromosome, in a variable named compact in the same .npz file.
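To make the compact index concrete, here is a minimal NumPy sketch under stated assumptions: the helper name `compact_index` and the key name `hic` for the matrix are illustrative guesses, and only the name compact comes from this README. The file is written to an in-memory buffer so the example has no side effects.

```python
import io

import numpy as np


def compact_index(mat):
    """Indices of bins whose contact sum is greater than zero."""
    return np.flatnonzero(np.asarray(mat).sum(axis=0) > 0)


# Toy symmetric contact matrix: bin 1 has no contacts at all.
mat = np.array([[2, 0, 1],
                [0, 0, 0],
                [1, 0, 3]])
compact = compact_index(mat)  # bins 0 and 2

# Store matrix and index side by side ("hic" is an assumed key name).
buf = io.BytesIO()
np.savez_compressed(buf, hic=mat, compact=compact)
buf.seek(0)

loaded = np.load(buf)
idx = loaded["compact"]
dense = loaded["hic"][np.ix_(idx, idx)]  # matrix restricted to non-empty bins
```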
python data_downsample.py -hr 10kb -lr 40kb -r 16 -c GM12878
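Conceptually, a downsampling factor of 16 keeps roughly one read in every sixteen. The sketch below illustrates this with binomial thinning, where each read is kept independently with probability 1/ratio. It is a stand-in chosen for brevity and is not necessarily how data_downsample.py does it; the function name `downsample_reads` and its parameters are assumptions of this example.

```python
import numpy as np


def downsample_reads(mat, ratio, seed=None):
    """Randomly thin a symmetric integer contact matrix to about 1/ratio of its reads."""
    rng = np.random.default_rng(seed)
    # Work on the upper triangle so each contact is sampled exactly once.
    upper = np.triu(np.asarray(mat, dtype=np.int64))
    kept = rng.binomial(upper, 1.0 / ratio)
    # Mirror the strict upper triangle to restore symmetry.
    return kept + np.triu(kept, k=1).T


# Toy high-coverage matrix with 160 reads in every bin pair.
high = np.full((4, 4), 160, dtype=np.int64)
low = downsample_reads(high, ratio=16, seed=0)
```

The result stays symmetric and never exceeds the original count in any bin pair.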
- -hr: the resolution of high-resolution data
- -lr: the low-resolution (just for convenience) for downsampled data, e.g. I named 1/16 downsampled data as chrN_40kb.npz
- -r: the downsampling factor
- -c: cell_line_name of data
python data_generate.py -hr 10kb -lr 40kb -s train -chunk 40 -stride 40 -bound 201 -scale 1 -c GM12878
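The -chunk, -stride and -bound options of data_generate.py describe how a chromosome matrix is cut into small samples. The sketch below shows one plausible reading: square chunks taken every stride bins, kept only when the two chunk offsets are at most bound bins apart. The function `divide` and its handling of trailing bins (dropped here, with no padding) are assumptions of this example, not the repo's implementation.

```python
import numpy as np


def divide(mat, chunk=40, stride=40, bound=201):
    """Cut a square matrix into chunk x chunk samples near the diagonal."""
    mat = np.asarray(mat)
    size = mat.shape[0]
    chunks, coords = [], []
    for i in range(0, size - chunk + 1, stride):
        for j in range(0, size - chunk + 1, stride):
            # Skip chunk pairs beyond the genomic distance bound (in bins).
            if abs(i - j) <= bound:
                chunks.append(mat[i:i + chunk, j:j + chunk])
                coords.append((i, j))
    return np.array(chunks), coords


# Toy 120 x 120 matrix with a small bound so the distance filter is visible.
toy = np.arange(120 * 120).reshape(120, 120)
samples, coords = divide(toy, chunk=40, stride=40, bound=40)
```

With these toy settings the chunks starting at (0, 80) and (80, 0) are skipped because their offsets are 80 bins apart, which exceeds the bound of 40.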
- -hr: the high-resolution data used for training
- -lr: the low-coverage (downsampled) data used as training input
- -s: the dataset to generate: train/test/human/mouse
- -chunk: the chunk size of each small sample
- -stride: should be the same as -chunk
- -bound: the upper bound of genomic distance, e.g. 201 means 200 x 10kb
- -scale: whether to pool the input matrices; the current version does not, so it should be left at 1, the default
- -c: cell_line_name of the data
python train.py
python data_predict.py -lr 40kb -ckpt save/a_suitable_parameter_file.pytorch -c GM12878
- -lr: the low-coverage (downsampled) data we used for input
- -ckpt: trained parameters
- -c: cell_line_name of data