Questions when I process the Hi-C data

tanjimin / C.Origami

C.Origami, a prediction and screening framework for cell type-specific 3D chromatin structure.


Questions when I process the Hi-C data #23

Closed YCMaCY closed 1 year ago

YCMaCY commented 1 year ago

Hi, dear author! I ran into problems when trying to train the model on hg19. Whether I processed the data with nf-core/hic (https://nf-co.re/hic) or converted the .mcool file from 4DN into .npz, I could not train the model properly, and the prediction result was empty. In addition, I found that the training loss was very low. Could you please tell me more about how to process the Hi-C data?

tanjimin commented 1 year ago

Hi @YCMaCY, there are two parts to your question:

1. Training with hg19 does not work: We processed all genomic features based on hg38, so if you want to train on hg19 you would need the hg19 versions of the DNA sequence, CTCF, ATAC-seq, and Hi-C data. It could work, but we haven't tried it and don't know what would happen. I don't recommend this route.
2. How to generate an hg38 Hi-C matrix: As mentioned in the main README file (https://github.com/tanjimin/C.Origami#hi-c-data), you can use HiC-Pro to generate a .cool file from fastq sequencing data. Then you can use our script (https://github.com/tanjimin/C.Origami/blob/main/src/corigami/preprocessing/cool2npy.py) to convert the .cool file to .npz data, which can be used for training.
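
Since the symptoms reported above were an empty prediction and a very low training loss, one quick check is whether the converted .npz file actually contains contacts. The snippet below is only a diagnostic sketch using NumPy; it is not part of C.Origami, the function name `summarize_npz` is hypothetical, and it makes no assumption about which array names the conversion script writes.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, fraction_nonzero)} for every array in a .npz file.

    `path` may be a filename or an open binary file-like object.
    """
    summary = {}
    with np.load(path) as data:
        for name in data.files:
            arr = data[name]
            # Fraction of entries that are non-zero; 0.0 for an empty array.
            frac = np.count_nonzero(arr) / arr.size if arr.size else 0.0
            summary[name] = (arr.shape, float(frac))
    return summary
```

If every array reports a non-zero fraction close to 0, the training target is effectively empty, which would be consistent with a near-zero loss and blank predictions.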

YCMaCY commented 1 year ago

Thank you very much for your reply. I realized that the .mcool file output by the nf-core/hic pipeline is not compatible with corigami, so I reprocessed the Hi-C data with HiC-Pro and have now successfully trained the model on hg19. I have another question about Hi-C data preprocessing: did you balance the .mcool file (e.g. with KR normalization) before converting it into a .npz file?

tanjimin commented 1 year ago

Hi @YCMaCY, glad to know it worked on hg19! When I trained the model, I used hic-bench (a tool from our group) and the ICE normalization that comes with it. I think KR is better than no normalization. Normalization usually gives better-quality and more biologically accurate results, given the biases introduced by restriction enzymes.
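
For readers unfamiliar with ICE (iterative correction), the idea can be sketched in a few lines of NumPy. This is a generic illustration of the technique, not the hic-bench implementation; the function name `ice_balance` and the parameters `max_iter` and `tol` are hypothetical, and the input is assumed to be a dense, symmetric contact matrix.

```python
import numpy as np


def ice_balance(matrix, max_iter=200, tol=1e-6):
    """Iteratively rescale a symmetric contact matrix so that all
    non-empty rows end up with (approximately) the same sum.

    Returns the balanced matrix and the accumulated per-bin bias vector.
    """
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    keep = m.sum(axis=1) > 0  # bins without any contacts are left untouched
    if not keep.any():
        return m, bias
    for _ in range(max_iter):
        row_sums = m.sum(axis=1)
        scale = np.ones_like(row_sums)
        # Relative coverage of each bin, normalized to mean 1.
        scale[keep] = row_sums[keep] / row_sums[keep].mean()
        # Divide each entry by the coverage of its row bin and column bin.
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale[keep] - 1.0).max() < tol:
            break
    return m, bias
```

After balancing, each bin has roughly equal total coverage, which removes multiplicative per-bin biases such as uneven restriction-site density. KR normalization targets the same equal-row-sum property with a different solver.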