open2c / coolpuppy

A versatile tool to perform pile-up analysis on Hi-C data in .cool format.
MIT License

[Q] How to calculate the significance of the clustering difference between two comparison groups? #117

Closed YuboWang1994 closed 1 year ago

YuboWang1994 commented 1 year ago

Hi,

Thanks for this amazing and very powerful software.

These days I'm trying to figure out the Hi-C clustering difference of different experimental groups (treated vs WT group) based on the following steps:

Firstly, I applied the juicer pipeline where I aligned the Hi-C data to the reference genome and converted the alignment result to .hic file, i.e. treated.hic and WT.hic.

Secondly, I converted .hic file to .cool file by applying 'hic2cool convert' and 'cooler balance'.

Thirdly, I applied coolpup.py for each .cool file and obtained .clpy with the following script:

    python coolpup.py --features_format bed WT.cool list.bed -p 10 -o WT.list.clpy
    python coolpup.py --features_format bed treated.cool list.bed -p 10 -o treated.list.clpy

The list.bed file is a 3-column BED file containing the chromosome, start site and end site, respectively.

Finally, I applied plotpup.py and obtained the gene clustering heatmaps. The two figures look consistent with what I expected: gene clustering should decrease in the treated sample (the second figure):

[image: WT pileup] [image: treated pileup]

However, my concern is how to calculate the statistical difference between the two groups, i.e. how to show that the gene clustering in these two groups is significantly different. Since these figures were drawn from the .clpy files, is there any way for me to extract a value from them that I can use in such a calculation?

Thank you and I'm looking forward to receiving your reply :).

Phlya commented 1 year ago

Hi, happy that coolpuppy is helpful in your work!

First of all, from the pileups it looks like the second one is much noisier... Do you have far fewer regions there? Or, if it's the same regions, is the Hi-C data much worse in some way (fewer reads, lower cis/trans ratio, overall bad quality)? That would be a problem.

Otherwise, when the result is so clear, there is really no need to invent any statistics, in my opinion. If you do want statistics, I think the best approach is to have replicate Hi-C libraries and compare the enrichment across replicates.
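
As a rough illustration of comparing enrichment across replicates, here is a minimal sketch. It assumes each replicate's pileup is already available as a square 2D NumPy array; how you load it depends on your coolpuppy version and is not shown. The function names `central_enrichment` and `compare_groups` are hypothetical helpers written for this example, not part of coolpuppy. The summary value is the mean of the central n x n pixels, which is one common choice and not the only possible one.

```python
import numpy as np


def central_enrichment(pileup, n=3):
    """Mean of the central n x n pixels of a square pileup matrix (NaNs ignored)."""
    pileup = np.asarray(pileup, dtype=float)
    if pileup.ndim != 2 or pileup.shape[0] != pileup.shape[1]:
        raise ValueError("expected a square 2D matrix")
    size = pileup.shape[0]
    # n must fit inside the matrix and share its parity so the window is centred
    if n < 1 or n > size or (size - n) % 2 != 0:
        raise ValueError("n must be <= matrix size and have the same parity")
    lo = (size - n) // 2
    return float(np.nanmean(pileup[lo:lo + n, lo:lo + n]))


def compare_groups(wt_pileups, treated_pileups, n=3):
    """Per-replicate central enrichment for two groups, plus a log2 ratio of group means."""
    wt = np.array([central_enrichment(p, n) for p in wt_pileups])
    treated = np.array([central_enrichment(p, n) for p in treated_pileups])
    return {
        "wt": wt,
        "treated": treated,
        "log2_ratio_of_means": float(np.log2(treated.mean() / wt.mean())),
    }
```

With one value per replicate in each group, the two sets of values can then be compared with whatever test suits the number of replicates. With only one library per condition, no meaningful test is possible, which is why replicates matter here.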

Phlya commented 1 year ago

Assuming this is resolved, feel free to reopen.