vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.

Prepare the normalized hic input for CHESS #35

Open biozzq opened 3 years ago

biozzq commented 3 years ago

Dear all,

Recently, I found that FAN-C runs slowly, which I have discussed here (https://github.com/vaquerizaslab/fanc/issues/42). I will try the pipeline recommended by @kaukrise. Does anyone have experience preparing the input for CHESS without using FAN-C? I hope you can share your experience with us.

Best wishes, Zheng zhuqing

kaukrise commented 3 years ago

Since this seems to be a point of misunderstanding, I just want to note that CHESS uses FAN-C internally also for Juicer files. As I mentioned in the issue you linked, in my opinion you can just use the Juicer file generated from your HiC-Pro data directly as CHESS input without the need for further FAN-C preprocessing.

Also see here: https://fan-c.readthedocs.io/en/latest/fanc-executable/compatibility.html

biozzq commented 3 years ago

Dear @kaukrise

Thank you. I have tried Juicer, and the resulting hic file looks like the following. As you said, it can be used directly as input for chess sim (allValidPairs.hic@25kb@KR; the same goes for fanc.load when visualizing). Because Juicer is so much faster than FAN-C, I am not sure whether I have done this correctly, and I hope you can help me confirm it.

java -jar juicer_tools_1.22.01.jar validate allValidPairs.hic
Reading file: allValidPairs.hic
File has normalization: VC
Description: Coverage
File has normalization: VC_SQRT
Description: Coverage (Sqrt)
File has normalization: KR
Description: Balanced
File has normalization: SCALE
Description: Fast scaling
File has zoom: BP_2500000
File has zoom: BP_1000000
File has zoom: BP_500000
File has zoom: BP_250000
File has zoom: BP_100000
File has zoom: BP_50000
File has zoom: BP_25000
File has zoom: BP_10000
File has zoom: BP_5000
File has zoom: BP_1000

Best wishes, Zheng zhuqing

biozzq commented 3 years ago

Dear all,

When using the Juicer pipeline to prepare the input hic files for CHESS, I found that the results look very strange for comparisons between samples with unequal numbers of effective valid read pairs. I have attached some examples. Here, the numbers of valid pairs are 150M and 560M for mmCM10 and mm8WM, respectively. These results make me wonder whether CHESS takes the number of effective valid read pairs into account during normalisation. Thank you.

[Attached images: 6182-chr13-61700000-64700000, 7171-chr14-43100000-46100000]

Best wishes, Zheng zhuqing

kaukrise commented 3 years ago

Juicer scales matrices back up to the original number of reads after normalisation. Hence, the results you are seeing are not very strange, but absolutely expected for unequal read depth.

As CHESS comparisons are based on O/E matrices, this is already taken into account.

(To fix your image, though, scale the matrices to the same read depth before calculating the log2-fold change.)
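The point about O/E matrices can be illustrated with a small sketch. It assumes the simplest definition of "expected" (the mean value at each bin distance) and is not CHESS's or FAN-C's internal code; the function names and the toy matrix are made up for illustration. If two matrices differ only by a global depth factor, the factor cancels in observed/expected:

```python
import math

def expected_by_distance(matrix):
    """Mean contact value on each diagonal, i.e. at each bin distance."""
    n = len(matrix)
    return [
        sum(matrix[i][i + d] for i in range(n - d)) / (n - d)
        for d in range(n)
    ]

def obs_exp(matrix):
    """Divide every entry by the expected value at its bin distance."""
    n = len(matrix)
    expected = expected_by_distance(matrix)
    return [
        [matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]

# Toy symmetric matrix and a "deeper" copy with 3.7x as many contacts
# (hypothetical values, chosen only for the demonstration).
shallow = [
    [20.0, 8.0, 2.0, 1.0],
    [8.0, 18.0, 9.0, 3.0],
    [2.0, 9.0, 22.0, 7.0],
    [1.0, 3.0, 7.0, 19.0],
]
deep = [[3.7 * v for v in row] for row in shallow]

oe_shallow = obs_exp(shallow)
oe_deep = obs_exp(deep)

# The O/E values agree up to floating-point rounding.
same = all(
    math.isclose(a, b, rel_tol=1e-9)
    for row_a, row_b in zip(oe_shallow, oe_deep)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Real samples differ by sampling noise as well as by a global factor, so this only illustrates the global-factor part of the argument.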

biozzq commented 3 years ago

Dear @kaukrise ,

Thank you. Do you mean that chess sim takes the sequencing depth into account, but the visualization does not? If so, I need to scale the matrices before visualization, as you said. This brings me to another question: do I need to scale the matrices before visualization when using fanc to prepare the normalized hic input for CHESS (fanc hic -b 25k -r 0.25 -n -m KR -s ${input}.statistics -t 4 --reset-filters ${input}_fragment_level.hic ${input}_25K_KR.hic)? If so, I think we could add some comments here (https://github.com/vaquerizaslab/chess/blob/master/examples/dlbcl/example_analysis.ipynb) to help users better visualize the CHESS results.

Also, could you show us how to do the scaling in the example_analysis notebook?

All the best, Zheng zhuqing

biozzq commented 3 years ago

Dear @kaukrise

Sorry to bother you again. Could you show us an example of scaling the matrices to the same read depth when visualizing the results? Thank you.

Best, Zheng zhuqing

kaukrise commented 3 years ago

There are several ways of doing this. If you used FAN-C to do the normalisation with default settings for KR or ICE norm, there is no need to scale matrices.

If you have other types of normalisation, you can explicitly calculate a scaling factor for the whole matrix using

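import fanc

# Load both Hi-C files (the paths are placeholders)
hic1 = fanc.load('/path/to/first.hic')
hic2 = fanc.load('/path/to/second.hic')

# Factor used to scale the second matrix to the first
s = hic1.scaling_factor(hic2)

Then you would use s to scale m2, the matrix extracted from hic2, using m2_scaled = s * m2, and visualise m2_scaled as you normally would.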
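For readers without FAN-C at hand, the scaling step can be sketched in plain Python. This assumes the scaling factor is simply the ratio of total contacts between the two matrices, which is an assumption about what scaling_factor computes rather than something confirmed in this thread; the helper names, the pseudocount, and the toy matrices are hypothetical.

```python
import math

def total_contacts(matrix):
    """Sum of all entries in the matrix."""
    return sum(sum(row) for row in matrix)

def scale_to_reference(reference, query):
    """Return (s, scaled_query) with s = total(reference) / total(query)."""
    s = total_contacts(reference) / total_contacts(query)
    return s, [[s * v for v in row] for row in query]

def log2_fold_change(a, b, pseudocount=1.0):
    """Element-wise log2((a + p) / (b + p)); the pseudocount avoids log(0)."""
    return [
        [math.log2((x + pseudocount) / (y + pseudocount))
         for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]

# Toy example: m2 has the same structure as m1 but four times the depth.
m1 = [[10.0, 4.0], [4.0, 10.0]]
m2 = [[40.0, 16.0], [16.0, 40.0]]

s, m2_scaled = scale_to_reference(m1, m2)

unscaled = log2_fold_change(m2, m1)      # dominated by the depth difference
scaled = log2_fold_change(m2_scaled, m1)  # depth difference removed

print(s)
print(unscaled[0][0], scaled[0][0])
```

Without scaling, every bin shows a positive fold change that only reflects read depth; after scaling, the toy matrices are identical and the fold change is zero everywhere.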

biozzq commented 2 years ago

Hi @liz-is

I found that you downsampled the data to a common sequencing depth in the preprint, so it seems necessary to keep the data sets at a similar depth before running CHESS comparisons. Also, it seems that you applied a downsampling algorithm to the BAM files. Is this right?

Thanks for your attention.

Best wishes, Zheng zhuqing

liz-is commented 2 years ago

Hi, if you're just comparing two samples (i.e. making one CHESS comparison), the sequencing depth is not so important. As Kai said above, since CHESS works on the obs/exp matrices, unequal sequencing depth between the query and reference is not an issue. But if you are making multiple CHESS comparisons involving samples of very different sequencing depths, the sequencing depth can be a confounding factor.

In the preprint we used fanc downsample to downsample valid contacts from Hi-C matrices or pairs files. You'll need to convert matrices to FAN-C .hic format before downsampling if they are not already in that format, and you'll need to renormalise them after.
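The idea of downsampling valid contacts can be sketched as random thinning of a raw count matrix. This is only an illustration of the concept and not how fanc downsample is implemented; the function names, the seed, and the toy counts are hypothetical, and the keep fraction reuses the 150M and 560M depths mentioned earlier in the thread.

```python
import random

def downsample_counts(matrix, keep_fraction, seed=0):
    """Keep each contact independently with probability keep_fraction.

    Expects a symmetric matrix of raw integer counts. Only the upper
    triangle is sampled and then mirrored, so the result stays symmetric.
    """
    rng = random.Random(seed)
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            kept = sum(1 for _ in range(matrix[i][j])
                       if rng.random() < keep_fraction)
            out[i][j] = kept
            out[j][i] = kept
    return out

def upper_total(matrix):
    """Number of contacts, counting each bin pair once."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(i, n))

# Toy raw counts for the deeper sample (hypothetical values).
deep_counts = [
    [560, 120, 30],
    [120, 540, 110],
    [30, 110, 580],
]

keep_fraction = 150 / 560  # ratio of the two depths from the thread
down = downsample_counts(deep_counts, keep_fraction)

print(upper_total(deep_counts), upper_total(down))
```

Thinning applies to raw counts, which is consistent with the advice above to renormalise the matrices after downsampling.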