vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.

It's difficult to find any significant changes in 3D organization of chromatin in highly dissimilar regions #34

Open biozzq opened 3 years ago

biozzq commented 3 years ago

Dear all,

After running chess sim, I extracted the highly dissimilar regions, defined by low z-ssim and high SN values. After visualizing them, I found it difficult to see any significant changes in the 3D organization of chromatin in these regions. I have attached all the figures (https://drive.google.com/file/d/1TBcSvr6QKDqioP0EEefldHmhEjIPq4JE/view?usp=sharing). Does this mean my results have a high false positive (FP) rate? If so, do you have any suggestions for reducing the FPs? Thank you very much.

Best wishes, Zheng zhuqing

nickmachnik commented 3 years ago

Hi @biozzq, you can try using stricter thresholds for your SN and z-ssim values. If even the regions with the lowest z-ssim values don't show obvious changes, there might simply be none that chess can pick up. You should also check the variance of the ssim values; if they fluctuate only slightly around some global mean, this may also indicate that chess cannot detect any strong changes.
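As a rough way to check the spread of the ssim values suggested above, here is a minimal sketch using only the Python standard library. It assumes the chess sim output is a tab-separated file with a header containing an `ssim` column; the function name `summarise_ssim` and the toy rows are made up for illustration.

```python
import csv
import io
import statistics


def summarise_ssim(tsv_text):
    """Return count, mean, standard deviation and range of the ssim column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip empty or NaN entries defensively so they cannot distort the mean.
    ssim = [float(row["ssim"]) for row in reader
            if row["ssim"] not in ("", "nan", "NaN")]
    return {
        "n": len(ssim),
        "mean": statistics.mean(ssim),
        "std": statistics.stdev(ssim),
        "min": min(ssim),
        "max": max(ssim),
    }


# Toy input (illustrative values only, not real CHESS output).
example = (
    "ID\tSN\tssim\tz_ssim\n"
    "1\t0.4\t0.60\t-1.0\n"
    "2\t0.6\t0.70\t1.0\n"
    "3\t0.5\t0.65\t0.0\n"
)
summary = summarise_ssim(example)
print(summary)
```

For a real run, the file contents could be read with `open(path).read()` and passed in the same way. A small standard deviation relative to the mean would point towards the "fluctuating around a global mean" situation described above.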

liz-is commented 3 years ago

Hi,

It might help to know how you have selected your regions (what thresholds) and how you have plotted them. I had a quick look at some of your plots and found some, like the one below, where I can see a difference by eye that doesn't appear in the log2 fold change plot, so I wonder if there is an issue with the log2 fold change calculation or the plotting parameters.

[Image: 19600-chr15-61200001-63200001]

biozzq commented 3 years ago

Dear all,

Thanks for all your help. Sorry, I cannot access my data right now, but I will come back once I have access to my data and scripts.

Best wishes, Zheng zhuqing

biozzq commented 3 years ago

Dear all,

I am sorry for the late reply.

To @nickmachnik: The distribution of the ssim values is shown below; the mean and standard deviation of the ssim are 0.649287 and 0.167805, respectively. Can we use this information to derive more useful guidelines for the subsequent analysis?

[Image: AC-1_VS_AC-2]

To @liz-is: These regions were selected by SN > 0.5 and z_ssim < -1. I have modified the script to generate the figures in batch; the modified script is attached here: chess.visual.sim.results.zip. The command used to generate these figures is as follows:

awk '$2 > 0.5 && $4 <-1' AC-1_VS_AC-2.tsv | cut -f 1 | while read id; do python chess.visual.sim.results.py AC-1_VS_AC-2.tsv 2mb_win_100kb_step.bed AC-1_25K_KR.hic AC-2_25K_KR.hic $id AC-1 AC-2; done
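For readers who prefer Python to awk, the selection step (SN > 0.5 and z_ssim < -1) could be sketched as below. This assumes a tab-separated file with the header `ID`, `SN`, `ssim`, `z_ssim`; the function name `select_regions` and the toy rows are hypothetical.

```python
import csv
import io


def select_regions(tsv_text, min_sn=0.5, max_z_ssim=-1.0):
    """Return IDs of rows with SN > min_sn and z_ssim < max_z_ssim."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["ID"]
        for row in reader
        if float(row["SN"]) > min_sn and float(row["z_ssim"]) < max_z_ssim
    ]


# Toy rows (illustrative values only).
toy = (
    "ID\tSN\tssim\tz_ssim\n"
    "1\t0.62\t0.40\t-1.5\n"   # passes both filters
    "2\t0.45\t0.35\t-1.8\n"   # fails the SN filter
    "3\t0.70\t0.66\t0.1\n"    # fails the z_ssim filter
)
selected = select_regions(toy)
print(selected)
```

Keeping the two thresholds as parameters makes it easy to rerun the selection with stricter cutoffs.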

Thank you for all your help, best wishes!

Zheng zhuqing

liz-is commented 3 years ago

Have you looked at the regions with the lowest ssim? That's quite a skewed distribution, maybe z_ssim < -1 is not a strict enough threshold for this data.
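One simple way to pull out the lowest-ssim regions, independent of any z-score cutoff, is to sort by ssim and take the first few rows. The sketch below assumes the same tab-separated layout (`ID`, `SN`, `ssim`, `z_ssim`); `lowest_ssim` and the toy values are made up for illustration.

```python
import csv
import io


def lowest_ssim(tsv_text, n=4):
    """Return (ID, SN, ssim) for the n rows with the smallest ssim."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = sorted(reader, key=lambda row: float(row["ssim"]))
    return [(row["ID"], float(row["SN"]), float(row["ssim"]))
            for row in rows[:n]]


# Toy rows (illustrative values only).
toy = (
    "ID\tSN\tssim\tz_ssim\n"
    "10\t0.62\t0.71\t0.4\n"
    "11\t0.55\t0.12\t-3.1\n"
    "12\t0.30\t0.09\t-3.3\n"
    "13\t0.58\t0.66\t0.1\n"
)
lowest = lowest_ssim(toy, n=2)
print(lowest)
```

Returning SN alongside ssim makes it easy to see whether the most dissimilar windows are also the noisiest ones.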

biozzq commented 3 years ago

Dear @liz-is

Thank you. I have attached the figures of the regions with the lowest ssim. It is difficult for me to identify any significant changes in the 3D organization of chromatin.

ID      SN      ssim    z_ssim
20518   0.4043157338659144      0.07312256553562313     -3.4335304190818587
14467   0.36756863596467293     0.07342521262455935     -3.431726857620752
17217   0.5396703277611689      0.07417451962317224     -3.4272615206833117
17218   0.5315010852897838      0.07776181958389573     -3.4058837635691375

[Images: 14467-chr11-31000001-33000001, 17217-chr13-169100001-171100001, 17218-chr13-169200001-171200001, 20518-chr16-14500001-16500001]
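As a sanity check on the table above, the z_ssim values can be recomputed from the mean (0.649287) and standard deviation (0.167805) reported earlier in this thread. The sketch assumes z_ssim is a plain z-score of ssim; that is inferred from the numbers here, not taken from the CHESS documentation.

```python
MEAN_SSIM = 0.649287  # mean ssim reported earlier in this thread
STD_SSIM = 0.167805   # standard deviation reported earlier in this thread


def z_score(value, mean, std):
    """Standard z-score: distance from the mean in units of std."""
    return (value - mean) / std


# (ID, ssim, reported z_ssim) copied from the table above.
rows = [
    ("20518", 0.07312256553562313, -3.4335304190818587),
    ("14467", 0.07342521262455935, -3.431726857620752),
    ("17217", 0.07417451962317224, -3.4272615206833117),
    ("17218", 0.07776181958389573, -3.4058837635691375),
]

differences = []
for region_id, ssim, reported in rows:
    recomputed = z_score(ssim, MEAN_SSIM, STD_SSIM)
    differences.append(abs(recomputed - reported))
    print(region_id, round(recomputed, 4), round(reported, 4))
```

The recomputed values agree with the reported ones to within the rounding of the quoted mean and standard deviation. Note also that two of the four rows (20518 and 14467) have SN below 0.5, so they would not pass the SN > 0.5 filter mentioned earlier.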

Best wishes, Zheng zhuqing

nickmachnik commented 3 years ago

These look very noisy, I cannot see any changes either. Have you SN filtered the data? If not, what does your distribution of ssim look like after SN filtering?

liz-is commented 3 years ago

I agree with Nick that these regions look very noisy and that SN filtering may help. I don't know what your SN distribution looks like, but in some projects I've used SN above the genome-wide or chromosome-wide median as one of the filtering criteria, rather than using a fixed cutoff. That's a pretty relaxed threshold, however; you may want to use only the top 25% of SN values. This threshold also has to be adjusted to your data.
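A data-driven SN cutoff of the kind described above (median, or top 25%) could be sketched as follows with the standard library. The function name `filter_by_sn`, the `mode` labels and the toy values are hypothetical; the same function could be applied per chromosome for a chromosome-wide cutoff.

```python
import statistics


def filter_by_sn(regions, mode="median"):
    """Keep regions whose SN exceeds a data-driven cutoff.

    regions: list of (region_id, sn) tuples.
    mode: "median" keeps the upper half, "top25" keeps the upper quarter.
    Returns (cutoff, list of kept region IDs).
    """
    sn_values = [sn for _, sn in regions]
    if mode == "median":
        cutoff = statistics.median(sn_values)
    elif mode == "top25":
        # quantiles(n=4) returns the three quartile cut points; the last is Q3.
        cutoff = statistics.quantiles(sn_values, n=4)[2]
    else:
        raise ValueError("mode must be 'median' or 'top25'")
    return cutoff, [rid for rid, sn in regions if sn > cutoff]


# Toy regions with SN values 0.1, 0.2, ..., 0.8 (illustrative only).
toy = [(i, i / 10) for i in range(1, 9)]
median_cutoff, above_median = filter_by_sn(toy, mode="median")
q3_cutoff, top_quarter = filter_by_sn(toy, mode="top25")
print(above_median, top_quarter)
```

Because the cutoff is derived from the data itself, it adapts to datasets whose SN values are systematically higher or lower than a fixed threshold such as 0.5 would assume.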

I also think these regions look a bit strange, since none of them have any structures (boundaries, TADs, loops...) - they look almost like mitotic Hi-C data. This was also the case for many of the regions in the full set of plots, if I remember correctly. The datasets that CHESS was tested on all had clear structures. This is just speculation, but perhaps if your data is unusually unstructured, it'll be more tricky to find appropriate parameters or harder for CHESS to identify real differences. You might also want to think about using a different resolution or window size to avoid having so many windows without distinct features.

biozzq commented 3 years ago

Dear all,

Here I have attached a two-dimensional figure showing the joint distribution of SN and ssim. I cannot see many regions with low ssim and a high SN score. Does this mean that the two samples have similar 3D chromatin organization?

[Image: ssim_SN]

To @liz-is: do you have any experience identifying the optimal resolution when analyzing Hi-C data? As for window size, the recommendation is 100 times the resolution.

Best wishes, Zheng zhuqing

liz-is commented 3 years ago

The shape of the distribution looks similar to some I've seen before, in samples where I've been able to find at least a few regions that have visible changes and low ssim / higher SN.

I don't really know anything about your data, and as I said above it seems like it might be unusual, so it's hard to give recommendations. What I can say is that I always run CHESS with multiple window sizes (e.g. 100x, 150x) and resolutions (e.g. 5 kb, 10 kb, 20 kb) on any new dataset, to see how similar the ssim scores across the genome are between different parameter sets (they should be similar; if not, the score is driven by noise and you should consider using a lower resolution). You can also reduce false positives by only taking regions that come up with multiple parameter combinations.

I usually don't use the highest resolution that I would use for visualising my data for CHESS, i.e., if it looks good and/or reaches the threshold set by Rao et al. 2014 at 5 kb resolution, I would use 10 kb for CHESS. In my experience the highest possible resolution is often too noisy to give good results (for what it's worth, I find this to be the case for insulation / TAD calling, too). Nick might have some other suggestions as well.

I hope that helps. Really, I think you always need to try a few different parameter combinations until you find what gives robust results on any new dataset. It's easier for samples that have really obvious systematic changes, but most datasets don't have those.
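The idea of keeping only regions that come up with multiple parameter combinations could be sketched like this. Since window IDs differ between runs, the sketch matches regions by genomic overlap, which is an assumption about how "the same region" is defined; `overlaps`, `consensus_regions` and the toy coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def consensus_regions(runs, min_runs=2):
    """Return regions from the first run supported by at least min_runs runs.

    runs: list of runs, each a list of (chrom, start, end) candidate regions
    that passed the ssim/SN filters for one parameter combination.
    A run supports a region if any of its candidates overlaps that region;
    the first run always supports its own regions.
    """
    kept = []
    for region in runs[0]:
        support = sum(
            any(overlaps(region, other) for other in run) for run in runs
        )
        if support >= min_runs:
            kept.append(region)
    return kept


# Toy candidate lists for three parameter combinations (illustrative only).
runs = [
    [("chr1", 0, 2_000_000), ("chr2", 5_000_000, 7_000_000)],
    [("chr1", 500_000, 1_500_000)],
    [("chr1", 1_000_000, 4_000_000), ("chr3", 0, 3_000_000)],
]
robust = consensus_regions(runs, min_runs=3)
print(robust)
```

A stricter variant could require a minimum overlap fraction instead of a single shared base; which definition suits a given dataset is a judgment call.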

mujahida87 commented 3 years ago

Hi @liz-is, I'm a beginner and have a very basic question. Can we also use "fanc compare" instead of "chess" to see differences in 3D chromatin organization between two developmental stages of the same species? Thank you.

kaukrise commented 3 years ago

Hi @mujahida87,

thanks for the important question.

fanc compare does a simple, pixel-wise comparison of two Hi-C matrices, for example as a difference or fold-change matrix. If you already know where conformational changes are expected, this might be all you need. You can simply manually look at the region to confirm that expected differences are visible in the comparison matrix. The API docs have a number of examples that illustrate this: https://fan-c.readthedocs.io/en/latest/api/analyse/comparisons.html.
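To make "pixel-wise comparison" concrete, here is a minimal pure-Python sketch of a log2 fold-change matrix. It is not FAN-C's implementation, only an illustration of the idea; `log2_fold_change` and the toy matrices are made up, and pixels where either matrix is zero are marked as NaN by assumption.

```python
import math


def log2_fold_change(sample, control):
    """Pixel-wise log2(sample / control) for two equally sized matrices.

    Pixels where either value is not positive are returned as NaN,
    because the log ratio is undefined there.
    """
    result = []
    for row_s, row_c in zip(sample, control):
        result.append([
            math.log2(s / c) if s > 0 and c > 0 else float("nan")
            for s, c in zip(row_s, row_c)
        ])
    return result


# Toy 2x2 contact matrices (illustrative values only).
sample = [[4.0, 2.0], [2.0, 4.0]]
control = [[2.0, 2.0], [2.0, 8.0]]
fold_change = log2_fold_change(sample, control)
print(fold_change)  # [[1.0, 0.0], [0.0, -1.0]]
```

Positive values mark pixels with more contacts in the sample, negative values pixels with more contacts in the control.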

However, what if you don't know where the differences between two matrices are? It would be a daunting task to manually inspect every single region in the genome to find any differences by eye. For extremely high resolution matrices, such as Micro-C, this may be almost impossible. This is what CHESS does for you. When you feed it two matrices, it not only calculates the comparison, but it identifies and ranks genomic regions by how different they are! This greatly simplifies looking for 3D conformational differences automatically, in an unbiased manner. (Additionally, you can even classify differences into different types).

You can still use fanc compare and fancplot for visualisation, however.

I hope this clarifies the issue! Cheers, Kai

mujahida87 commented 3 years ago

Hi @kaukrise,

Thank you very much for the detailed explanation :) It really helps to get a basic understanding before starting my analysis.

I have run CHESS to find 3D chromatin differences between two of my samples and got the TSV file with the differences (image attached). What I'm confused about now is how to visualize these differences, as you have shown in the example. Is there a way to plot them using FAN-C or another visualization tool? I'm not familiar with Jupyter notebooks.

[Image: Screenshot 2021-05-27 at 10 59 24]

Thank you

kaukrise commented 3 years ago

The ID column corresponds to the region in the pairs file you specified when running CHESS. If you are familiar with FAN-C already, just plot that region using either fancplot or the Python API. You can also compute the comparison matrix using fanc compare, as you mentioned, and plot it the same way.
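Looking up the genomic region for a given ID could be sketched as below. The sketch assumes the pairs file is tab-separated and BEDPE-like, with chromosome, start and end in the first three columns and the region ID in the seventh; please check this against your own file. `region_for_id` and the toy rows are hypothetical.

```python
import csv
import io


def region_for_id(pairs_text, region_id):
    """Return 'chrom:start-end' for the row whose ID matches, else None.

    Assumes a BEDPE-like layout: chrom, start, end in columns 1-3 and
    the region ID in column 7 (an assumption about the pairs file).
    """
    for fields in csv.reader(io.StringIO(pairs_text), delimiter="\t"):
        if len(fields) >= 7 and fields[6] == str(region_id):
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            return f"{chrom}:{start}-{end}"
    return None


# Toy pairs file content (illustrative values only).
toy_pairs = (
    "chr1\t0\t2000000\tchr1\t0\t2000000\t0\t.\t+\t+\n"
    "chr1\t100000\t2100000\tchr1\t100000\t2100000\t1\t.\t+\t+\n"
)
region = region_for_id(toy_pairs, 1)
print(region)  # chr1:100000-2100000
```

The resulting region string can then be used with whichever plotting tool you prefer.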