vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.

Could not compute similarity for... #30

Closed liz-is closed 3 years ago

liz-is commented 3 years ago

Hi folks,

Some of my region pairs are being deemed invalid, but I don't think they fall into any of the possible reasons given. Do you have any other ideas what the issue might be? Is there a way I can get more diagnostic info to try to debug this myself (without having to dig deep into the code and run each step manually, which I can do if necessary)?

Here's the error message:

2021-01-15 14:35:17,634 INFO Running '/home/research/vaquerizas/liz/project_ko/.ko_venv/bin/chess sim data/hic/ko_Rep1/hic/ko_Rep1_10kb.hic data/hic/wt_Rep1/hic/wt_Rep1_10kb.hic data/chess/dm6_pairs_150x_10kb.bedpe data/chess/ko_Rep1_vs_wt_Rep1/genome_scan_150x_10kb.txt -p 8'
2021-01-15 14:35:26,020 INFO CHESS version: 0.3.6
2021-01-15 14:35:26,021 INFO FAN-C version: 0.9.11
2021-01-15 14:35:26,052 INFO Loading reference contact data
2021-01-15 14:38:42,767 INFO Loading query contact data
2021-01-15 14:43:31,332 INFO Loading region pairs
2021-01-15 14:43:31,690 INFO Launching workers
2021-01-15 14:43:33,110 INFO Submitting pairs for comparison
2021-01-15 14:45:01,759 INFO Could not compute similarity for 6316 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins
2021-01-15 14:45:20,267 INFO Finished '/home/research/vaquerizas/liz/project_ko/.ko_venv/bin/chess sim data/hic/ko_Rep1/hic/ko_Rep1_10kb.hic data/hic/wt_Rep1/hic/wt_Rep1_10kb.hic data/chess/dm6_pairs_150x_10kb.bedpe data/chess/ko_Rep1_vs_wt_Rep1/genome_scan_150x_10kb.txt -p 8'
Closing remaining open files:data/hic/ko_Rep1/hic/ko_Rep1_10kb.hic...donedata/hic/wt_Rep1/hic/wt_Rep1_10kb.hic...done

This is Drosophila Hi-C data. I've tried different resolutions and two different window sizes (100x and 150x the bin size). The pairs file for each parameter combo was generated with chess pairs from the same text file with the chromosome sizes (and these files look okay to me from a quick glance).

In each case, all bins from certain chromosomes are missing, in particular chr 2R and 3R. However, I do get results for these chromosomes at 25kb resolution, so I don't think there is a chromosome naming mismatch between the files or anything like that.

[Image: Screenshot 2021-01-15 at 15 35 34]

(N.B. It makes sense that there are no valid pairs on chr 4 at 25kb resolution, since I'm using a window size of at least 2.5 Mb, which is larger than the chromosome size. The same applies to 10 kb resolution with the 150x window size.)
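The window-size arithmetic in the note above can be checked with a short sketch. The chromosome sizes here are approximate dm6 values included only for illustration (an assumption, not taken from the files in this thread), and `window_fits` is a hypothetical helper, not part of CHESS.

```python
# Approximate dm6 chromosome sizes in bp. These are illustrative assumptions;
# in practice, read them from the chromosome sizes file given to `chess pairs`.
CHROM_SIZES = {"chr2R": 25_286_936, "chr4": 1_348_131}


def window_fits(chrom_size_bp, bin_size_bp, window_bins):
    """Return True if a window of `window_bins` bins fits on the chromosome."""
    return bin_size_bp * window_bins <= chrom_size_bp


# (bin size in bp, window size in bins) combinations mentioned in the thread.
COMBOS = [(25_000, 100), (25_000, 150), (10_000, 100), (10_000, 150)]

for chrom, size in CHROM_SIZES.items():
    for bin_size, window_bins in COMBOS:
        window_bp = bin_size * window_bins
        status = "fits" if window_fits(size, bin_size, window_bins) else "too large"
        print(f"{chrom}: {window_bins}x {bin_size} bp = {window_bp} bp window -> {status}")
```

Under these assumed sizes, chr4 cannot hold a 2.5 Mb or 1.5 Mb window, while chr2R holds every combination, so window size alone would not explain the missing 2R and 3R results.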

I would have thought that it would be a resolution issue (i.e. too many unmappable bins), but having plotted each chromosome at 10kb resolution in both my query and my reference, they look fine. There are some unmappable bins, but I'd still expect to get some results; these chromosomes don't look any worse than the others.

[Image: wt_Rep1_10kb_2R]

I'm happy to look into this further myself since I have some familiarity with the code by now, but I'm not really sure where to start. Do you have any ideas?

I am using a development version of FAN-C, but @kaukrise said that it should work fine.

Also, as a more general comment, would it be possible to implement a more informative version of this message?

2021-01-15 14:45:01,759 INFO Could not compute similarity for 6316 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins

I've seen other questions relating to this, so it seems like a common issue/point of confusion. Although most of the time this is easy to solve, it would be helpful to know which of those three possibilities accounts for the invalid pairs as a starting point for debugging.
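To illustrate the kind of per-reason breakdown requested here, a minimal sketch follows. It is not CHESS code: `invalid_reason`, `summarize`, the `min_bins` threshold and the cutoff semantics are all hypothetical names and assumptions chosen for the example.

```python
from collections import Counter


def invalid_reason(start, end, chrom_size, bin_size,
                   unmappable_fraction, min_bins=20, cutoff=0.1):
    """Return why a region would be rejected, or None if it looks valid.

    All thresholds are hypothetical defaults used only for this sketch.
    """
    if start < 0 or start >= end or end > chrom_size:
        return "faulty coordinates"
    if (end - start) // bin_size < min_bins:
        return "region too small"
    if unmappable_fraction > cutoff:
        return "too many unmappable bins"
    return None


def summarize(reasons):
    """Build one log-style line with a count per failure reason."""
    counts = Counter(r for r in reasons if r is not None)
    total = sum(counts.values())
    detail = ", ".join(f"{reason}: {n}" for reason, n in sorted(counts.items()))
    return f"Could not compute similarity for {total} region pairs ({detail})"


# Made-up regions: (start, end, chrom_size, bin_size, unmappable_fraction)
regions = [
    (0, 1_500_000, 25_000_000, 10_000, 0.02),           # valid
    (24_000_000, 25_500_000, 25_000_000, 10_000, 0.0),  # runs past chromosome end
    (0, 100_000, 25_000_000, 10_000, 0.0),              # only 10 bins
    (0, 1_500_000, 25_000_000, 10_000, 0.60),           # mostly unmappable
    (0, 1_500_000, 25_000_000, 10_000, 0.35),           # mostly unmappable
]
print(summarize(invalid_reason(*r) for r in regions))
```

A message split by reason like this would show at a glance whether the invalid pairs come from coordinates, region size, or mappability.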

kaukrise commented 3 years ago

Hey @liz-is ,

thank you for the detailed bug report. Can you please try to plot the O/E matrix of a chromosome (or part thereof) that fails? I have a suspicion that the expected values might be the issue here, in which case this is probably related to the FAN-C dev version.

Thanks!

liz-is commented 3 years ago

Thanks for looking into this Kai! Here's the O/E matrix for the same dataset and chromosome.

[Image: wt_Rep1_10kb_2R_oe]

nickmachnik commented 3 years ago

Hey Liz! There is a lot of white in this matrix, which according to the colorbar is oe=1. Are all these values actually 1, or very, very close to 1?

1 is the default masking value for unmappable pixels in chess. All-1 matrix rows are marked as unmappable: a row is flagged if its row sum equals the row length (looking at the code now, this already doesn't seem ideal to me). This is not done for the whole chromosome matrix, but only on the submatrices that are compared, so a row doesn't have to be all 1 for the whole chromosome, only within a particular compared region, in order to be marked as unmappable.

You could try to increase the fraction of unmappable bins that chess permits with --mappability-cutoff (maybe 0.5 or even higher?). This is not a fix, but it might show whether this bug has something to do with false masking or with the computation of the oe values.
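The row-sum heuristic described above can be sketched in plain Python to show why it may over-flag rows. This is an illustration with made-up values, not the CHESS implementation; the function names are hypothetical, and the comparison against a cutoff assumes that a submatrix is rejected when its flagged fraction exceeds the cutoff.

```python
import math


def flag_by_row_sum(matrix):
    """Flag a row as unmappable when its sum equals its length."""
    return [math.isclose(sum(row), len(row)) for row in matrix]


def flag_all_ones(matrix):
    """Stricter alternative: flag a row only when every value equals 1."""
    return [all(value == 1.0 for value in row) for row in matrix]


# Made-up O/E submatrix for one compared region.
submatrix = [
    [1.0, 1.0, 1.0, 1.0],  # truly masked row
    [0.5, 1.5, 0.5, 1.5],  # real signal, but the sum happens to equal 4
    [0.8, 1.3, 2.0, 0.4],  # real signal, sum is 4.5
    [1.0, 1.0, 1.0, 1.0],  # truly masked row
]

by_sum = flag_by_row_sum(submatrix)
strict = flag_all_ones(submatrix)
fraction_by_sum = sum(by_sum) / len(by_sum)
fraction_strict = sum(strict) / len(strict)

cutoff = 0.5  # hypothetical value, as one might pass via --mappability-cutoff
print("row-sum flags:", by_sum, "fraction:", fraction_by_sum)
print("all-ones flags:", strict, "fraction:", fraction_strict)
print("rejected with row-sum rule:", fraction_by_sum > cutoff)
print("rejected with all-ones rule:", fraction_strict > cutoff)
```

If the O/E values over a region are all close to 1, as the white areas in the plot suggest, many rows could be flagged this way even though the underlying bins are mappable.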

kaukrise commented 3 years ago

Hi @nickmachnik ,

this was an issue with the FAN-C development version, which we figured out independently, so I am closing this!