open2c / coolpuppy

A versatile tool to perform pile-up analysis on Hi-C data in .cool format.
MIT License
77 stars 11 forks source link

`pu.pileupsWithControl` might skip some pile-up regions or raise vague `domain_score not in dict2` AssertionError #146

Open nvaulin opened 3 months ago

nvaulin commented 3 months ago

Hi, I bumped into some unclear behavior of coolpup.py. Finally, I managed to overcome the problem, but maybe this issue might be helpful for other users and might help to improve this awesome tool.

TLDR:

AssertionError: domain_score not in dict2 from pu.pileupsWithControl means there are regions you want to pile-up may be intersected or close to bad bins.

Story

I tried to get rescaled pileup of TADs with calculation of TADs scores (strengths). I've done everything according to the Distribution of TAD strength scores part of the tutorial. For your example HFF_MicroC data everything worked fine, but for my data the behavior was strange:

I found out that there were too many NaNs in the expected df. (Yes, I should pay more attention to my data, but back to the coolpup.py). I recalculated expected without respect to the centromeres and the AssertionError disappeared.

But then, after pileup ran without any errors, I found that length of pup.loc[0].domain_score is less the len(tads). I tried to record all the TADs processed by the add_domain_score and found that some of the TADs are missed:

image

So, these unprocessed TADs were skipped silently, without any warning, When I tried to run pu.pileupsWithControl specifically on the unprocessed TADs, it gave me AssertionError: domain_score not in dict2 again.

These unprocessed TADs were first or last TADs on their chromosomes and, apparently, are intersected with Hi-C map black stripes ("bad bins"). For their snippets data was containing only NaNs. After I performed more strict TADs filtering - pileup ended up fine for my data.

So the conclusions for the users:

Conclusions for the coolpuppy team

Again, in my case it just solved by more strict filtering, however probably there may be cases when such TADs skippings may spoil the results.

I don't know weather this issue might be useful, especially I am not able to provide my data right now. You can close it if you want. If you will be discovering this case, then I can assist you.

System

Sincerely, Nikita

Phlya commented 3 months ago

Thank you for reporting it Nikita! This is an important issue, at the very least regions should not be silently skipped for no clear reason... It would be more useful if you could provide some reproducible example even with some random public data... But we'll try to look into it when we find the time.