open2c / coolpuppy

A versatile tool to perform pile-up analysis on Hi-C data in .cool format.
MIT License

[Q] Why is pile-up analysis of interactions between a set of regions so time-consuming? #132

Closed jiangshan529 closed 1 year ago

jiangshan529 commented 1 year ago

Hi, I am trying to run pile-ups of interactions between a set of regions using a bed file. I am using 16 CPUs and 60 GB of memory; however, for thousands of peaks from the bed file, it has been running for 3 days and I still haven't got a result. Is there a way to increase the efficiency?

The command I am using is:

coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --outname cc.txt --flank 30000 --n_proc 16 --clr_weight_name ""

efriman commented 1 year ago

Hi,

Sorry you're having trouble. Without more details it's hard to know. Are you getting any output at all? Thousands of peaks is certainly not too much and shouldn't take long. Which version are you using?

jiangshan529 commented 1 year ago

Hi, Elias. I am using 1.0.0. The cool file is at 5 kb resolution, and the bed file looks like this:

chr1	629947	629948
chr1	634029	634030
chr1	869978	869979
chr1	904778	904779
chr1	921225	921226

Another weird thing: when I plot local regions using this dataset, it gives some unexpected diagonal lines (as circled in red in the attached image).

How should I deal with this? Thanks!

efriman commented 1 year ago

For the stalled command, try running it with --nproc 1 and see if that gives you any errors. You can also try using --subset to use fewer regions to see if it's a speed issue or something else.

Regarding the lines in your plot, that's most likely coming from the data itself. So have a look at some regions you piled up in your cooler and see if you can spot anything weird.
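The two checks suggested above could look like this as concrete commands. This is only a sketch: it reuses the placeholder file names (aa.cool, bb.bed) and flags from the original command in this thread.

```shell
# Sketch of the suggested diagnostics, reusing the placeholder
# file names (aa.cool, bb.bed) from this thread.

# 1) Run single-process so any worker error surfaces directly:
coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 \
    --outname cc_test.txt --flank 30000 --n_proc 1 --clr_weight_name ""

# 2) Pile up a random subset of 100 regions to separate
#    "too slow" from "stuck on an error":
coolpup.py aa.cool bb.bed --subset 100 --nshifts 10 --mindist 100000 \
    --outname cc_subset.txt --flank 30000 --n_proc 1 --clr_weight_name ""
```

If the subset run finishes quickly, the problem is scaling rather than a hang, which points at the pair-distance settings discussed below in the thread.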

jiangshan529 commented 1 year ago

I am using 1 core and it has been running overnight, but I still didn't get a result. The commands:

coolpup.py --features_format bed aa.cool bb.bed --outname cc_250pad.clpy --flank 100000 --n_proc 1 --clr_weight_name ""

plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf

efriman commented 1 year ago

It should never run for that long. In your command it looks like you are running coolpup.py and plotpup.py in the same script, which won't work (but maybe it's just a paste error). Otherwise, try the coolpup.py command but adding e.g. --subset 100 and see what happens. It should give results in minutes.

jiangshan529 commented 1 year ago

Hi, I used --subset 100 and now it runs very fast, but the result looks weird. I am piling up well-characterized Hi-C data on a CTCF-centered bed file, so there should be an enrichment in the center; however, the heatmap is very noisy (see attached image).

Phlya commented 1 year ago

With a subset of 100, it's expected to be noisy. How many regions do you have in your dataset?

jiangshan529 commented 1 year ago

5600 peaks.

Phlya commented 1 year ago

5600 is a totally reasonable size and should not take that long; it should be on the order of minutes, not days. Maybe try increasing --subset gradually and see how the runtime changes?

jiangshan529 commented 1 year ago

Hi, when I run --subset 1000, it takes 3 min; --subset 2000, 20 min; --subset 3000, 3 h. I am using 16 cores. It's so weird; do you have any idea how I can solve it?

Phlya commented 1 year ago

Well, to be honest, you probably want to set --maxdist to something around 1 Mb if these are CTCF sites... That will speed it up a lot.
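The reason --maxdist helps so much: pairwise pile-ups consider every same-chromosome pair of regions, and that count grows roughly quadratically with the number of peaks, while a distance cutoff prunes most pairs. A throwaway illustration with made-up coordinates (peaks.bed below is example data, not the user's file):

```shell
# Throwaway example: count same-chromosome region pairs with and without
# a 1 Mb distance cutoff, mimicking what --maxdist prunes.
cat > peaks.bed <<'EOF'
chr1	100000	100001
chr1	600000	600001
chr1	1500000	1500001
chr1	4000000	4000001
EOF

summary=$(awk '{chrom[NR]=$1; mid[NR]=int(($2+$3)/2)}
  END{
    total=0; kept=0
    for(i=1;i<NR;i++)
      for(j=i+1;j<=NR;j++)
        if(chrom[i]==chrom[j]){
          total++
          if(mid[j]-mid[i]<1000000) kept++
        }
    print "pairs total: " total " within 1Mb: " kept
  }' peaks.bed)
echo "$summary"   # pairs total: 6 within 1Mb: 2
```

With the 5600 peaks mentioned in this thread, the all-pairs count is already about 15.7 million, so keeping only pairs within 1-3 Mb removes the bulk of the work.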

jiangshan529 commented 1 year ago

coolpup.py aa.cool bb.bed --nshifts 10 --mindist 100000 --outname cc.txt --flank 30000 --n_proc 16 --clr_weight_name ""

Hi, I tried "--mindist 1000 --maxdist 1000000 --n_proc 8" on CTCF sites (60,000 peaks). It's been over one day and I still didn't get a result.

jiangshan529 commented 1 year ago

plotpup.py --input_pups cc_250pad.clpy --not_symmetric --output dd_dot.pdf

When I run with --local, it really does finish in minutes. However, when I run without --local, I cannot get a result within several days.

Phlya commented 1 year ago

Can you try using expected instead of shifts? Or just 1 shift instead of 10? And 60_000 peaks is a lot, so it would take time, but not days...

jiangshan529 commented 1 year ago

--maxdist 3000000 --nshifts 1 really works (16 cores took 9 h to compute 60,000 peaks)! And I think setting maxdist to 3 Mb makes sense; there's no need to compute all pairs of peaks.

efriman commented 1 year ago

I think for such large sets it's better to normalize by expected (see the tutorial) instead of nshifts. Good that it works though!
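The expected-based workflow could look roughly like this. This is a sketch only: the file names follow the ones used later in this thread, and the exact flags should be checked against the coolpuppy and cooltools documentation for your versions.

```shell
# Sketch: normalize by expected instead of shifted controls (--nshifts).
# File names (test.mcool, hg38_arms.bed, bb.bed) are placeholders
# taken from this thread.

# 1) Compute cis expected per chromosome arm; hg38_arms.bed is a "view"
#    file of chromosome arms (see the cooltools docs for how to make it):
cooltools expected-cis --view hg38_arms.bed -p 8 \
    -o expected_cis.tsv test.mcool::resolutions/5000

# 2) Pile up using that expected instead of shifted controls.
#    Both steps use balanced data by default; keep the weight settings
#    consistent between them.
coolpup.py test.mcool::resolutions/5000 bb.bed \
    --expected expected_cis.tsv --view hg38_arms.bed \
    --mindist 100000 --maxdist 3000000 --flank 30000 \
    --outname cc.clpy --n_proc 16
```

Expected normalization avoids sampling shifted control regions for every pair, which is where much of the --nshifts cost goes on large region sets.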

jiangshan529 commented 1 year ago

Hi, is the expected matrix calculated by cooltools?

cooltools expected-cis --view hg38_arms.bed -p 2 -o test_expected_cis.tsv test.mcool::resolutions/5000

By the way, what is hg38_arms.bed?

Phlya commented 1 year ago

https://cooltools.readthedocs.io/en/latest/notebooks/contacts_vs_distance.html#

jiangshan529 commented 1 year ago

Thanks for your prompt response!

Phlya commented 1 year ago

Hope this is resolved now, feel free to reopen.