vaquerizaslab / chess

Comparison of Hi-C Experiments using Structural Similarity.

Chess sim command runtime improvement heuristics? #18

Open jmcbroome opened 4 years ago

jmcbroome commented 4 years ago

I’m encountering significant runtime problems when calculating a similarity score for a single target region. I’m performing a comparison between two species over a single syntenic region with balanced but not O/E cooler format inputs with 18 threads, though it doesn’t seem to be using most of those threads most of the time. I’m using the --background-query and --limit-background options.

It is at more than six days of runtime at this point and still has not actually returned any output. Is there a way to speed this up significantly, by selecting a restricted background decided in some other way or using different format files? I will be needing to run the sim command many more times on many more regions if I want to incorporate it into my project, and this is unfeasibly slow.

Output below:

2020-11-09 11:28:04,305 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --background-query -p 18 --limit-background human_chr1_hic_test.corrected.cool chicken_hic.corrected.cool chess_test.bedpe chess_single_test.txt'
2020-11-09 11:28:09,670 INFO CHESS version: 0.3.5
2020-11-09 11:28:09,670 INFO FAN-C version: 0.9.7
2020-11-09 11:28:09,673 INFO Loading reference contact data
Expected 100% (20923106 of 20923106) |########################################| Elapsed Time: 1:01:49 Time: 1:01:49
2020-11-09 17:21:29,197 INFO Loading query contact data
Expected 99% (25409016 of 25665672) |####################################### | Elapsed Time: 1 day, 19:55:08 ETA: 0:00:45
Expected 100% (25665672 of 25665672) || Elapsed Time: 6 days, 18:56:44 Time: 6 days, 18:56:44

nickmachnik commented 4 years ago

Hi @jmcbroome, it looks like your run hasn't even gotten to the stage of comparing anything; I think it is still computing the O/E values. The number of regions you are comparing and the --background-query and --limit-background options do not influence this step at all. I think the best you can do here, especially if you want to rerun the command on the same data, is to switch to a data format that stores precomputed O/E values, like Juicer or FAN-C. You can do this e.g. with

fanc from-cooler

The conversion will take a while, but you'll have to do it only once, then loading the data will be a lot faster. Hope this helps. Best, Nick
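For readers unfamiliar with the O/E step mentioned above, here is a minimal pure-Python sketch of the general idea: each observed value is divided by the mean of all values at the same diagonal offset (the same genomic distance). This is only an illustration on a toy dense matrix; it is not the FAN-C or CHESS implementation, and the function name `observed_over_expected` is made up for this example.

```python
from collections import defaultdict


def observed_over_expected(matrix):
    """Divide each entry by the mean of its diagonal (same bin distance).

    Illustrative only: real tools work on sparse, per-chromosome data and
    handle masked bins, which this toy version ignores.
    """
    n = len(matrix)
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate the sum and count for every diagonal offset.
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            sums[d] += matrix[i][j]
            counts[d] += 1
    expected = {d: sums[d] / counts[d] for d in sums}
    # Second pass: divide observed by expected, guarding against zero.
    return [
        [
            matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


toy = [
    [8.0, 4.0, 2.0],
    [4.0, 8.0, 6.0],
    [2.0, 6.0, 8.0],
]
oe = observed_over_expected(toy)
for row in oe:
    print(row)
```

Even in this toy form, every pixel has to be visited to get the per-distance means, which is why this step scales with the size of the whole matrix rather than with the number of region pairs being compared.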

nickmachnik commented 4 years ago

@jmcbroome, do you get an acceptable runtime after the conversion?

jmcbroome commented 4 years ago

Sorry for not following up more quickly, I've been trying to make it work. I've been precalculating O/E matrices with the HiCExplorer implementation (hicTransform in obs_exp mode), which only takes a few minutes. However, it appears to be hanging on reading in the reference contact data, and has spent 3 days so far reading in an obs-exp transformed chicken HiC matrix as reference. It's only using one thread for this step, though I gave it access to 18 cores. This seems to be a serious bottleneck. I may try breaking my data into a series of syntenic chromosome pairs and associated syntenic region pairs and starting several parallel CHESS runs, so that each individual CHESS run has to read in less total contact data, since this step is not multithreaded in the software.
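The splitting idea described above could start from something like the following sketch, which groups BEDPE region pairs by their (reference chromosome, query chromosome) combination. It assumes standard tab-separated BEDPE with the two chromosome names in columns 1 and 4; the function name and the example rows are hypothetical, and producing the matching per-chromosome matrix files is not shown.

```python
from collections import defaultdict


def group_bedpe_by_chrom_pair(lines):
    """Group BEDPE lines by (chrom1, chrom2), assuming tab-separated fields."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        groups[(fields[0], fields[3])].append(line)
    return dict(groups)


# Hypothetical region pairs, for illustration only.
example = [
    "chr1\t0\t1000000\tchr8\t500000\t1500000\tpair1",
    "chr1\t2000000\t3000000\tchr8\t4000000\t5000000\tpair2",
    "chr2\t0\t1000000\tchr7\t0\t1000000\tpair3",
]
groups = group_bedpe_by_chrom_pair(example)
for (ref_chrom, query_chrom), rows in sorted(groups.items()):
    print(ref_chrom, query_chrom, len(rows))
```

Each group could then be written to its own BEDPE file and used in a separate run. Note that this only reduces loading time if the matrices passed to each run are also restricted to the relevant chromosomes; splitting the BEDPE file alone would not change how much contact data is read.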

Current output:
2020-11-20 18:09:48,878 INFO Running '/home/jmcbroome/anaconda3/bin/chess sim --oe-input --background-query --limit-background -p 18 chicken_hicexplorer_obsexp.cool human_chr1_hicexplorer_obsexp.cool chess_test_group.bedpe chess_test_group_v2.txt'
2020-11-20 18:09:49,897 INFO CHESS version: 0.3.5
2020-11-20 18:09:49,897 INFO FAN-C version: 0.9.7
2020-11-20 18:09:49,899 INFO Loading reference contact data

It is now 2020-11-23 and no additional lines have been printed.

kaukrise commented 4 years ago

Yeah, this is indeed a bottleneck. Besides the O/E calculation, CHESS currently also reads everything in the reference and query matrices into memory. That also means a lot of data is read that will never be used, because it lies outside of the region pairs of interest.

For txt file input, this is a difficult problem to solve, as one has to iterate over the entire file anyways to see which contacts are relevant. For FAN-C compatible matrices (i.e. Juicer and FAN-C, while Cooler does not have support for expected values on file), however, we might be able to retrieve regions on the fly. Then the sim part of the CHESS run would start practically instantaneously. However, concurrency might become an issue here. PyTables, which is HDF5-backed, really does not like concurrent access...
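To illustrate the text-file case, here is a small sketch of filtering sparse contacts down to one window of interest. It assumes a hypothetical whitespace-separated "bin1 bin2 value" layout, which may not match the exact text format CHESS reads, and the function name is invented for this example.

```python
def contacts_in_window(lines, row_range, col_range):
    """Keep only contacts whose bins fall inside the half-open ranges given.

    Assumes each line is 'bin1 bin2 value'. Every line still has to be
    read, because a plain text file has no index to jump to a region.
    """
    row_start, row_end = row_range
    col_start, col_end = col_range
    kept = []
    for line in lines:
        bin1, bin2, value = line.split()
        bin1, bin2 = int(bin1), int(bin2)
        if row_start <= bin1 < row_end and col_start <= bin2 < col_end:
            kept.append((bin1, bin2, float(value)))
    return kept


# Toy sparse matrix, for illustration only.
pixels = ["0 0 12.0", "0 5 3.0", "2 3 7.5", "3 3 9.0", "9 9 1.0"]
window = contacts_in_window(pixels, (2, 4), (3, 4))
print(window)
```

Memory use drops because only the matching contacts are kept, but the run time is still dominated by the full pass over the file. An indexed format avoids that pass, which is the appeal of retrieving regions on the fly.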

kaukrise commented 4 years ago

Hello again! I did not have time yet to refactor the CHESS code to load submatrices from file on demand - I'm going to need @nickmachnik's help for this, too, particularly for the multiprocessing part.

However, I worked on the Cooler compatibility layer in FAN-C last night and found some inefficiencies, which I fixed. On my machine (SSD), I now get a reading speed of 15-20 million pixels per minute (don't ask what it was before...). This should load high-res Cooler matrices much more quickly. The changes are already available via GitHub and PyPI (FAN-C version 0.9.8). @nickmachnik, maybe you can bump the fanc dependency in CHESS' setup.py?

Cheers,

Kai

nickmachnik commented 4 years ago

I updated the fanc dependency to 0.9.8 and will put this on PyPI with the next patch (0.3.6).