ma-compbio / Higashi

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, hypergraph
MIT License

Time consumption #11

Closed · JohnGenome closed this issue 3 years ago

JohnGenome commented 3 years ago

Hi @ruochiz, I would like to try Higashi on public data. Is there any information about the time consumption of model training? Does an entry-level PC (31 GB RAM, i7-4770, GTX 750 Ti) meet the hardware requirements?

ruochiz commented 3 years ago

Hi,

Thanks for your interest in our method. I just updated the wiki page with some runtime information:

https://github.com/ma-compbio/Higashi/wiki/Higashi-Usage#runtime-of-higashi

A more detailed discussion of the runtime can be found in the published version of our manuscript. For your machine, I think the bottleneck would be memory: consumption depends on the number of cells, the sequencing depth of the dataset, and the resolution. 32 GB should be enough for most datasets at 1Mb resolution.

JohnGenome commented 3 years ago

I'm new to deep learning, and I only have access to a CPU node (about 20 cores and 256 GB of RAM). Is training with CPU only feasible? Thanks!

JohnGenome commented 3 years ago

Sorry for asking so many questions in one issue. I don't understand the runtime measurements on the wiki page. An scHi-C dataset (e.g., the Nagano et al. dataset) with 1,171 single cells and 56,800 median contacts per cell has about 7e7 observed positive triplets. When training with cd-GNN (k=4, fast mode) as the wiki page shows, it would take 7e7 / (192 × 1000) epochs × 109.6 s/epoch ≈ 40,000 s ≈ 11 h to go over the whole training dataset once. Does this formula make sense?
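Spelled out as a quick sanity check (the cell and contact counts come from the question above; the 192 × 1,000 samples-per-epoch figure is assumed from the wiki's benchmark settings):

```python
# Back-of-envelope check of the runtime estimate in the question above.
# Assumed: batch size 192, 1,000 batches per epoch, 109.6 s per epoch
# for cd-GNN (k=4, fast mode), as quoted from the wiki.

n_cells = 1_171
median_contacts = 56_800
positive_triplets = n_cells * median_contacts   # ~6.7e7, rounded to 7e7 above

samples_per_epoch = 192 * 1_000                 # batch size * batches per epoch
epochs_full_pass = positive_triplets / samples_per_epoch  # ~346 (vs ~365 with 7e7)
seconds_per_epoch = 109.6

total_seconds = epochs_full_pass * seconds_per_epoch
print(f"epochs for one full pass: {epochs_full_pass:.0f}")
print(f"time for one full pass: {total_seconds:.0f} s (~{total_seconds / 3600:.1f} h)")
# -> roughly 11 h, matching the estimate in the question
```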

ruochiz commented 3 years ago

Yes, training with CPU is feasible.
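As an illustration only (this is a generic PyTorch sketch, not Higashi's actual API, which is driven by its JSON configuration file), a CPU-only training step needs nothing beyond standard PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical, generic sketch: it only demonstrates that a PyTorch
# forward/backward pass runs without a GPU. Names below are placeholders.
torch.set_num_threads(20)                   # match the node's 20 cores
device = torch.device("cpu")

model = nn.Linear(64, 1).to(device)         # stand-in for a real model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(192, 64, device=device)     # one dummy batch of 192 samples
y = torch.randn(192, 1, device=device)

loss = nn.functional.mse_loss(model(x), y)  # one training step
opt.zero_grad()
loss.backward()
opt.step()
```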

The formula makes sense, but you do not need to go over the whole training dataset even once. Our tests showed that about 45 to 60 epochs are enough for most datasets.
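Plugging those epoch counts into the same per-epoch timing gives a rough estimate (a sketch, assuming the wiki's 109.6 s/epoch figure applies unchanged):

```python
# Estimated total training time at 45-60 epochs, using the per-epoch
# timing quoted from the wiki (cd-GNN, k=4, fast mode).
seconds_per_epoch = 109.6
for epochs in (45, 60):
    total = epochs * seconds_per_epoch
    print(f"{epochs} epochs: {total:.0f} s (~{total / 3600:.1f} h)")
# -> about 1.4 to 1.8 h, far less than one full pass over all ~7e7 triplets
```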