nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

Out of memory when executing ice normalization #170

Closed FenghuJi closed 4 years ago

FenghuJi commented 6 years ago

Hi, I am using HiC-Pro to build the interaction matrix, with my de novo human genome assembly as the reference. My sequencing data is 545Gb uncompressed, and the assembly is 2.8Gb. The earlier steps succeeded, but ICE normalization failed: memory usage exceeded the 187Gb available on my machine. With a bin size >= 40kb, the run completes successfully. The enzyme I used is MboI. Is there a parameter or config setting I can change so that the program runs with a bin size smaller than 40kb? Thanks a lot. Hugo
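For a rough sense of how bin size drives the size of the contact map, here is a back-of-the-envelope sketch using the 2.8Gb assembly size quoted above. The helper names (`n_bins`, `upper_triangle_pairs`) are made up for illustration. The pair count is only an upper bound on the number of non-zero entries, not a measurement of what the normalization step actually allocates.

```python
def n_bins(genome_size_bp: int, bin_size_bp: int) -> int:
    """Number of fixed-size bins needed to tile the genome.

    Ignores per-chromosome rounding, so the real count is slightly higher.
    """
    return (genome_size_bp + bin_size_bp - 1) // bin_size_bp


def upper_triangle_pairs(bins: int) -> int:
    """Number of possible bin pairs in the upper triangle, diagonal included."""
    return bins * (bins + 1) // 2


GENOME_SIZE_BP = 2_800_000_000  # 2.8Gb assembly, as stated in the post

for bin_size in (40_000, 20_000, 10_000):
    bins = n_bins(GENOME_SIZE_BP, bin_size)
    pairs = upper_triangle_pairs(bins)
    print(f"{bin_size} bp bins: {bins} bins, {pairs:.3e} possible bin pairs")
```

Halving the bin size doubles the number of bins and roughly quadruples the number of possible pairs, which is why resolution matters far more than the amount of sequencing data.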

nservant commented 6 years ago

Hi Hugo, The size of your sequencing data is not an issue. The only limitation is the resolution of the contact maps. Could you check which format you set in MATRIX_FORMAT? It should be set to 'upper'. Then, if you still have trouble with the normalization at high resolution, you can run it per chromosome, using this utility to split the matrix per chromosome: https://github.com/nservant/HiC-Pro/blob/devel/bin/utils/split_sparse.py However, be aware that in this case the ICE assumption of equal visibility (genome-wide) is debatable, since you are working per chromosome. Best
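To illustrate what a per-chromosome split involves, here is a minimal sketch. It is not the `split_sparse.py` utility linked above, and its function names are hypothetical. It assumes the matrix is stored as whitespace-separated triplets (`bin_i bin_j count`) and that a BED-like file gives `chrom start end bin_id` for each bin.

```python
from collections import defaultdict


def load_bin_to_chrom(bed_lines):
    """Map each bin index to its chromosome from lines: chrom, start, end, bin_id."""
    bin_to_chrom = {}
    for line in bed_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, _start, _end, bin_id = line.split()[:4]
        bin_to_chrom[int(bin_id)] = chrom
    return bin_to_chrom


def split_intra_by_chrom(matrix_lines, bin_to_chrom):
    """Group intra-chromosomal triplets (bin_i, bin_j, count) by chromosome.

    Inter-chromosomal contacts are dropped: they belong to no single
    chromosome, which is exactly the information a per-chromosome
    normalization cannot use.
    """
    per_chrom = defaultdict(list)
    for line in matrix_lines:
        if not line.strip():
            continue
        i, j, count = line.split()[:3]
        i, j = int(i), int(j)
        chrom = bin_to_chrom[i]
        if chrom == bin_to_chrom[j]:
            per_chrom[chrom].append((i, j, float(count)))
    return dict(per_chrom)


# Toy example: two bins on chr1, one bin on chr2 (made-up data).
bed = ["chr1\t0\t10\t1", "chr1\t10\t20\t2", "chr2\t0\t10\t3"]
matrix = ["1\t1\t5", "1\t2\t3", "2\t3\t7", "3\t3\t2"]
for chrom, triplets in split_intra_by_chrom(matrix, load_bin_to_chrom(bed)).items():
    print(chrom, triplets)
```

In this sketch the bin indices stay genome-wide, so each per-chromosome matrix would likely need re-indexing before being passed to a normalization tool. For real files, pass open file handles in place of the lists.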

BioinformaticsXin commented 5 years ago

Dear Servant, I ran into the same issue, so I split the matrix per chromosome and obtained an ICE-normalized matrix for each chromosome. I understand that the ICE assumption of genome-wide equal visibility then becomes equal visibility per chromosome. Will this greatly affect my results if I use the merged matrix to run Fit-Hi-C? I am not sure whether this approach is right. Best wishes

nservant commented 5 years ago

Hi Bonnie, I think there are two questions in your last post: 1/ can we apply ICE per chromosome, and 2/ can you merge several contact maps? Regarding ICE, the method was published to be run genome-wide, mainly because the assumption of equal visibility makes more sense at that level. In practice, though, I know that many papers have been published with ICE normalization per chromosome. If you assume that equal visibility holds chromosome-wide, I think it's fine. Regarding merging contact maps, you can absolutely do that, but on the raw data, not on the normalized ones. Note that the RAM issue is mainly linked to the size of the contact maps, so if you merge two maps, the memory usage should not change much (provided the sparsity remains fairly stable).
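Merging on the raw data amounts to summing counts bin pair by bin pair. The sketch below assumes each map is a list of `(bin_i, bin_j, count)` triplets built on the same binning. `merge_raw_maps` is a hypothetical helper, not part of HiC-Pro.

```python
from collections import Counter


def merge_raw_maps(*maps):
    """Sum raw counts of sparse contact maps that share the same binning.

    Each map is an iterable of (bin_i, bin_j, count) triplets. Keys are
    stored with bin_i <= bin_j so the result stays upper-triangular.
    """
    merged = Counter()
    for triplets in maps:
        for i, j, count in triplets:
            key = (i, j) if i <= j else (j, i)
            merged[key] += count
    return sorted((i, j, c) for (i, j), c in merged.items())


# Toy example with made-up counts for two replicates.
replicate_a = [(1, 1, 5), (1, 2, 3)]
replicate_b = [(1, 2, 4), (2, 3, 1)]
merged = merge_raw_maps(replicate_a, replicate_b)
print(merged)
print(len(merged), "non-zero entries after merging")
```

Bin pairs seen in both replicates collapse into a single entry, so the merged map has the same dimensions and at most as many non-zero entries as the two inputs combined. Normalization would then be run once, on the merged raw counts.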

BioinformaticsXin commented 5 years ago

Dear Servant, Thank you for your help, this is really helpful! Best wishes