jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License

Genrich with a large number of samples #91

Closed: bwenz91 closed this issue 4 months ago

bwenz91 commented 2 years ago

Hi, I have successfully used Genrich before with a smaller sample size, but am now attempting to run it with a large number of samples. In my latest attempt, about 1000 BAM files are read in before I exceed the memory allocation that I requested. I can request more memory up to a point, but am wondering whether what I am trying to do is even possible. I noticed in the documentation that the number of input files has little effect on memory, but maybe I am just trying to use too many input files. It is not clear to me that the analysis options I am using (-j, -v, -q 0.05, -y) would be causing this large memory usage. We like the tool a lot, so we were hoping to use it in this case, if possible. Thanks!

jsh58 commented 2 years ago

Thanks for the question. As you noted, the number of input files has little effect on memory, since data structures are reused by Genrich. But I have never tested it on so many BAM files at once. I do not think your analysis options make a difference. Have you tried running it on fewer files -- say 10, 20, 50, 100, 200 -- and observing the memory usage?

bwenz91 commented 2 years ago

Thank you for the response. I have performed multiple runs in the past with various numbers of BAM files, but have not done a systematic analysis with identical BAM files. In those runs I noticed that the increase in memory usage was not linear; however, they involved different sets of BAM files, so file sizes may have differed slightly (I am not sure how significant that would be in the grand scheme of things).

maxdudek commented 1 year ago

In case people come across this in the future: the issue is that Genrich stores a p-value pileup in memory for every sample and only sums them up after every BAM file has been read. This means that memory usage increases linearly with sample size, which becomes prohibitive when n is very large. The solution is to instead keep only one p-value pileup in memory, a running cumulative sum of the sample p-values, which can then be fed into the q-value calculation (see the sketch after this comment).

I've implemented that solution in this fork for @bwenz91 and my lab, which may be useful to others looking to run Genrich with > 1000 BAM files.
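As an illustration of the idea, here is a minimal C sketch (not the fork's actual code; the helper function and variable names are hypothetical) showing how each sample's p-value pileup can be folded into a single running total instead of being retained per sample:

```c
/* Minimal sketch of the cumulative-sum approach: fold each sample's
 * p-value pileup into one running total and reuse the per-sample buffer.
 * compute_pvalue_pileup() is a hypothetical stand-in for the per-BAM step. */
#include <stdlib.h>

/* hypothetical: fills 'pval' (length n_pos) with p-value pileup for one BAM */
extern void compute_pvalue_pileup(const char *bam_path, double *pval, long n_pos);

double *sum_pvalue_pileups(char **bam_paths, int n_samples, long n_pos) {
  double *total = calloc(n_pos, sizeof(double)); /* single cumulative pileup */
  double *tmp = malloc(n_pos * sizeof(double));  /* reused per-sample buffer */
  if (total == NULL || tmp == NULL)
    exit(EXIT_FAILURE);
  for (int s = 0; s < n_samples; s++) {
    compute_pvalue_pileup(bam_paths[s], tmp, n_pos);
    for (long i = 0; i < n_pos; i++)
      total[i] += tmp[i];                        /* fold into running sum */
  }
  free(tmp);
  return total; /* combined pileup, ready for the q-value calculation */
}
```

With this structure, only the cumulative pileup and one reusable per-sample buffer are resident at any time, so peak memory no longer scales with the number of BAM files.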

jsh58 commented 1 year ago

This is a good point. I was imprecise when I said "data structures are reused by Genrich" for multiple input files; this is true for the pileup values, but not for the p-values, which are kept in memory for each sample. For a large number of input files, especially with complicated samples, the memory usage can definitely add up.

Regarding the fork from @maxdudek, I must caution that some outputs (e.g. -f <file>) will not be produced correctly. But if the program produces good results without the excessive memory usage, then well done!