jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License

Question: how does Genrich determine the reference genome? #1

Closed by mchimenti 5 years ago

mchimenti commented 5 years ago

In the docs you write: "Genrich runs very quickly but uses a considerable amount of memory. For starters, it requires 3 bytes for every base-pair of the reference genome, i.e. ~9GB for a human sample. The number of input files has little effect on memory, but certain analysis options (especially the option to remove PCR duplicates) can greatly increase the memory usage, particularly with large SAM/BAM input files. See above for an example."

Does Genrich determine the reference genome information from the SAM header?

Does Genrich download a reference genome from somewhere while doing its peak calling?

thanks, Michael

jsh58 commented 5 years ago

Michael,

Thanks for the question. Yes, Genrich determines the reference genome length from the SAM/BAM header, as described here. It does not use any other information about the reference genome, and it does not download anything, ever.
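To make the header-based lookup concrete, here is a minimal Python sketch of how a total reference length can be read from a SAM header by summing the `LN` tags of its `@SQ` lines. This is not Genrich's own code: the function name `reference_length_from_header` and the toy header (with made-up sequences `chrA` and `chrB`) are hypothetical and only illustrate the idea.

```python
def reference_length_from_header(header_text):
    """Sum the LN (sequence length) tags of all @SQ lines in a SAM header.

    Hypothetical helper for illustration; not part of Genrich.
    """
    total = 0
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # only @SQ lines describe reference sequences
        for field in line.split("\t")[1:]:
            if field.startswith("LN:"):
                total += int(field[3:])
    return total


# Toy header with two made-up reference sequences (assumption, not real data).
example_header = (
    "@HD\tVN:1.6\tSO:queryname\n"
    "@SQ\tSN:chrA\tLN:1000\n"
    "@SQ\tSN:chrB\tLN:2500\n"
    "@PG\tID:aligner\n"
)

print(reference_length_from_header(example_header))  # 3500
```

For a real file, the header text can be obtained with `samtools view -H`, and the same function applied to its output.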

John Gaspar

mchimenti commented 5 years ago

John, thanks for your response. I ask because I'm trying to incorporate Genrich into a Nextflow-based ATAC-seq pipeline. It errors out on a very small test dataset with exit status 137 (job killed for insufficient memory).

What would you say is the minimum memory allocation for a Genrich job, even on a "toy" dataset? Should I allocate at least 9GB as you say in the docs, even for a small test set?

jsh58 commented 5 years ago

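For human samples, you will need at least 9GB because of the genome length. That baseline is set by the length of the reference genome listed in the SAM/BAM header, not by the number of reads, so even a toy dataset aligned to the human genome needs that much.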
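The 9GB figure follows from the 3 bytes per base-pair quoted from the docs at the top of this thread. A small Python sketch of the arithmetic, assuming a human reference of roughly 3.1 billion bp (an approximate value used here for illustration; the real total depends on which assembly and contigs appear in the header):

```python
BYTES_PER_BP = 3  # per-base figure quoted from the Genrich docs in this thread


def baseline_memory_bytes(genome_length_bp):
    """Baseline memory implied by the genome length alone.

    Hypothetical helper; it ignores any extra memory used by analysis
    options such as PCR duplicate removal.
    """
    return BYTES_PER_BP * genome_length_bp


# Approximate human genome length in bp (assumption for illustration).
HUMAN_GENOME_BP = 3_100_000_000

mem = baseline_memory_bytes(HUMAN_GENOME_BP)
print(f"{mem / 1e9:.1f} GB")  # 9.3 GB
```

This is only a floor. As the docs quote notes, options such as removing PCR duplicates can raise usage well beyond it, so a pipeline's memory request should leave headroom above this number.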