Yang990-sys opened this issue 4 months ago
For large genomes, memory usage is also a serious problem: the peak memory usage of `modkit pileup` reached 150 GB, and `modkit find-motifs` exhausted the memory of a server with 500 GB of RAM. Are there any optimizations planned for this?
I have read the Performance Considerations section of the documentation, but it did not resolve my doubts. For a 300 GB input file, memory is exhausted before the seed-searching step even begins.
Hello @Yang990-sys,
> May I ask if deleting all 0 rows will have an impact on it?
For `modkit dmr pair`, removing bedMethyl records with 0% modification will not yield correct results. Positions where both conditions have 0% methylation will not be processed at all, so you will get no output for those bases. In the case where the two conditions differ (say one condition has 100% modification and the other has 0%), the DMR algorithm will not assume that removing the 0%-methylation records implicitly marks those positions as canonical; it will see that there is no data to compare against and emit no output. Do the majority of the records have very low $N_{\text{valid}}$? If so, you could remove low-coverage records by filtering the data through a pipe before writing it to the filesystem:
```shell
modkit pileup ${modbam} - | awk '$5>5' | bgzip > ${out_filt_bedmethyl}
```
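As a sanity check on what that `awk` filter does: in modkit's bedMethyl output, column 5 (the score) carries the valid coverage, so `$5>5` keeps records with more than 5 valid calls. A minimal sketch on two invented records (coordinates and values are made up for illustration):

```shell
# Two invented bedMethyl-style records; column 5 carries the valid coverage
# in modkit pileup output, so '$5>5' drops the low-coverage first record.
printf 'chr1\t10468\t10469\tm\t2\t+\nchr1\t10470\t10471\tm\t12\t+\n' |
  awk '$5>5'
```

Only the second record (coverage 12) survives the filter.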
I think a better option is to partition the analysis into genomic regions, for example chromosomes or Mbp-long intervals. Differential methylation works on a genomic "column", so you can process each chromosome (or an interval of a chromosome) separately and then combine the results. You can also pipe the output of `modkit pileup` directly into `bgzip` to save space when writing down the table.
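The partitioning idea can be sketched as a loop over chromosomes using pileup's `--region` option; the input path and chromosome list below are placeholders. Written as a dry run that only prints each command — drop the `echo` to execute them:

```shell
# Dry run: print one per-chromosome pileup command; remove `echo` to execute.
modbam="sample.mod.bam"                      # placeholder modBAM path
for chrom in chr1 chr2 chr3; do              # substitute your chromosome names
  echo "modkit pileup --region ${chrom} ${modbam} - | bgzip > pileup.${chrom}.bed.gz"
done
```

The per-chromosome outputs can then be processed independently and the results concatenated afterwards.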
For `modkit find-motifs` the answer is a little trickier: currently the algorithm needs to load the entire bedMethyl table. I'll need to perform some experiments to see if and how I can remove this requirement when working with very large bedMethyl files. A couple of things you could try in the meantime:
- Make `--context-size` smaller; the default is (12, 12), maybe try (8, 8).
- Make sure `--min-coverage` is sufficiently high (this applies to DMR as well, as I mentioned).

> For large genomes, memory usage is also a serious problem: the peak memory usage of `modkit pileup` reached 150 GB, and `modkit find-motifs` exhausted the memory of a server with 500 GB of RAM. Are there any optimizations planned for this?
How large is the genome you're using? (You previously mentioned studying human methylation.) I am working on decreasing the memory usage (and increasing the processing speed) of `pileup`; however, as I mentioned, reducing the memory usage of `find-motifs` requires a few experiments on my side.
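To make the earlier suggestions concrete, a candidate invocation combining a smaller context window with a coverage floor might look like the line below. Note that `--context-size` and `--min-coverage` come from the discussion above, but the input/reference flag names and the value 10 are my assumptions — check `modkit find-motifs --help` for the real option names. Shown as a dry run:

```shell
# Dry run: print a candidate find-motifs command; flags other than
# --context-size and --min-coverage are assumed, not verified.
echo "modkit find-motifs --context-size 8 8 --min-coverage 10" \
     "--in-bedmethyl pileup.filt.bed.gz --ref reference.fa"
```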
Hello, I am using modkit to study human methylation. However, the average size of a bedMethyl file containing three types of methylation is 300 GB, which is too large for my pipeline to analyze, and in these bed files most methylation fractions are 0, which is inconvenient for downstream analysis. I am wondering whether it is possible to delete all rows with a methylation fraction of 0, and whether, when calculating DMR, unmeasured positions default to a methylation fraction of 0. I mainly use two programs, `dmr pair` and `find-motifs`; may I ask if deleting all 0 rows will affect them?