nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

How does Modkit handle Large Genome Data? #190

Open Yang990-sys opened 4 months ago

Yang990-sys commented 4 months ago

Hello, I am using modkit to study human methylation. However, a bed file containing three types of methylation averages 300 GB, which is too large for my pipeline to analyze, and most methylation fractions in the bed files are 0, which makes downstream analysis inconvenient. I am wondering whether it is possible to delete all rows with a methylation fraction of 0. When calculating DMR, is the default methylation fraction for unmeasured positions 0? I mainly use two programs, dmr pair and find-motifs. Would deleting all 0 rows affect them?

Yang990-sys commented 4 months ago

Also, for large genomes the memory usage is alarming: the peak memory usage of modkit pileup reached 150 GB, and modkit find-motifs exhausted the memory of a server with 500 GB. Are any optimizations planned for this?

Yang990-sys commented 4 months ago

I have read the Performance Considerations documentation, but it does not resolve my problem. For 300 GB input files, memory is already exhausted before the seed-searching step.

ArtRand commented 4 months ago

Hello @Yang990-sys,

> May I ask if deleting all 0 rows will have an impact on it?

For modkit dmr pair, removing bedMethyl records with 0% modification will not yield correct results. If you do, positions where both conditions have 0% methylation will not be processed at all, and you will get no output for these bases. Where the two conditions differ (say one condition has 100% modification and the other has 0%), the DMR algorithm will not assume that the removed 0% records are canonical. It will see that there is no data to compare against and emit no output. Do the majority of the records have very low $N_{\text{valid}}$? If so, you could remove records with low coverage by filtering the data through a pipe before writing it to the filesystem:

modkit pileup ${modbam} - | awk '$5>5' | bgzip > ${out_filt_bedmethyl}
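To make the difference between filtering on coverage and filtering on modification fraction concrete, here is a minimal sketch that runs the same awk filter over a few synthetic rows. The rows and the temporary file names are made up for illustration; the only thing taken from the bedMethyl layout is that column 5 (score) carries the valid coverage and column 11 the percent modified.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Synthetic bedMethyl-style rows (tab-separated); all values are made up.
# Column 5 (score) mirrors N_valid_cov in column 10; column 11 is percent modified.
demo_dir=$(mktemp -d)
{
  printf 'chr1\t100\t101\tm\t20\t+\t100\t101\t255,0,0\t20\t0.00\t0\t20\t0\t0\t0\t0\t0\n'
  printf 'chr1\t200\t201\tm\t3\t+\t200\t201\t255,0,0\t3\t0.00\t0\t3\t0\t0\t0\t0\t0\n'
  printf 'chr1\t300\t301\tm\t12\t+\t300\t301\t255,0,0\t12\t75.00\t9\t3\t0\t0\t0\t0\t0\n'
  printf 'chr1\t400\t401\tm\t2\t+\t400\t401\t255,0,0\t2\t100.00\t2\t0\t0\t0\t0\t0\t0\n'
} > "$demo_dir/demo.bed"

# Same filter as above: keep rows whose coverage (column 5) exceeds 5.
awk '$5>5' "$demo_dir/demo.bed" > "$demo_dir/filtered.bed"

# Print position, coverage, and percent modified for the surviving rows.
cut -f2,5,11 "$demo_dir/filtered.bed"
```

The row at position 100 survives even though it is 0% modified, because it has 20 valid calls, while the low-coverage rows at 200 and 400 are dropped regardless of their modification fraction. Well-covered unmodified positions therefore stay available for comparison.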

I think a better option is to partition the analysis into genomic regions, for example chromosomes or Mbp-long regions. Differential methylation works on a genomic "column", so you can process each chromosome (or an interval of a chromosome) separately and then combine the results. You can also pipe the output of modkit pileup directly into bgzip to save disk space when writing the table.
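As a rough sketch of the partitioning idea, the snippet below splits a position-sorted table into one file per chromosome in a single streaming pass and then loops over the chunks. The rows, directory names, and the per-chunk step are placeholders, not part of modkit. With a bgzipped, tabix-indexed table, `tabix` region queries would serve the same purpose without creating intermediate files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Synthetic, position-sorted bedMethyl-style rows (first 6 columns only; values are made up).
work_dir=$(mktemp -d)
{
  printf 'chr1\t100\t101\tm\t20\t+\n'
  printf 'chr1\t250\t251\tm\t14\t+\n'
  printf 'chr2\t50\t51\tm\t9\t-\n'
} > "$work_dir/all.bed"

# Stream the table once and write one file per chromosome.
# Closing the previous file on each chromosome change keeps the
# number of open file handles at one for sorted input.
mkdir -p "$work_dir/by_chrom"
awk -v dir="$work_dir/by_chrom" '
  $1 != prev { if (prev != "") close(out); prev = $1; out = dir "/" $1 ".bed" }
  { print >> out }
' "$work_dir/all.bed"

# Each chunk can now be processed independently and the results concatenated.
for chunk in "$work_dir"/by_chrom/*.bed; do
  # Placeholder for the real per-chromosome step (for example, a DMR run).
  printf '%s\t%s\n' "$(basename "$chunk" .bed)" "$(wc -l < "$chunk" | tr -d ' ')"
done > "$work_dir/summary.tsv"

cat "$work_dir/summary.tsv"
```

Because each chunk only holds one chromosome, peak memory for the downstream step scales with the largest chromosome rather than with the whole genome.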

For modkit find-motifs the answer is a little trickier: currently the algorithm needs to load the entire bedMethyl table. I'll need to perform some experiments to see if and how I can remove this requirement when working with very large bedMethyl files. A couple of things you could try in the meantime:

> And for large genomes, memory usage is also very scary, the peak memory usage of modkit pileup reached 150G, and modkit find-motifs has exploded a server with 500GB of memory. Is there any optimization for this aspect in the later stage?

How large is the genome you're using? (You previously mentioned studying human methylation.) I am working on decreasing the memory usage (and increasing the processing speed) of pileup; however, as I mentioned, decreasing memory usage for find-motifs requires a few experiments on my side.