jernst98 / ChromHMM

GNU General Public License v3.0

WGBS binarization #46

Closed makc-sel closed 2 years ago

makc-sel commented 2 years ago

Hello! I'm trying to add WGBS data to a ChromHMM model, and I ran into an issue with binarizing the WGBS bedgraph file. From the BSBOLT software I got a file containing these columns:

  1. Chromosome
  2. Start Position
  3. End Position
  4. Methylation Percentage, percentage of methylated bases to total observed bases
  5. Methylated Bases, methylated nucleotides observed
  6. Unmethylated Bases, total unmethylated bases

The paper (https://www.nature.com/articles/s42003-021-01756-4) suggests the following method for processing the data:

    For WGBS data, BED files were downloaded from the ENCODE portal (Supplementary Data 1), These files contain, among other values, the percent methylation at each CpG dinucleotide in the genome (ranging from 1–100). For each set of two replicates, these values were averaged in 200-bp genomic bins to obtain the mean percent methylation of CpGs in each window. The 200-bp bins were subsequently binarized based on a 50% methylation threshold. Bins that did not contain any CpGs were marked as missing data, as specified by the ChromHMM binarized data format.

But ChromHMM LearnModel does not seem to support missing values. Can you please suggest a way to deal with them? And how can we binarize a bedgraph file?

jernst98 commented 2 years ago

ChromHMM LearnModel should handle missing values. If there is a '2' in the binary data it should be treated as missing. Why do you think it is not supporting missing values?
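
As a minimal illustration of that convention, the sketch below writes a single-mark binarized file in which bins without data carry `2`. The layout (a header line with cell and chromosome separated by a tab, a line with the mark names, then one line per bin) follows the binarized data format described in the ChromHMM manual. The function name `write_binarized` and the cell and mark names are hypothetical and used only for illustration.

```python
import io


def write_binarized(handle, cell, chrom, mark, values):
    """Write per-bin calls for one mark: 0 = absent, 1 = present, 2 = missing."""
    for v in values:
        if v not in (0, 1, 2):
            # Values such as -1 are rejected here; missing data must be coded as 2.
            raise ValueError(f"invalid bin value {v!r}; use 0, 1, or 2 (missing)")
    handle.write(f"{cell}\t{chrom}\n")  # line 1: cell and chromosome, tab-separated
    handle.write(f"{mark}\n")           # line 2: mark name(s)
    for v in values:
        handle.write(f"{v}\n")          # one line per consecutive bin


# Hypothetical example: four bins, the third one has no data.
buf = io.StringIO()
write_binarized(buf, "cellA", "chr1", "WGBS", [0, 1, 2, 1])
print(buf.getvalue())
```

With several marks, each bin line would hold one tab-separated value per mark, in the order given on the second line.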

makc-sel commented 2 years ago

Thanks for the hint and for the fast answer! I had tried -1 instead of 2, and it did not work. With 2 it works now. Do you perhaps have a function that implements the algorithm cited above somewhere?

jernst98 commented 2 years ago

Glad it works now. Not sure I understand your last question, but note I am not an author of the specific paper you cited above.
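
For reference, the procedure quoted from the paper could be sketched roughly as follows. This is not code from ChromHMM or from the paper's authors. It assumes rows of `(chrom, start, end, percent_methylation)` taken from the first four columns of the bedgraph, assigns each CpG to a bin by its start position, drops a trailing partial bin, and treats a mean of exactly 50% as methylated; all of these are assumptions, as are the function and constant names. Averaging across replicates is not shown.

```python
BIN_SIZE = 200      # bin width in bp, as in the quoted method
THRESHOLD = 50.0    # percent methylation cutoff, as in the quoted method
MISSING = 2         # ChromHMM code for missing data


def binarize_methylation(rows, chrom, chrom_length,
                         bin_size=BIN_SIZE, threshold=THRESHOLD):
    """Return one call per bin for `chrom`: 1, 0, or 2 when the bin has no CpGs."""
    n_bins = chrom_length // bin_size   # assumption: trailing partial bin is dropped
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for c, start, _end, pct in rows:
        if c != chrom:
            continue
        b = start // bin_size           # assumption: bin chosen by start position
        if b < n_bins:
            sums[b] += pct
            counts[b] += 1
    calls = []
    for total, n in zip(sums, counts):
        if n == 0:
            calls.append(MISSING)       # no CpG observed in this bin
        else:
            # assumption: a mean exactly at the threshold counts as methylated
            calls.append(1 if total / n >= threshold else 0)
    return calls


# Hypothetical toy input, not real data.
rows = [
    ("chr1", 10, 11, 80.0),
    ("chr1", 150, 151, 40.0),
    ("chr1", 250, 251, 10.0),
    ("chr2", 5, 6, 100.0),
    ("chr1", 700, 701, 90.0),
]
calls = binarize_methylation(rows, "chr1", chrom_length=1000)
print(calls)
```

The resulting list can then be written out one value per line under the two header lines of a ChromHMM binarized file.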