nloyfer / wgbs_tools

tools for working with Bisulfite Sequencing data while preserving reads intrinsic dependencies
Other
134 stars 37 forks source link

merge #21

Closed joe448 closed 1 year ago

joe448 commented 1 year ago

I saw that the WBC-WGBS (n=23) samples were uploaded. When trying to merge all 23 of them with: wgbstools merge -T tmp -p WBC-merged *-WBC-WGBS-Rep1.pat.gz it runs, generates a tmp file with several Gb and finishes writing WBC-merged.pat.gz without error. However if I check the coverage with beta_cov after generating a .beta from WBC-merged.pat.gz I get _WBCmerged 238.43 and e.g. _GSM6810026CNVS-NORM-110000263-WBC-WGBS-Rep1 82.86

Shouldn't it be much higher (>1'000x) considering 23 samples with a coverage of 50-80x?

nloyfer commented 1 year ago

Right. This is the beta format's weak point. To keep the beta files small (~54Mb), we represent all ~56M values (number of methylated observations, number of observations for each CpG) with uint8 (i.e. 8 bits). This means the maximal value in a beta file is 255. Usually it's enough (for WGBS data), but sometimes it's not. When a value exceeds 255, the row is normalized to keep the correct beta value, while trimming the number of observations to 255. See implementation in trim_to_utin8 here: https://github.com/nloyfer/wgbs_tools/blob/a6ba1d17498ade4b79b3a3c057457ded7ca7090d/src/python/utils_wgbs.py#L252

When the true (high) coverage does matter, I suggest using the .lbeta format. Which is identical to .beta, except that it represent values with uint16 instead of uint8. This way the limit is ~65,000, and the file sizes are ~107Mb.

To use lbeta, simply add the --lbeta flag. For example, wgbstools pat2beta --lbeta PATH_TO_PAT Most other wgbstools commands (e.g. beta_cov) can handle lbeta files implicitly.

joe448 commented 1 year ago

ah, that makes sense. Thanks a lot for the explanation and the quick response!