Closed joe448 closed 1 year ago
Right. This is the beta
format's weak point.
To keep the beta
files small (~54Mb), we represent all ~56M values (number of methylated observations, number of observations for each CpG) with uint8 (i.e. 8 bits). This means the maximal value in a beta file is 255. Usually it's enough (for WGBS data), but sometimes it's not.
When a value exceeds 255, the row is normalized to keep the correct beta value, while trimming the number of observations to 255. See implementation in trim_to_utin8
here:
https://github.com/nloyfer/wgbs_tools/blob/a6ba1d17498ade4b79b3a3c057457ded7ca7090d/src/python/utils_wgbs.py#L252
When the true (high) coverage does matter, I suggest using the .lbeta
format. Which is identical to .beta
, except that it represent values with uint16 instead of uint8. This way the limit is ~65,000, and the file sizes are ~107Mb.
To use lbeta
, simply add the --lbeta
flag. For example, wgbstools pat2beta --lbeta PATH_TO_PAT
Most other wgbstools
commands (e.g. beta_cov
) can handle lbeta
files implicitly.
ah, that makes sense. Thanks a lot for the explanation and the quick response!
I saw that the WBC-WGBS (n=23) samples were uploaded. When trying to merge all 23 of them with: wgbstools merge -T tmp -p WBC-merged *-WBC-WGBS-Rep1.pat.gz it runs, generates a tmp file with several Gb and finishes writing WBC-merged.pat.gz without error. However if I check the coverage with beta_cov after generating a .beta from WBC-merged.pat.gz I get _WBCmerged 238.43 and e.g. _GSM6810026CNVS-NORM-110000263-WBC-WGBS-Rep1 82.86
Shouldn't it be much higher (>1'000x) considering 23 samples with a coverage of 50-80x?