How to prepare data for WGBS data ？

hmyh1202 commented 1 year ago

How to prepare data for WGBS data ， Thank you!

assafgrw commented 1 year ago

I am wondering regarding the same issue. I guess that data should be mapped to a bed file containing al the cpgs in the human genome. @hmyh1202 did you find an answer?

nloyfer commented 1 year ago

Hi, I would use wgbstools as follows:

download and install wgbstools
run wgbstools bam2pat to create beta files from the bam files
run wgbstools beta_to_450k to generate a csv table compatible with meth_atlas.

But if you don't want to install wgbstools, you don't have to. You can call the methylation values from the bam files with your favorite methylation caller (e.g. bismark, MethylDackel, etc.) to create a bed file for all CpG sites. Then, merge the output (by chromosome and position) with the Illumina Infinium HumanMethylation450K manifest, e.g. from here, to map the CpGs to their corresponding 450K annotations.

assafgrw commented 1 year ago

Thank for the prompt replay @nloyfer Actually what I ment in my question is: How to perpre my data (ONT) for deconv using the WGBS based atlas you recently publish (Nature 2023)?

As far as I undersatnd wgbstools is for bs data only?

I was thinking to do as in https://github.com/methylgrammarlab/cfdna-ont but am wondering what to map my data to? a bed file containing all the CpGs in the Genome? a bed file containing the "top 1% (20,997) blocks" (where can I find this bed file, it's not in the repo?) ? And finally which Atlas version shel I use with "deconvolve.py? there are few in the supplementy section"

nloyfer commented 1 year ago

I am adding ONT support for wgbstools. Meaning, you could run wgbstools bam2pat on an ONT bam file (a bam file with the MM/ML fields) and generate a pat file. I finished implementing a beta version, and will push it to the wgbstools repo soon. And then you could run the UXM deconvolution, which is generally better than meth_atlas.

When using meth_atlas, you are limited to 450K/EPIC array sites, mostly because the atlases supplied in this repo (and the Nat. Comm paper) were array based. This is why I suggested reducing your whole-genome data to "450K" csv files. As we moved from array to whole genome (in the nature paper), we also changed our deconvolution method, to be based on fragment-level information (UXM).

assafgrw commented 1 year ago

Great, tahnk you, that exactly what I needed, waiting for this update. In the mean time you dont think that I can map my data to "Top 1% of variable blocks of >=4 CpGs (hg19)" (Table S2 i n the paper) and use it with meth_atalas with the new atlas (Atlas.U250.l4.hg19.full.tsv) as referens?

nloyfer commented 1 year ago

You mixed two reasonable options. You have to choose one set of regions/markers to work with. Either it's the ~20,000 blocks listed in Table S2, or the ~10,000 blocks (markers) listed in Atlas.U250.l4.hg19.full.tsv. The latter will work better in my opinion. Once you choose a set of regions, you need to build a reference atlas, i.e. 39 columns (or less), one for each cell type, with the methylation values of that cell type. (The <i,j> cell in the atlas table holds the methylation value of the marker i in cell type j). I recommend you do so by running wgbstools beta_to_table with the --groups_file flag, and the beta files we published in GSE186458. You shouldn't directly use the methylation values we published in Atlas.U250.l4.hg19.full.tsv, since they represent the "UXM score" (proportion of unmethylated reads), rather than the average methylation values.

assafgrw commented 1 year ago

Thank you Netanel,

I will try that, but was wondering, as far as I understand generating an Atlas ith beta_to_table will not include a feature selection step and all rows (Cpg clusters) will be included in the final Atlas. Am I right? would that yield fair deconv?

On Thu, May 4, 2023 at 9:02 PM Netanel Loyfer @.***> wrote:

You mixed two reasonable options. You have to choose one set of regions/markers to work with. Either it's the ~20,000 blocks listed in Table S2, or the ~10,000 blocks (markers) listed in Atlas.U250.l4.hg19.full.tsv. The latter will work better in my opinion. Once you choose a set of regions, you need to build a reference atlas, i.e. 39 columns (or less), one for each cell type, with the methylation values of that cell type. (The <i,j> cell in the atlas table holds the methylation value of the marker i in cell type j). I recommend you do so by running wgbstools beta_to_table with the --groups_file flag, and the beta files we published in GSE186458 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE186458. You shouldn't directly use the methylation values we published in Atlas.U250.l4.hg19.full.tsv, since they represent the "UXM score" (proportion of unmethylated reads), rather than the average methylation values.

— Reply to this email directly, view it on GitHub https://github.com/nloyfer/meth_atlas/issues/8#issuecomment-1535192461, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHP7JFGAOBI5U6KPE32IWKLXEPVMLANCNFSM6AAAAAATWENEVY . You are receiving this because you commented.Message ID: @.***>

assafgrw commented 1 year ago

Hi Netanel,

I hope that you are doing well,

I am really looking forward to the wgbs version to support BAM files generated by ONT. I was wondering if you are planning to upload this version soon?

Thanks and shana-tova Assaf

On Thu, May 4, 2023 at 2:59 PM Netanel Loyfer @.***> wrote:

I am adding ONT support for wgbstools. Meaning, you could run wgbstools bam2pat on an ONT bam file (a bam file with the MM/ML fields) and generate a pat file. I finished implementing a beta version, and will push it to the wgbstools repo soon. And then you could run the UXM deconvolution, which is generally better than meth_atlas.

When using meth_atlas, you are limited to 450K/EPIC array sites, mostly because the atlases supplied in this repo (and the Nat. Comm paper) were array based. This is why I suggested reducing your whole-genome data to "450K" csv files. As we moved from array to whole genome (in the nature paper), we also changed our deconvolution method, to be based on fragment-level information (UXM https://github.com/nloyfer/UXM_deconv/).

— Reply to this email directly, view it on GitHub https://github.com/nloyfer/meth_atlas/issues/8#issuecomment-1534644269, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHP7JFBSTXJ3SOWDF2N7SWLXEOKZBANCNFSM6AAAAAATWENEVY . You are receiving this because you commented.Message ID: @.***>

shaohuaihan commented 1 year ago

Hi，

Finally I used methylCC R package.

sapply(c("tidyverse", "wesanderson", "openxlsx", "rlist", "FlowSorted.CordBlood.450k", "IlluminaHumanMethylation450kmanifest", "IlluminaHumanMethylation450kanno.ilmn12.hg19", "genefilter", "rtracklayer", "AnnotationHub", "minfi", "bsseq", "methylCC"), require, character.only = TRUE)

read CpG info of BSseq Object file

print("Load BSseq Objects") bs.filtered.bsseq <- readRDS("Filtered_BSseq.rds")

print("liftOver human T2T to hg19") mcols(bs.filtered.bsseq)$index <- 1:length(bs.filtered.bsseq) hs1 <- rowRanges(bs.filtered.bsseq) hs1$index <- 1:length(hs1) ch = import.chain("chm13-hg19.over.chain") # liftover file of UCSC hg19 <- liftOver( hs1, chain = ch ) %>% unlist()

Subset based on regions that lifted over

bs.filtered.bsseq.hg19 <- bs.filtered.bsseq[which(hg19$index %in% hs1$index)] rowRanges(bs.filtered.bsseq.hg19) <- hg19

glue::glue("{length(bs.filtered.bsseq.hg19)} out of {length(bs.filtered.bsseq)} were lifted over") glue::glue("{length(bs.filtered.bsseq) - length(bs.filtered.bsseq.hg19)} did not liftOver")

bs.filtered.bsseq.hg19

CC <- bs.filtered.bsseq.hg19 %>% methylCC::estimatecc(include_cpgs = TRUE, include_dmrs = TRUE)

est <- cell_counts(CC)

write.table(est, file = "Cell_est.xls", sep = "\t", quote = FALSE, row.names = TRUE)

The results like this below: Sample Gran CD4T CD8T Bcell Mono NK sample_1 0.093594 0.2084 0.12925 0.19947 ...

some notes may be helpfull:

https://github.com/stephaniehicks/methylCC/issues/3 https://bioconductor.org/packages/release/bioc/vignettes/methylCC/inst/doc/methylCC.html#2_Getting_Started

nloyfer / meth_atlas

How to prepare data for WGBS data ？ #8

read CpG info of BSseq Object file

Subset based on regions that lifted over