sarahet / RLM

Read level DNA methylation analysis of bisulfite converted sequencing data
BSD 3-Clause "New" or "Revised" License
18 stars 3 forks source link

BED input for entropy calculation? #7

Closed PanZiwei closed 7 months ago

PanZiwei commented 7 months ago

Hi, Thanks for the amazing tool. I already have per-read methylation results in BED format with chromosome, start position, end position, read name and methylation frequency for each CpG site.

Is it possible to use the BED file as RLM input for to calculate entropy if possible? Any hints for the implementation? Thanks.

sarahet commented 7 months ago

Hi - unfortunately this is not possible with RLM as the implementation requires a BAM or SAM file. If you would want to calculate the entropy yourself you would need to summarize the read-wise information per 4-mer of consecutive CpGs, which is what RLM does internally. I would unfortunately not consider to add this option to RLM as the application also pre-filters the BAM file before summarizing the information per 4-mer to exclude reads with certain properties like mismatches at CpG positions or indels that would affect the entropy calculation. This would not be possible when starting with a BED file.

PanZiwei commented 7 months ago

Thanks for the quick response! I have 2 quick questions about the 4-mer calculation for RLM calculation:

  1. How do you deal NA site on a single read for the 4-mer calculation?
  2. Do you merge the strand information for the CpG site?

I provided a quick example below.

Sequence(+): -CG---CG---CG---CG---
Sequence(-): -GC---GC---GC---GC---
Read1(+):    -C----C----N----C----
Read2(-):    --C----C----C----C---

Thank you so much for your help!

sarahet commented 7 months ago

Regarding 1: Currently the read (in your example read 1) is discarded if it has a mismatch at any CpG position even though this is a bit aggressive and I have been thinking about implementing a strategy that would allow partial read usage. On the other hand, it is rather rare that more than 4 CpGs fit on a read and then in a way that the missing value would not affect (almost) all 4-mers the read can span. So practically it might not even be possible to rescue a lot of reads. Coming back to your example, due to the missing value in this 4-mer it would not be possible to compute the entropy including read 1, so RLM removes it.

Regarding 2: Yes, the strand information is merged, i.e. read 1 and read 2 are considered together for the entropy calculation at these 4 CpGs.

PanZiwei commented 7 months ago

Thank you so much for the detailed reply! Really helpful. I will close the ticket then.

PanZiwei commented 7 months ago

Sorry for reopening the ticket. One more question about the 4-kmer, do you move 1 CpG for 4-mer? Does the 4-mer overlap with each other?

For example, in a specific region containing 8 CpG sites, does you count 2 4-mer(non-overlapped) or 5 4-mer(move 1 CpG every time)?

Sequence(+): -CG-CG-CG-CG-CG-CG-CG-CG-
sarahet commented 7 months ago

They are overlapping, so in your example it would be 5. RLM outputs for each k-mer the starting CpG coordinate when the entropy is calculated.