Closed PanZiwei closed 9 months ago
Hi - unfortunately this is not possible with RLM as the implementation requires a BAM or SAM file. If you would want to calculate the entropy yourself you would need to summarize the read-wise information per 4-mer of consecutive CpGs, which is what RLM does internally. I would unfortunately not consider to add this option to RLM as the application also pre-filters the BAM file before summarizing the information per 4-mer to exclude reads with certain properties like mismatches at CpG positions or indels that would affect the entropy calculation. This would not be possible when starting with a BED file.
Thanks for the quick response! I have 2 quick questions about the 4-mer calculation for RLM calculation:
I provided a quick example below.
Sequence(+): -CG---CG---CG---CG---
Sequence(-): -GC---GC---GC---GC---
Read1(+): -C----C----N----C----
Read2(-): --C----C----C----C---
Thank you so much for your help!
Regarding 1: Currently the read (in your example read 1) is discarded if it has a mismatch at any CpG position even though this is a bit aggressive and I have been thinking about implementing a strategy that would allow partial read usage. On the other hand, it is rather rare that more than 4 CpGs fit on a read and then in a way that the missing value would not affect (almost) all 4-mers the read can span. So practically it might not even be possible to rescue a lot of reads. Coming back to your example, due to the missing value in this 4-mer it would not be possible to compute the entropy including read 1, so RLM removes it.
Regarding 2: Yes, the strand information is merged, i.e. read 1 and read 2 are considered together for the entropy calculation at these 4 CpGs.
Thank you so much for the detailed reply! Really helpful. I will close the ticket then.
Sorry for reopening the ticket. One more question about the 4-kmer, do you move 1 CpG for 4-mer? Does the 4-mer overlap with each other?
For example, in a specific region containing 8 CpG sites, does you count 2 4-mer(non-overlapped) or 5 4-mer(move 1 CpG every time)?
Sequence(+): -CG-CG-CG-CG-CG-CG-CG-CG-
They are overlapping, so in your example it would be 5. RLM outputs for each k-mer the starting CpG coordinate when the entropy is calculated.
Hi, Thanks for the amazing tool. I already have per-read methylation results in BED format with chromosome, start position, end position, read name and methylation frequency for each CpG site.
Is it possible to use the BED file as
RLM
input for to calculate entropy if possible? Any hints for the implementation? Thanks.