Closed ebioman closed 7 years ago
Good question. If CpG sites are evenly spaced at every 50 bp, 2,000 bp hypomethylated region will consist of 40 CpG sites. Here, we ignored the possible difference between epigenetic structures of human and fish. Selection of this number, say, 50, should be also based on how much read depth you have, as mentioned in the paper (please see supplemental figures too). With enough reads, we achieved the best performance with setting this to be 35. Regarding other species, if the expected number of CpG sites in the hypomethylated regions are available, you can use this number as a baseline, and then tune it according to the read depth.
Hi
Selection of this number, say, 50, should be also based on how much read depth you have, as mentioned in the paper (please see supplemental figures too). With enough reads, we achieved the best performance with setting this to be 35
I am curious, I found that the theoretical coverage (e.g. roughly 1 Gbp/cell then divided by genome size) would normally results in a close guess to the final observations in other PacBio related analysis. Here this seems different - or at least in the coverage wiggles file. It is there much lower than expected. E.g. I have 152 smrtcells adding up a theoretical coverage of >70x. So I thought that I could definetely head for l=35. But then I am looking at the coverage file and it seems more to be ~20x in non-repeated regions. So that would rather suggest then l=50?
This wiggle file reports coverage on each strand basis, just as PacBio's modifications.csv reports information on each strand separately. Actually, both file shows # of IPDs used for calculating the statistics, so it may slightly different (smaller) than the expected coverage. Information on two starnds are then combined appropriately, the number should be doubled for comparison with other data. That means practically you have ~40x data in this case, so you should use l=35.
Thanks, that explains a lot Cheers
Hi I have a question regarding the adaption for non-human and non-fish genomes. In your paper I found the following paragraph:
This confuses me a little bit, how is the length of hypomethylated regions in human linked to the average distance between neighboring sites in the medaka genome ?
So we have the region A for which with a sliding window of 21bp an IPDR vector is calculated. Then there should be at least 50 CpG sites , this I understood.
But how is this decision based on the fact that human haves regions of average 2kbp and furthermore the fact that the average distance between CpG sites is roughly 50 bp in medaka?
I guess more importantly, is then the question how I can adapt this e.g. for a rodent and which prior Information do I need ?