hacone / AgIn

Process SMRT sequencing kinetic summary to predict regional methylation on large genome
12 stars 7 forks source link

Adaption for non-human/fish genome #5

Closed ebioman closed 7 years ago

ebioman commented 8 years ago

Hi I have a question regarding the adaption for non-human and non-fish genomes. In your paper I found the following paragraph:

In our analysis, we imposed the constraint that each region contained at least b CpG sites. For example, we can set b to 50 because the average length of hypomethylated regions in five human cell types is approximately 2kb (Su et al., 2012) and the average distance between neighboring CpG sites in the medaka genome is 53.5 bases, although this constraint should be adjusted according to each individual situation.

This confuses me a little bit, how is the length of hypomethylated regions in human linked to the average distance between neighboring sites in the medaka genome ?

So we have the region A for which with a sliding window of 21bp an IPDR vector is calculated. Then there should be at least 50 CpG sites , this I understood.

But how is this decision based on the fact that human haves regions of average 2kbp and furthermore the fact that the average distance between CpG sites is roughly 50 bp in medaka?

I guess more importantly, is then the question how I can adapt this e.g. for a rodent and which prior Information do I need ?

hacone commented 8 years ago

Good question. If CpG sites are evenly spaced at every 50 bp, 2,000 bp hypomethylated region will consist of 40 CpG sites. Here, we ignored the possible difference between epigenetic structures of human and fish. Selection of this number, say, 50, should be also based on how much read depth you have, as mentioned in the paper (please see supplemental figures too). With enough reads, we achieved the best performance with setting this to be 35. Regarding other species, if the expected number of CpG sites in the hypomethylated regions are available, you can use this number as a baseline, and then tune it according to the read depth.

ebioman commented 7 years ago

Hi

Selection of this number, say, 50, should be also based on how much read depth you have, as mentioned in the paper (please see supplemental figures too). With enough reads, we achieved the best performance with setting this to be 35

I am curious, I found that the theoretical coverage (e.g. roughly 1 Gbp/cell then divided by genome size) would normally results in a close guess to the final observations in other PacBio related analysis. Here this seems different - or at least in the coverage wiggles file. It is there much lower than expected. E.g. I have 152 smrtcells adding up a theoretical coverage of >70x. So I thought that I could definetely head for l=35. But then I am looking at the coverage file and it seems more to be ~20x in non-repeated regions. So that would rather suggest then l=50?

hacone commented 7 years ago

This wiggle file reports coverage on each strand basis, just as PacBio's modifications.csv reports information on each strand separately. Actually, both file shows # of IPDs used for calculating the statistics, so it may slightly different (smaller) than the expected coverage. Information on two starnds are then combined appropriately, the number should be doubled for comparison with other data. That means practically you have ~40x data in this case, so you should use l=35.

ebioman commented 7 years ago

Thanks, that explains a lot Cheers