hacone / AgIn

Process SMRT sequencing kinetic summary to predict regional methylation on large genome

Nonsensical Output #13

Open justinjohns opened 5 years ago

justinjohns commented 5 years ago

Hi, I have gotten AgIn to run smoothly with the examples, but when I use my own data the class.wig output marks everything as methylated, which makes me think I have made a mistake.

I have a cmp.h5 file (P6C4 chemistry) mapped to one chromosome (Super-Scaffold_9_Super-Scaffold_99__chr18). I have the reference (Ccornix5.5.fasta) and a corresponding .fai file.

I first generate a modifications file for AgIn with

ipdSummary ~/Super-Scaffold_9_Super-Scaffold_99__chr18.sorted.cmp.h5 --reference ~/Ccornix5.5.fasta --identify m5C_TET --methylFraction --gff ~/5mC_WG.gff --csv ~/5mC_WG.csv

And run AgIn with

~/AgIn-0.9/target/dist/bin/launch -i ~/5mC_WG.csv -f ~/Ccornix5.5.fasta -o ~/agin_out/chr18 -b ~/AgIn-0.9/resource/P6C4.dat -l 40 -g -0.55 predict

I'm wondering if you can quickly spot if my issue is with ipdSummary and I should re-evaluate this step, or if something funky is going on with AgIn.

I've attached the head of the modifications CSV I'm using, as well as my output.

5mC_WG.csv.txt chr18.gff.txt chr18_class.wig.txt chr18_coverage.wig.txt

Justin

hacone commented 5 years ago

Sorry for the delayed reply.

At first I thought something might be wrong with AgIn, because I know it doesn't perform well with the latest Sequel platform and was tested only against certain vertebrates (a fish and human) where the majority of CpGs are methylated by default, so AgIn could be somewhat biased towards methylation by design in order to perform best on such samples.

But I've now realized that your data is from P6-C4. I've checked your input files and they seem OK. It looks to me like the methylation is real: the IPD ratios in the CSV file deviate greatly from the neutral value of 1.0 around CpG sites, and you have enough sequencing depth on them.
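For anyone who wants to repeat this kind of check on their own file, here is a minimal Python sketch that summarizes coverage and IPD ratio per strand. It assumes the CSV has `strand`, `coverage` and `ipdRatio` columns, as in the usual ipdSummary output; the function name `summarize_kinetics` and the `min_coverage` default of 15 are illustrative choices, not part of AgIn or ipdSummary. The demo rows are made up. It does not locate CpG sites, which would require the reference sequence.

```python
import csv
import io
from collections import defaultdict


def summarize_kinetics(csv_file, min_coverage=15):
    """Summarize coverage and IPD ratio per strand.

    Assumes columns named 'strand', 'coverage' and 'ipdRatio'
    (an assumption about the ipdSummary CSV layout).
    """
    stats = defaultdict(lambda: {"n": 0, "cov": 0.0, "ipd": 0.0, "deep": 0})
    for row in csv.DictReader(csv_file):
        s = stats[row["strand"]]
        cov = float(row["coverage"])
        s["n"] += 1
        s["cov"] += cov
        s["ipd"] += float(row["ipdRatio"])
        if cov >= min_coverage:
            s["deep"] += 1
    return {
        strand: {
            "positions": s["n"],
            "mean_coverage": s["cov"] / s["n"],
            "mean_ipd_ratio": s["ipd"] / s["n"],
            # Fraction of positions with at least min_coverage reads.
            "frac_at_min_coverage": s["deep"] / s["n"],
        }
        for strand, s in stats.items()
    }


# Made-up rows for illustration only; replace with open("5mC_WG.csv").
demo = io.StringIO(
    "refName,tpl,strand,base,ipdRatio,coverage\n"
    "chr18,1,0,C,2.4,31\n"
    "chr18,2,0,G,1.1,30\n"
    "chr18,2,1,C,2.9,28\n"
)
for strand, summary in summarize_kinetics(demo).items():
    print(strand, summary)
```

A mean IPD ratio near 1.0 with low coverage would point back at the ipdSummary step; high coverage with ratios well away from 1.0 is more consistent with a real signal.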

Please note that AgIn can only predict methylation status as a regional annotation. The fact that it reported only methylated regions means that it could not detect any sufficiently long cluster of unmethylated CpGs; it does not mean that every single CpG is actually methylated.

To capture finer fluctuations in methylation status, you may want to try a smaller -l, say 30 or 20. Judging from the CSV file, you seem to have enough sequencing depth to gain finer resolution while retaining accuracy (I confirmed that -l 40 worked well for data with 15x coverage per strand).
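One way to compare several values of -l side by side is to generate one command per value, each writing to its own output prefix. The Python sketch below only builds and prints the commands; it reuses the flags from the command posted above and changes nothing else. The helper name `agin_commands` and the `_l<N>` output suffix are hypothetical conventions for this example.

```python
import shlex


def agin_commands(launch, csv_path, fasta, out_prefix, beta,
                  window_sizes=(40, 30, 20), gamma=-0.55):
    """Build one AgIn 'predict' command per -l value.

    Flags mirror the command in the original post; the '_l<N>' suffix
    on the output prefix is a made-up naming convention.
    """
    commands = []
    for l in window_sizes:
        commands.append([
            launch,
            "-i", csv_path,
            "-f", fasta,
            "-o", f"{out_prefix}_l{l}",
            "-b", beta,
            "-l", str(l),
            "-g", str(gamma),
            "predict",
        ])
    return commands


# Paths copied from the original post. Only '~' is left unquoted by
# shlex.join, so the printed lines can be pasted into a shell as-is.
for cmd in agin_commands(
    "~/AgIn-0.9/target/dist/bin/launch",
    "~/5mC_WG.csv",
    "~/Ccornix5.5.fasta",
    "~/agin_out/chr18",
    "~/AgIn-0.9/resource/P6C4.dat",
):
    print(shlex.join(cmd))
```

Keeping -g fixed while varying only -l makes it easier to attribute any change in the class.wig output to the window size.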

Let me know if you have further questions about AgIn. (Sorry that it has not been well maintained recently...)

justinjohns commented 5 years ago

Thanks for your insight! We have WGBS data arriving in the next few weeks to pair with our P6-C4 individuals, which will hopefully clarify if these signals are correct! Glad to know it's working as intended.

Justin