jts / nanopolish

Signal-level algorithms for MinION data
MIT License
568 stars 159 forks source link

How to understand eventalign samples output #717

Closed jorgcalis closed 4 years ago

jorgcalis commented 4 years ago

Dear developers/@jts,

Thanks for your great work making/maintaining nanopolish. It is a pleasure using nanopolish. I am looking at direct RNA-seq data with eventalign, and want to work with the per sample raw data. Eventually, to compare the measurements at the same reference position from different reads. I looked at the high-level explanation blog, but there are still some parts of the output that I do not fully understand:

Here bits of a single line of output. reference_kmer: CAGGT event_index: 588 event_level_mean: 104.48 event_stdv: 4.682 event_length: 0.00299 model_mean: 98.91 model_stdv: 9.52 standardized_level: 0.53 samples: 116.896,118.601,120.151,131.465,124.336,117.361,118.136,119.376,129.605

My specific questions:

Feel free to point me to some area where this has been explained before. Thanks for your help and keep up the good work!

Best, Jorg

jts commented 4 years ago

Thanks, I'm glad to hear you like the program.

Both of your questions are due to how nanopolish scales the reads to account for per-read variations in signal. Internally nanopolish scales the pore model (model_mean, model_stdv) instead of scaling the events (the reason why we do this is a bit lengthy but I'm happy to explain if it helps your work). The samples output however is scaled, which is why the mean of the samples doesn't match event_mean. If you pass the --scale-events flag both samples and event_mean will be scaled and columns will be consistent.

Similarly the standardized_level column has an additional scaling parameter (var) applied:

https://github.com/jts/nanopolish/blob/master/src/alignment/nanopolish_eventalign.cpp#L460

The best reference for how the scaling works is in the supplement of our methylation calling paper:

https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.4184/MediaObjects/41592_2017_BFnmeth4184_MOESM258_ESM.pdf

I hope this helps, let me know if you have any followup questions or if this in unclear.

Jared

jorgcalis commented 4 years ago

Dear Jared,

Thanks for a quick response.

If I understand it correctly, the scaling will work to make the samples of different reads best comparable. The unscaled values in the output are event_index, event_level_mean and event_stdv, unless you pass --scale-events, than all output will be scaled. I'll go through the reference and let you know if I have any followup questions.

Thanks again! Best, Jorg

iremtll commented 2 years ago

Dear developers,

why are there several values (samples) for one read? Is this how nanopore works?

Thanks a lot! Best