nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
146 stars 8 forks source link

Validation with ground truth #177

Open vyx-lucy-kaplun opened 6 months ago

vyx-lucy-kaplun commented 6 months ago

Hi, I have a few questions regarding the use of modkit validate. 1) Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else? And should I somehow account specifically for the CpG context?

1   chrom 
2   start position  
3   end position    
4   mod code    
6   strand  

2) Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?

chr1    600684  600686  26.923  7   19
chr1    600714  600716  38.71   12  19

3) Do we necessarily need the strand information in the truth file?

Thank you!

marcus1487 commented 6 months ago

At a high level I would note that this command is intended for the validation of samples where the modified base status at reference positions is known with absolute certainty. It is possible to run this command with any type of data, but the reported metrics represent the combination of the model performance and the validity of the ground truth data. If you have partial methylation at reference positions you may be more interested in comparing different samples or technologies with a heatmap/correlation analysis, but that is not the intended purpose of this command. Answers to your specific questions are below.

Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else?

The mod code for modified bases is as specified in the BAM tag format. So for common modified bases this will be a single letter code (m = 5mC, h = 5hmC, etc; see BAM tag spec for full list). For other bases the mod code will be the ChEBI code for the modified base (e.g. 21839 = 4mC).

For canonical positions in the reference the - character is used.

Here is an example for two modified positions:

chr1      10      11      m       0       +
chr2      50      51      m       0       +

And for canonical positions.

chr1      10      11      -       0       +
chr2      50      51      -       0       +

And should I somehow account specifically for the CpG context?

The positions in the BED file specify the positions at which you know the modified base identity with 100% certainty for all reads covering the reference position. Any such positions can be added to the BED file. If these are CpG sites then these can be added to the ground truth BED file. You can generate several BED files if you wish to validate accuracy on different subsets of reference positions (e.g. CG, CHG, and CHH contexts).

Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?

Generally this command is intended for positions where the modified base status at a reference position is known with absolute certainty. It is possible to set thresholds given an alternative technology (or even replicates of the same one), but in this case any differences due to biological variation will be represented as "errors" which may not accurately represent the performance of the modified base calls.

Do we necessarily need the strand information in the truth file?

Yes. Modified bases occur on a single strand not on both. Some sites show constitutive methylation on both strands within certain motifs (e.g. 5mC in CG contexts). In these cases both strands can be added to the ground truth BED file.

vyx-lucy-kaplun commented 6 months ago

Thank you!