Open vyx-lucy-kaplun opened 6 months ago
At a high level I would note that this command is intended for the validation of samples where the modified base status at reference positions is known with absolute certainty. It is possible to run this command with any type of data, but the reported metrics represent the combination of the model performance and the validity of the ground truth data. If you have partial methylation at reference positions you may be more interested in comparing different samples or technologies with a heatmap/correlation analysis, but that is not the intended purpose of this command. Answers to your specific questions are below.
Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else?
The mod code for modified bases is as specified in the BAM tag format. So for common modified bases this will be a single letter code (m
= 5mC, h
= 5hmC, etc; see BAM tag spec for full list). For other bases the mod code will be the ChEBI code for the modified base (e.g. 21839
= 4mC).
For canonical positions in the reference the -
character is used.
Here is an example for two modified positions:
chr1 10 11 m 0 +
chr2 50 51 m 0 +
And for canonical positions.
chr1 10 11 - 0 +
chr2 50 51 - 0 +
And should I somehow account specifically for the CpG context?
The positions in the BED file specify the positions at which you know the modified base identity with 100% certainty for all reads covering the reference position. Any such positions can be added to the BED file. If these are CpG sites then these can be added to the ground truth BED file. You can generate several BED files if you wish to validate accuracy on different subsets of reference positions (e.g. CG, CHG, and CHH contexts).
Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?
Generally this command is intended for positions where the modified base status at a reference position is known with absolute certainty. It is possible to set thresholds given an alternative technology (or even replicates of the same one), but in this case any differences due to biological variation will be represented as "errors" which may not accurately represent the performance of the modified base calls.
Do we necessarily need the strand information in the truth file?
Yes. Modified bases occur on a single strand not on both. Some sites show constitutive methylation on both strands within certain motifs (e.g. 5mC in CG contexts). In these cases both strands can be added to the ground truth BED file.
Thank you!
Hi, I have a few questions regarding the use of modkit validate. 1) Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else? And should I somehow account specifically for the CpG context?
2) Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?
3) Do we necessarily need the strand information in the truth file?
Thank you!