nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Probability threshold #80

Open NickNCL opened 7 months ago

NickNCL commented 7 months ago

Hi,

I wanted to ask about the rationale for the empirically determined filtering threshold used by default. Surely it would make more sense for this to be fixed, e.g. between different replicates/conditions? Would data be comparable between replicates if the filtering threshold varies?

I'm getting a warning that the default selected threshold is low:

> Threshold of 0.5722656 for base A is very low. Consider increasing the filter-percentile or specifying a higher threshold. Threshold of 0.6074219 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.

I'm presuming this is because I have generally low probability scores, maybe due to the older nanopore sequencing chemistry I'm using (V10) and possibly also the basecalling model I used (all_context modification model)

I'm hoping that high sequencing depth might, to a certain extent, compensate for a higher error rate.

Is there any guidance on a generally acceptable threshold?

marcus1487 commented 7 months ago

> I wanted to ask about the rationale for the empirically determined filtering threshold used by default. Surely it would make more sense for this to be fixed, e.g. between different replicates/conditions? Would data be comparable between replicates if the filtering threshold varies?

For larger studies using the same model across several samples, it would certainly be best to select a constant threshold to apply across samples. Data should still be comparable when filtering varies across samples, but this may cause complications in some situations. The default of filtering 10% of the data simply provides a better default across different modified base models and sample types. This is also the threshold used to report all of our modified base accuracy metrics, so applying this threshold should reproduce those accuracy metrics.
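To make the "filter 10% of the data" default concrete, here is a minimal sketch (not modkit's actual implementation, which is in Rust) of what it means to turn that percentile into a pass threshold. The confidence values are simulated; in practice they would be the probability of each call that was made.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-call confidences (probability of whichever call was
# made, canonical or modified), skewed toward 1.0 as a real model's are.
confidences = rng.beta(8, 2, size=100_000)

# The default behaviour described above: choose the threshold at the
# 10th percentile of confidences, so the 10% lowest-confidence calls
# fall below it and are filtered out.
threshold = np.quantile(confidences, 0.10)

kept = confidences >= threshold
print(f"threshold ~ {threshold:.3f}, fraction of calls kept = {kept.mean():.2f}")
```

Because the threshold is a percentile of the observed confidences, a lower-accuracy model (shifted distribution) yields a lower threshold for the same 10% cut, which is exactly why the warning above fires on older chemistries.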

For the low threshold on the older model, this makes sense. The best recommendation here would be to upgrade to the latest kits where the accuracy (and confidence in each call) is much higher. For higher accuracy from these older models you may have some success increasing the filtering threshold, but this will obviously drop some data as well.

> Is there any guidance on a generally acceptable threshold?

The guidance for single-sample analysis would be to use the default filtering threshold of 10% of the data. The recommendation for studies with more samples would be to estimate the threshold from all samples (or a uniform sampling over all samples/conditions) and apply this one threshold across the samples. Some care may be needed here to avoid including a sample with drastically lower quality than the rest, which would drive down the threshold. For example, if one sample were completely modified (not an intended target for modified base models), you might observe globally lower modified base confidence values; it may be best to exclude these types of samples when determining the threshold. Essentially, the recommendation for more complex samples is to explore the data and apply thresholds as best fits the desired analysis. A global recommendation in these settings is not really feasible.
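The multi-sample recommendation above (uniform sampling over samples, one shared threshold) can be sketched as follows. The sample names and confidence distributions are invented for illustration; the point is that each sample contributes equally to the pooled estimate, so one large or one weak sample does not dominate it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-sample call confidences; "C" is a lower-quality sample.
samples = {
    "A": rng.beta(9, 2, 50_000),
    "B": rng.beta(8, 2, 80_000),
    "C": rng.beta(5, 3, 20_000),
}

# Uniform sampling over samples: draw the same number of calls from each
# so that sample sizes do not skew the pooled distribution.
n = min(len(v) for v in samples.values())
pooled = np.concatenate(
    [rng.choice(v, size=n, replace=False) for v in samples.values()]
)

# One shared threshold (10th percentile of the pooled confidences),
# then applied uniformly to every sample.
threshold = np.quantile(pooled, 0.10)
for name, conf in samples.items():
    print(name, f"fraction passing: {(conf >= threshold).mean():.2f}")
```

Note how the weak sample ("C") passes a smaller fraction of its calls under the shared threshold, which is the expected behaviour; if a sample is a true outlier, the advice above is to exclude it from threshold estimation entirely.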

NickNCL commented 7 months ago

Thanks for the advice

Hopefully can move on to the newer kits soon we just have a bunch of older flow cells the boss wants me to use up :)

While I understand your statement that it's not feasible to provide a global threshold (always the way with bioinformatics...), the warnings within the software about a low threshold seem to indicate that certain thresholds are not advisable. This makes me think that I should be manually increasing it at least a little.

ArtRand commented 7 months ago

@NickNCL One thing you could do is run

```shell
mkdir -p ${out_dir}
modkit sample-probs ${your_bam} --hist --out-dir ${out_dir}
```

Then inspect `probabilities.txt` and `probabilities.tsv` and set the thresholds manually with

```shell
# in pileup
--filter-threshold A:{canonical_A_threshold} \
--mod-threshold a:{6mA_threshold} \
--filter-threshold C:{canonical_C_threshold} \
--mod-threshold m:{5mC_threshold}
```

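One way to pick those manual values is to walk the confidence histogram from the low end and take the edge below which roughly 10% of calls fall. The sketch below uses made-up bin edges and counts; check the actual column layout of the `probabilities.tsv` in your output directory before adapting it.

```python
import numpy as np

# Hypothetical histogram of call confidences for one base, e.g. tallied
# from the sample-probs output (edges and counts are invented here).
bin_edges = np.linspace(0.5, 1.0, 11)  # 10 bins over [0.5, 1.0]
counts = np.array([5, 8, 12, 20, 40, 80, 150, 300, 600, 1200])

# Cumulative fraction of calls from the low-confidence end; pick the
# smallest bin edge below which at least 10% of calls fall. That edge
# is a reasonable candidate for a manual --filter-threshold value.
cum = np.cumsum(counts) / counts.sum()
idx = np.searchsorted(cum, 0.10)
threshold = bin_edges[idx + 1]
print(f"manual threshold candidate: {threshold:.2f}")
```

Eyeballing the histogram that `--hist` prints gets you to the same place; the point is just to anchor the manual threshold in the observed distribution rather than guessing.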
Ge0rges commented 5 days ago

@ArtRand would it make sense to concatenate one's samples into a single BAM and run sample-probs assuming there are not outlying samples?

ArtRand commented 5 days ago

Hello @Ge0rges,

You can; there is a one-liner below. Do you want to know the distribution of probabilities when the samples are combined? Using a combination of `sample-probs` and `extract` is useful for exploring the base modification probabilities.

If you're trying to estimate a pass threshold, you shouldn't combine the samples unless you're planning on operating on them all together. The estimator in modkit will attempt to find the pass threshold such that the 10% lowest-confidence calls are filtered out. Typically you do this on a per-sample basis and produce artifacts from each sample. If you combine the samples, you can still filter out the 10% lowest-confidence calls overall, but it is possible that you won't remove calls from each of the constituent samples evenly. Maybe this is fine; it depends on your use case.
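The uneven-filtering caveat above is easy to demonstrate with simulated confidences (the two distributions are invented for illustration): a single threshold estimated on the merged data filters exactly 10% overall, but far more than 10% from the weaker sample.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical samples with different confidence distributions.
a = rng.beta(9, 2, 60_000)  # higher-quality sample
b = rng.beta(6, 3, 60_000)  # lower-quality sample

# One threshold estimated from the merged data, as if from a merged BAM.
combined = np.concatenate([a, b])
t = np.quantile(combined, 0.10)

# Overall ~10% of calls are filtered, but the split is uneven: the
# lower-quality sample loses far more of its calls than the other.
for name, x in (("a", a), ("b", b)):
    print(name, f"fraction filtered: {(x < t).mean():.2f}")
```

If each sample will be analysed separately downstream, estimating a per-sample threshold (or a shared one from a uniform sampling, as discussed earlier in the thread) avoids this imbalance.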

```shell
# the sort by read name is a way to pseudo-randomize the reads
samtools merge -o - ${bam1} ${bam2} | samtools sort -n | ${modkit} sample-probs - --hist -o probs
```