mbhall88 / NanoVarBench

Evaluating Nanopore-based bacterial variant calling
https://doi.org/10.1101/2024.03.15.585313
MIT License

Divergence threshold for variant donor #3

Closed by mbhall88 6 months ago

mbhall88 commented 8 months ago

Related to #1, one important parameter is how close we want the genome that is "donating" variants to be in terms of ANI (average nucleotide identity).

As a refresher, the process is: we download all RefSeq assemblies for the sample's species and generate a Mash distance matrix from our sample's assembly to all of the species assemblies. Mash distance approximates 1 - ANI, so a distance of 0.005 corresponds to an ANI of ~99.5%.
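To make the distance-to-ANI relationship concrete, here is a minimal Python sketch. It is not the pipeline's actual code: the function names are made up for illustration, and it assumes the distances come from tab-separated `mash dist` output with the sample's assembly as the query, so that the first column names the candidate genome and the third column is the Mash distance.

```python
def mash_distance_to_ani(distance: float) -> float:
    """Approximate ANI (as a percentage) from a Mash distance, using ANI ~ 1 - D."""
    return (1.0 - distance) * 100.0


def parse_mash_dist(lines):
    """Parse tab-separated `mash dist` rows into (reference, distance) pairs.

    Assumed column order: reference ID, query ID, distance, p-value,
    shared hashes. Rows with fewer than three columns are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        pairs.append((fields[0], float(fields[2])))
    return pairs


if __name__ == "__main__":
    # Hypothetical rows, only to show the shape of the data.
    rows = [
        "candidate_A.fa\tsample.fa\t0.0001\t0\t996/1000\n",
        "candidate_B.fa\tsample.fa\t0.005\t0\t812/1000\n",
    ]
    for name, dist in parse_mash_dist(rows):
        print(f"{name}\tdistance={dist}\tANI~{mash_distance_to_ani(dist):.2f}%")
```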

This will obviously also control how many variants we generate in our truth set. 0.005 seemed like a reasonable first pass to me. Does anyone have any thoughts on different thresholds?

This paper has some interesting findings relating to a "gap" in ANI values between pairs of the same species in the range 99.2-99.8% ANI. 0.005 falls smack bang in the middle of this, but we take "the closest" distance to this, so even if there's a gap, we'd go to the first genome either side of this gap.

mbhall88 commented 8 months ago

The paper referenced above is very timely for this issue. I am in the process of trying to create the truth variants and mutated reference. In some of the species, the closest genome to a 0.005 mash distance is at either 0.0001 or 0.01. With the 0.0001 distance, we only get around 10 SNPs and even fewer indels. But if we select a genome with 0.01 distance, we get upwards of 20,000 SNPs and a few thousand indels. The 0.0001 distance obviously gives too few variants, but do we think 20,000 variants is too many? The alternative approach, if others feel 20,000 is too many, would be to subsample the VCF to a fixed number of SNPs and indels.

mbhall88 commented 8 months ago

After some discussion, we decided that 20,000 variants is probably okay. We will revisit this though if it becomes a problem. If it is a problem, we can just randomly subsample the VCF.
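If subsampling does become necessary, one way it could look is sketched below in plain Python. This is an illustration rather than the approach we settled on: `subsample_vcf`, its parameters, and the fixed seed are all hypothetical, and it makes the simplifying assumption that every record is either a SNP (REF and all ALT alleles one base long) or an indel, so anything that is not a SNP is counted as an indel.

```python
import random


def is_snp(record: str) -> bool:
    """True if REF and every ALT allele in a VCF record are a single base."""
    fields = record.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    return len(ref) == 1 and all(len(alt) == 1 for alt in alts)


def subsample_vcf(lines, n_snps, n_indels, seed=0):
    """Keep the header plus a random subset of SNPs and indels.

    Records keep their original order; if fewer records than requested
    exist for a type, all of that type are kept.
    """
    header, records = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            records.append(line)

    snp_idx = [i for i, rec in enumerate(records) if is_snp(rec)]
    snp_set = set(snp_idx)
    indel_idx = [i for i in range(len(records)) if i not in snp_set]

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    keep = set(rng.sample(snp_idx, min(n_snps, len(snp_idx))))
    keep |= set(rng.sample(indel_idx, min(n_indels, len(indel_idx))))
    return header + [rec for i, rec in enumerate(records) if i in keep]
```

Sampling SNPs and indels separately keeps the ratio between the two under our control, rather than leaving it to whatever the donor genome happens to give.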

So if I set the minimum mash distance to something like 0.002 and ask for the closest distance to 0.005, we should get a good number of variants. The species where there was a genome at a distance of ~0.005 had 5,000-10,000 SNPs.
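The selection rule described above (drop anything below the minimum distance, then take the genome closest to the target) can be sketched as follows. The function name and the example accessions are hypothetical; only the 0.002 and 0.005 values come from this thread.

```python
def pick_donor(distances, target=0.005, min_distance=0.002):
    """Choose the variant donor from a {genome_name: mash_distance} mapping.

    Genomes closer than `min_distance` are excluded, then the genome whose
    distance is nearest to `target` (on either side) is returned as a
    (name, distance) tuple. Returns None if no genome passes the filter.
    """
    candidates = {
        name: dist for name, dist in distances.items() if dist >= min_distance
    }
    if not candidates:
        return None
    return min(candidates.items(), key=lambda item: abs(item[1] - target))


if __name__ == "__main__":
    # Hypothetical distances mirroring the 0.0001 vs 0.01 case discussed above.
    species = {"genome_near": 0.0001, "genome_far": 0.01}
    print(pick_donor(species))                    # minimum applied
    print(pick_donor(species, min_distance=0.0))  # no minimum
```

Without the minimum, the 0.0001 genome is marginally closer to 0.005 than the 0.01 genome, which is how we ended up with the ~10 SNP truth sets.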