Closed mbhall88 closed 6 months ago
Here are some summary statistics for the truthset of variants I have generated on our current samples
sample                species                        dist   snps  insertions  deletions  n_variants
--------------------  -----------------------  ----------  -----  ----------  ---------  ----------
ATCC_10708__202309    Salmonella enterica      0.00557238  20681         271        298       21250
ATCC_35221__202309    Campylobacter lari        0.0112643  20390         195        218       20803
ATCC_35897__202309    Listeria welshimeri      0.00860768  18262         174        185       18621
ATCC_BAA-679__202309  Listeria monocytogenes   0.00499419  12931          78         91       13100
ATCC_19119__202309    Listeria ivanovii        0.00501018  11169         221        295       11685
ATCC_25922__202309    Escherichia coli         0.00501979   8460          92        235        8787
ATCC_33560__202309    Campylobacter jejuni     0.00509036   7926         161        179        8266
ATCC_17802__202309    Vibrio parahaemolyticus  0.00483188   4059         184        259        4502
BPH2947__202310       Staphylococcus aureus    0.00393593   2212          72         81        2365
dist is the Mash distance of the donor genome from the reference for that sample.
Generating truth VCFs
There are three main ways we can do this.
Option 1 has the advantage of giving us absolute control over what variants to simulate, at what rate, what the truth variants are etc. The disadvantage is that this doesn't really simulate "real" mutational processes - e.g., genes mutating at different rates based on function and compared to intergenic regions. One tool, mutation-simulator, does allow for more fine-grained simulations, but we would need to determine what the mutation rates are for each species etc.
Option 2 has the advantage of giving us more natural mutational processes. But it has the downside of making it a little more complicated to extract true variants, plus we have to determine regions of the genome to ignore (due to unmapped regions or high variability etc.).
Option 3 is a kind of hybrid of Options 1 and 2. It has the advantage of removing ambiguity around what our truthset of variants is, while using variants that really do occur between two strains. However, the downside is you're removing some of the challenge caused by the distance between the two strains - it's still a bit artificial.
We all agree that Option 1 is a no go.
Selecting the VCF reference
The VCF reference (vcfref) would be a different strain from the same species. One way I have played with doing this selection is to download all complete genome assemblies from RefSeq, create a Mash sketch of those genomes, and compute the Mash distance to the reference assembly, sorting by the distance column.
We then select a genome which has some distance from our original - preferably not the lowest distance. The Mash distance approximates 1-ANI (one minus the average nucleotide identity), so we could set a divergence we would like, such as 0.5% (an ANI of about 99.5%), and then select a genome with a distance close to that.
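As a rough illustration of that selection step, here is a minimal Python sketch that picks the genome whose distance is closest to a chosen target. It assumes the tab-delimited output of mash dist (reference ID, query ID, distance, p-value, shared hashes). The function name pick_vcfref, the file names in the usage note, and the choice to skip exact-zero distances are hypothetical, not something we have settled on.

```python
def pick_vcfref(mash_dist_lines, target=0.005):
    """Return (query_id, distance) for the genome closest to `target`.

    Each line is assumed to be tab-delimited `mash dist` output:
    reference-ID, query-ID, distance, p-value, shared-hashes.
    """
    best = None
    for line in mash_dist_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, dist = fields[1], float(fields[2])
        if dist == 0.0:
            continue  # assumption: an identical genome is not a useful vcfref
        gap = abs(dist - target)
        if best is None or gap < best[0]:
            best = (gap, query, dist)
    if best is None:
        raise ValueError("no candidate genomes found")
    return best[1], best[2]
```

For example, given distances of 0.0001, 0.0049 and 0.02 for three made-up genomes a.fna, b.fna and c.fna, a target of 0.005 selects b.fna.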
Generating truth variants, targets, and mask
With a vcfref selected we now need to figure out what variants exist between the two genomes, what regions we want to target, and what regions we want to mask.
A very simple method here would be to just use dnadiff between the two genomes and use the differences as the true variants. However, this will possibly miss variants, or even produce false positives. The way Martin approached this in varifier was to use dnadiff and minimap2 to produce two separate sets of variants. He then makes probes out of these and maps them back to the reference, requiring perfect matching. If they don't perfectly match, he discards the variant.
This approach for truth variant generation is my preference; however, we need to take this a step or two further. Those false positive variants whose probes don't align should be output as a type of mask - i.e., we should not evaluate variants at these positions because they obviously have problems. We additionally need to identify which regions of the vcfref we allow variants in (targets). To elaborate, the vcfref is a different strain and so there will be parts of its genome which do not align with the reference assembly. We don't want to assess variants in these regions, just the regions that align. We can take this a step even further and remove regions from the targets that do not have depth 1 when aligning the vcfref and reference assembly, as these are either repetitive regions, highly diverged regions, or regions that exist in one genome and not the other.
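To make the probe idea concrete, here is a small Python sketch of the keep-or-mask decision. It is not how varifier is implemented: varifier maps probes with an aligner, whereas this uses an exact substring search as a stand-in for "perfect matching", ignores reverse-complement matches, and searches a sequence that is assumed to carry the ALT alleles. The names make_probe and split_truth_and_mask, the 0-based coordinates, and the flank length are all hypothetical.

```python
def make_probe(ref_seq, pos, ref_allele, alt_allele, flank=7):
    """Build an ALT-allele probe: left flank + ALT + right flank (0-based pos)."""
    end = pos + len(ref_allele)
    if ref_seq[pos:end] != ref_allele:
        raise ValueError(f"REF allele mismatch at position {pos}")
    left = ref_seq[max(0, pos - flank):pos]
    right = ref_seq[end:end + flank]
    return left + alt_allele + right


def split_truth_and_mask(ref_seq, donor_seq, variants, flank=7):
    """Split variants into truth calls and masked intervals.

    A variant is kept only if its probe occurs exactly in `donor_seq`;
    otherwise the REF span is returned as a half-open (start, end) mask.
    """
    truth, mask = [], []
    for pos, ref_allele, alt_allele in variants:
        probe = make_probe(ref_seq, pos, ref_allele, alt_allele, flank)
        if probe in donor_seq:
            truth.append((pos, ref_allele, alt_allele))
        else:
            mask.append((pos, pos + len(ref_allele)))
    return truth, mask
```

The mask intervals are what we would later exclude from evaluation, rather than silently dropping those positions.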
A way of identifying these target regions would be to align the two genomes using asm5 (or similar), pipe the alignment into samtools depth, and keep only positions with 1 in column three. It is worth first counting how many positions occur at each depth in the alignment, and then extracting the depth-1 positions to a bcftools-compatible targets file (the format is described in the bcftools docs).
The extraction step generates the tab-delimited (default) file used by bcftools. We probably should convert this to BED though for more versatility.
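The depth counting and the BED conversion could look something like the Python sketch below. It assumes the three-column samtools depth output (contig, 1-based position, depth), sorted by position within each contig. The function names depth_histogram and depth1_to_bed are hypothetical, and this is not the command originally used here.

```python
from collections import Counter


def depth_histogram(depth_lines):
    """Count how many positions are reported at each depth."""
    counts = Counter()
    for line in depth_lines:
        _chrom, _pos, depth = line.split()[:3]
        counts[int(depth)] += 1
    return counts


def depth1_to_bed(depth_lines):
    """Merge consecutive depth-1 positions into BED intervals.

    Input positions are 1-based; the returned (chrom, start, end)
    tuples are 0-based and half-open, as in BED.
    """
    intervals = []
    for line in depth_lines:
        chrom, pos, depth = line.split()[:3]
        if int(depth) != 1:
            continue
        start = int(pos) - 1
        if intervals and intervals[-1][0] == chrom and intervals[-1][2] == start:
            # Extend the previous interval by one base.
            intervals[-1] = (chrom, intervals[-1][1], start + 1)
        else:
            intervals.append((chrom, start, start + 1))
    return intervals
```

Emitting BED directly avoids a separate conversion from the 1-based tab-delimited form.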
We can then subtract any masked regions from this file, or keep the two separate and use bcftools filter to keep targets and remove masked regions.
Reasons for going with Option 3
Here we document our reasoning for deciding to go with the hybrid method for truthset generation.
Initially, we had wanted to take a reference from another strain (VCFREF), align our sample's assembly (REF) to it and then take the set of variants between them as the truthset when calling variants with respect to VCFREF.
To do this, we align VCFREF and REF using both dnadiff and minimap2 with the intention of taking the variants in common between the two. We use syri to assess the alignments and pull out the variants. This has the added advantage of also identifying structural variation between the two genomes, and facilitates visualisation of the alignment of the genomes.
Here is how the alignment and syri were run:
The first issue that arises is the disparity in the number of variants between the two alignment methods.
This is the syri summary of the dnadiff alignment
And here is the minimap2 summary
Of particular concern is that dnadiff discovers an order of magnitude more SNPs and indels than minimap2.
The other concern that comes out of this is the difference between the two alignments from a structural perspective, e.g., dnadiff has 49 translocations compared to minimap2's 3. These become apparent when we visualise the two alignments with plotsr:
dnadiff alignment
minimap2 alignment
Of particular concern from these plots is that the start and end of chromosome 1 on vcfref have noticeably different alignments from dnadiff and minimap2.
So, in the end, we elect to go with taking the union of the variants from dnadiff and minimap2 and applying them to REF to create MUTREF. We will then call variants with respect to MUTREF for the analysis.
I will update this issue with exactly how this process is done.
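Until that update, here is a hedged Python sketch of what "take the union and apply it to REF" could mean in the simplest case. It is an illustration under stated assumptions, not the actual pipeline: variants are (0-based position, REF allele, ALT allele) tuples on a single contig, and where two variants overlap the later one is simply skipped, a policy chosen only for this example. The names union_variants and apply_variants are hypothetical.

```python
def union_variants(set_a, set_b):
    """Union two variant lists, dropping any variant that overlaps a kept one."""
    merged = sorted(set(set_a) | set(set_b))
    kept = []
    last_end = 0  # end (exclusive) of the last kept variant's REF span
    for pos, ref_allele, alt_allele in merged:
        if pos < last_end:
            continue  # assumption: skip variants overlapping an earlier one
        kept.append((pos, ref_allele, alt_allele))
        last_end = pos + len(ref_allele)
    return kept


def apply_variants(ref_seq, variants):
    """Apply sorted, non-overlapping variants to `ref_seq` to build MUTREF."""
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in variants:
        end = pos + len(ref_allele)
        if ref_seq[pos:end] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(ref_seq[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt_allele)
        cursor = end
    pieces.append(ref_seq[cursor:])
    return "".join(pieces)
```

Because the applied variants are known exactly, the list returned by union_variants doubles as the truthset for MUTREF.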