In setting the minimum cluster size (sub-heading in Section 4.3.2), we define the threshold for the number of hits in a cluster relative to the length of reads and local graphs. As these two lengths can differ from each other, we take the minimum of the two. However, we don't use the length of the read, we actually use the expected density of minimizers in a read of that length, which is 2||read||/w. (This comes from Section 3.1 of this paper).
We then account for sequencing errors by multiplying this value by the exp(-ke). Ultimately, this equation is
where l is the minimum k-mer path length of the local graph and s is the sequence (read).
This is where the assumption comes in. We assume that Nanopore reads are always longer than local graphs (expected_number_kmers_in_short_read_sketch is always set to uint_32 max for nanopore data). I appreciate that for E. coli datasets this is probably correct most of the time. But, for Mtb this is certainly not the case. I have plenty of samples with a median read length in the range 0.5-2kbp and lots of local graphs longer than that (by a lot in some cases).
From Rachel's (brilliant) thesis (emphasis mine)
When the minimum size of a cluster is set too low, we have more false positive
local graphs identified as present in the dataset, and also have to handle more noise
downstream when inferring a mosaic sequence and genotyping. When it is set too
high, we have less sensitivity to discover loci that are present.
So, I think for Nanopore datasets with low median read lengths or very long loci, we could be losing sensitivity to discover loci.
A fix for the following issue.
(Taken from https://github.com/rmcolq/pandora/issues/262#issue-806308312)
In setting the minimum cluster size (sub-heading in Section 4.3.2), we define the threshold for the number of hits in a cluster relative to the length of reads and local graphs. As these two lengths can differ from each other, we take the minimum of the two. However, we don't use the length of the read, we actually use the expected density of minimizers in a read of that length, which is
2||read||/w
. (This comes from Section 3.1 of this paper).We then account for sequencing errors by multiplying this value by the
exp(-ke)
. Ultimately, this equation iswhere l is the minimum k-mer path length of the local graph and s is the sequence (read).
In reality, we don't quite do this though. The portion of the code that calculates the number of expected hits (prior to accounting for errors) is https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/utils.cpp#L489-L498
and then later https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/utils.cpp#L223-L227
This is where the assumption comes in. We assume that Nanopore reads are always longer than local graphs (
expected_number_kmers_in_short_read_sketch
is always set to uint_32 max for nanopore data). I appreciate that for E. coli datasets this is probably correct most of the time. But, for Mtb this is certainly not the case. I have plenty of samples with a median read length in the range 0.5-2kbp and lots of local graphs longer than that (by a lot in some cases).From Rachel's (brilliant) thesis (emphasis mine)
So, I think for Nanopore datasets with low median read lengths or very long loci, we could be losing sensitivity to discover loci.
This is backed up by the resulting plots in https://github.com/rmcolq/pandora/issues/262#issuecomment-778109077