iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks source link

Don't assume ONT reads are longer than loci #265

Closed mbhall88 closed 3 years ago

mbhall88 commented 3 years ago

A fix for the following issue.

(Taken from https://github.com/rmcolq/pandora/issues/262#issue-806308312)

In setting the minimum cluster size (sub-heading in Section 4.3.2), we define the threshold for the number of hits in a cluster relative to the length of reads and local graphs. As these two lengths can differ from each other, we take the minimum of the two. However, we don't use the length of the read, we actually use the expected density of minimizers in a read of that length, which is 2||read||/w. (This comes from Section 3.1 of this paper).
We then account for sequencing errors by multiplying this value by the exp(-ke). Ultimately, this equation is

Screenshot from 2021-02-11 20-58-35

where l is the minimum k-mer path length of the local graph and s is the sequence (read).

In reality, we don't quite do this though. The portion of the code that calculates the number of expected hits (prior to accounting for errors) is https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/utils.cpp#L489-L498

and then later https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/utils.cpp#L223-L227

This is where the assumption comes in. We assume that Nanopore reads are always longer than local graphs (expected_number_kmers_in_short_read_sketch is always set to uint_32 max for nanopore data). I appreciate that for E. coli datasets this is probably correct most of the time. But, for Mtb this is certainly not the case. I have plenty of samples with a median read length in the range 0.5-2kbp and lots of local graphs longer than that (by a lot in some cases).

From Rachel's (brilliant) thesis (emphasis mine)

When the minimum size of a cluster is set too low, we have more false positive local graphs identified as present in the dataset, and also have to handle more noise downstream when inferring a mosaic sequence and genotyping. When it is set too high, we have less sensitivity to discover loci that are present.

So, I think for Nanopore datasets with low median read lengths or very long loci, we could be losing sensitivity to discover loci.

This is backed up by the resulting plots in https://github.com/rmcolq/pandora/issues/262#issuecomment-778109077