typhoidgenomics / TyphiNET

TyphiNET online Salmonella Typhi AMR surveillance dashboard.
https://www.typhi.net
GNU General Public License v3.0

Wishlist: set minimum number of genomes to include data in map #5

Closed: katholt closed this issue 2 years ago

katholt commented 3 years ago

Set a threshold (n=10? n=20?) for minimum number of genomes available for a country, in order to calculate and plot the % in the maps. Countries with <n genomes will appear as dark grey = not known.

lcerdeira commented 3 years ago

Currently, the app calculates the percentages from all available values. I believe setting a minimum threshold will not reflect the real %.

katholt commented 3 years ago

When reporting percentages it is very important to know that the total sample size N from which it was calculated is large enough to make the value meaningful. Otherwise our percentages can be quite random and uninformative.

In epidemiology, a proportion is an estimate of the true proportion in the population; the accuracy of this estimate is dependent on the sample size. Indeed we can get a sense of this accuracy by calculating the 95% confidence interval of the proportion.

E.g. If we observe 2 CipR strains in country X, it matters hugely what the total sample size was... if we only have 2 samples from country X, the percentage is 2/2=100%... but 2 samples is not enough evidence to infer that the frequency of CipR in that location is 100%. In fact the 95% confidence interval for the frequency estimate is [34% - 100%].

If we have 200 samples from country X and 200 are CipR, then our estimated proportion of resistance is 200/200=100% and we have a lot of confidence in this estimate because it is based on a lot of data. In this case the 95% confidence interval for the frequency estimate is [98% - 100%].

N=20 is generally considered a minimal sample size for estimating a proportion. E.g. p = 10/20 = 50%, with 95% confidence interval [30-70%]. This is still a very wide interval.
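The comment does not say which method produced these intervals. As a rough check, the Wilson score interval (an assumption on my part, chosen because it gives figures close to the ones quoted) can be sketched as follows. The function name `wilson_interval` and the default `z=1.96` are illustrative, not taken from the TyphiNET code.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    Assumption: the intervals quoted in the thread are Wilson intervals;
    the source does not name the method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    # Centre of the interval is pulled towards 0.5 for small samples.
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# The three examples from the discussion: 2/2, 200/200 and 10/20.
for n, N in [(2, 2), (200, 200), (10, 20)]:
    lo, hi = wilson_interval(n, N)
    print(f"{n}/{N}: {100 * lo:.0f}% - {100 * hi:.0f}%")
```

Under this assumption the lower bounds come out at roughly 34% for 2/2 and 98% for 200/200, and the 10/20 interval is roughly 30% to 70%, in line with the values given above.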

So - for countries with N<20 genomes, we need to report the resistance frequencies as unknown (colour grey); not simply calculate p=n/N.
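A minimal sketch of the rule proposed here, assuming a threshold of 20 genomes. The names `MIN_GENOMES` and `resistance_percent` are hypothetical and do not come from the TyphiNET code base; the map layer would need to render `None` as grey.

```python
MIN_GENOMES = 20  # assumed threshold; the issue also floats n=10


def resistance_percent(resistant, total, min_genomes=MIN_GENOMES):
    """Return the resistance percentage, or None when the sample is too small.

    None stands for "unknown", which the map would colour grey
    instead of plotting p = n/N.
    """
    if total < min_genomes:
        return None
    return 100.0 * resistant / total


print(resistance_percent(2, 2))    # too few genomes, reported as unknown
print(resistance_percent(10, 20))  # enough genomes, percentage is reported
```

Returning `None` rather than 0 keeps "no evidence" distinct from "0% resistance", which is the distinction the grey colour is meant to convey.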

lcerdeira commented 3 years ago

Thanks for clarifying and for detailing the 'N<20 genomes' threshold.