"Count over time" vs "incidence rate"

jakobschumacher commented 5 years ago

I am not too sure, but I feel that the term incidence is used in a wrong way in this package. The way I understand it, at the moment we measure "counts over time" (e.g. 5 cases in week X). But incidence rate is usually measured as count over time over population (e.g. 5 cases per 100 person-days). See this page for a longer explanation. So I feel we would somehow need a way to include the population at risk in some functions. Maybe I am totally on the wrong track; please don't hesitate to point it out if I am thinking in the wrong way.

Example of a cohort with changing populations

One difficult example, in my eyes, would be a constantly changing population, as in centers for foreigners (as in the outbreak that Savina was working on).

# Simulate population with varicella in center for foreigners
set.seed(15)
x = 200

# load name generator package
library(randomNames)

# Create random personal data
data <- data.frame(
                   name = randomNames(x, return.complete.data = TRUE),
                   age = round(runif(x, min = 0, max = 35),0)
)

# Add centers with arrival and leave date
data$center = factor(sample(c("Seestr", "Bizetstr", "Tempelhof"), size = x, replace = TRUE, prob = c(.7, .1, .2)))
data$arrival = sample(16436:16800, size = x, replace = TRUE)
data$leave = data$arrival + sample(1:120, size = x, replace = TRUE)
data$leave <- ifelse(data$leave > 16800, 16800, data$leave) # cap leave dates at the end of the observation period (day 16800)
data$arrival <- as.Date(data$arrival, origin = "1970-01-01")
data$leave <- as.Date(data$leave, origin = "1970-01-01")

# Let's vaccinate some people
data$vaccinated <- sample(c("yes", "no"), size = x, replace = TRUE, prob = c(.3, .7))

# Throw in a few background cases
data$case = ifelse(data$vaccinated == "no", sample(c("case", "noCase"), size = x, replace = TRUE, prob = c(.1, .9)),"noCase") 
data$onset = ifelse(data$case == "case", sample(16436:16800, size = x, replace = TRUE), NA)

# Generate an outbreak in one center
dateOfInfection = as.Date(16690, origin = "1970-01-01")
data$case = ifelse(data$center == "Seestr" &
                   data$arrival < dateOfInfection &
                   data$leave > dateOfInfection &
                   data$vaccinated == "no", 
                 "case", data$case)
data$onset = ifelse(data$center == "Seestr" &
                    data$arrival < dateOfInfection &
                    data$leave > dateOfInfection &
                    data$vaccinated == "no",
                  rnorm(n = x, mean = (dateOfInfection + 14), sd = 3),
                  data$onset)
data$onset <- as.Date(data$onset, origin = "1970-01-01")

Computing incidence rates for the above example

With the simulated outbreak above, the incidence rates can be calculated as shown below. This would best be done in a clever function...

# Computing cumulative incidence and incidence rate by hand

# cumulative incidence over one year per 10 000 people
nrow(data[data$case == "case",] ) * 10000 / nrow(data[data$vaccinated == "no",])
#> [1] 1608.392

# Global incidence rate per 100 people-days
nrow(data[data$case == "case",]) * 100 / as.numeric(sum(data$leave[data$vaccinated == "no"] - data$arrival[data$vaccinated == "no"]))
#> [1] 0.2922119

# Incidence rate per 100 people-days at the center "Seestr"
cut_data <- data[data$vaccinated == "no" & data$center == "Seestr",]
nrow(cut_data[cut_data$case == "case",]) * 100 / as.numeric(sum(cut_data$leave -  cut_data$arrival))
#> [1] 0.4032258

# Incidence rate per 100 people-days at the center "Tempelhof"
cut_data <- data[data$vaccinated == "no" & data$center == "Tempelhof",]
nrow(cut_data[cut_data$case == "case",]) * 100 / as.numeric(sum(cut_data$leave -  cut_data$arrival))
#> [1] 0.1109878

# Incidence rate per 100 people-days for females
# (randomNames codes gender as 0 = male, 1 = female)
cut_data <- data[data$vaccinated == "no" & data$name.gender == 1,]
nrow(cut_data[cut_data$case == "case",]) * 100 / as.numeric(sum(cut_data$leave -  cut_data$arrival))
#> [1] 0.321322

# Incidence rate per 100 people-days for males
cut_data <- data[data$vaccinated == "no" & data$name.gender == 0,]
nrow(cut_data[cut_data$case == "case",]) * 100 / as.numeric(sum(cut_data$leave -  cut_data$arrival))
#> [1] 0.2561184

Created on 2019-02-04 by the reprex package (v0.2.1)
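The repeated subsetting above could be wrapped in a small helper. What follows is only a minimal sketch, not part of the incidence package: `person_time_rate()` is a hypothetical name, and it assumes a data frame with the `case`, `arrival` and `leave` columns used in the simulation. The toy data are made up so that the block runs on its own.

```r
# Hypothetical helper (not part of the incidence package): incidence rate
# per `per` person-days, overall or stratified by one grouping column.
person_time_rate <- function(df, by = NULL, per = 100) {
  rate <- function(d) {
    cases <- sum(d$case == "case")
    # person-time at risk: sum of individual stays, in days
    person_days <- as.numeric(sum(d$leave - d$arrival))
    cases * per / person_days
  }
  if (is.null(by)) {
    return(rate(df))
  }
  # one rate per level of the grouping column
  vapply(split(df, df[[by]]), rate, numeric(1))
}

# Toy data, independent of the simulation above
toy <- data.frame(
  center  = c("A", "A", "B", "B"),
  case    = c("case", "noCase", "noCase", "case"),
  arrival = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-11")),
  leave   = as.Date(c("2015-01-11", "2015-02-10", "2015-01-21", "2015-01-31"))
)

overall_rate <- person_time_rate(toy)                 # 2 cases / 90 person-days
center_rates <- person_time_rate(toy, by = "center")  # A: 1 / 50, B: 1 / 40
```

On the simulated data, the per-center figures would then come from something like `person_time_rate(data[data$vaccinated == "no", ], by = "center")`.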

zkamvar commented 5 years ago

> I am not too sure, but I feel that the term incidence is used in a wrong way in this package. The way I understand it, at the moment we measure "counts over time" (e.g. 5 cases in week X). But incidence rate is usually measured as count over time over population (e.g. 5 cases per 100 person-days).

I think you are correct. Incidence does tend to imply a rate and I believe the correct term to describe what we are doing with this package would be 'incidents per dated events' instead of 'incidence of dated events' (which shows where the initial confusion may have taken place).

> So I feel we would somehow need to include a way to include the population at risk into some functions. Maybe I am totally on the wrong track, please don't hesitate to point out, if I am thinking in the wrong way.

I think this could be a useful addition to the package, perhaps in a function called incidence_rate()? This would also require information about the standing population size, which could be added as a new item in the incidence object called $popsize, which would reflect the population sizes of the groups in the object. For those that want to compute incidence rate, this would be handy when sub-setting data, and for those who simply want incident counts, the popsize can be left as NULL. What do you think?

Thank you for bringing up this issue, @jakobschumacher. It's definitely important to make sure our terminology is correct as we move forward.
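To make the `$popsize` idea concrete, here is a rough sketch under stated assumptions. Neither `incidence_rate()` nor a `$popsize` element is confirmed to exist in the package; both are the hypothetical names proposed above, and a plain list stands in for a real incidence object. Dividing by a standing population gives cases per 1000 people per time interval.

```r
# Hypothetical sketch: incidence_rate() and $popsize are proposals from this
# thread, not an existing API. A plain list stands in for an incidence object.
incidence_rate <- function(x, per = 1000) {
  if (is.null(x$popsize)) {
    stop("no population size available; only counts can be reported")
  }
  # popsize holds one value per group, i.e. per column of x$counts
  sweep(x$counts, 2, x$popsize, "/") * per
}

x <- list(
  dates   = as.Date("2019-01-07") + 7 * 0:2,
  counts  = matrix(c(5, 8, 2, 1, 0, 3), ncol = 2,
                   dimnames = list(NULL, c("north", "south"))),
  popsize = c(north = 2000, south = 500)
)

rates <- incidence_rate(x)  # cases per 1000 people, per week and group
```

Leaving `popsize` as `NULL` keeps the counts-only behaviour and makes the rate computation fail loudly rather than return a misleading number.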

thibautjombart commented 5 years ago

Hi there

Yeah it's funny, and maybe a cultural thing to an extent - modellers tend to use 'incidence' to refer to case counts, hence the original terminology.

I think the suggested addition makes sense. One tricky thing to handle is how $popsize will look with population stratification. We probably need to handle a matrix of integers matching $counts to allow different population sizes in time and across groups, and an intuitive and easy way to populate it when all or part of that info is available.
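One possible way to populate such a matrix from partial information is sketched below. The helper name `expand_popsize()` is hypothetical, and the rules it applies (a single number means one global population, a vector means one size per group held constant over time) are assumptions made for illustration only.

```r
# Hypothetical helper: expand whatever population information is available
# into a matrix with the same shape as the counts matrix.
expand_popsize <- function(counts, popsize) {
  if (is.matrix(popsize)) {
    # full information: sizes may differ in time and across groups
    stopifnot(identical(dim(popsize), dim(counts)))
    return(popsize)
  }
  if (length(popsize) == 1L) {
    # one global population size, constant over time and groups
    popsize <- rep(popsize, ncol(counts))
  }
  stopifnot(length(popsize) == ncol(counts))
  # one size per group, constant over time: repeat it down the rows
  matrix(popsize, nrow = nrow(counts), ncol = ncol(counts), byrow = TRUE,
         dimnames = dimnames(counts))
}

counts <- matrix(c(5, 8, 2, 1, 0, 3), ncol = 2,
                 dimnames = list(NULL, c("north", "south")))

pop_by_group <- expand_popsize(counts, c(2000, 500))
pop_global   <- expand_popsize(counts, 1000)
```

With population and counts in the same shape, a rate is an element-wise division, e.g. `counts / pop_by_group`.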

jakobschumacher commented 5 years ago

Thanks for picking this up. In my "epidemiological" eyes, there are three levels of incidence.

The first level is "counts per time". All we need is a vector of dates, possibly with groups. That's the way the package works now.
The second level is "cumulative incidence", that is, counts per time per population. So we have a vector of dates and a global population count. If we want to compare groups, we would need count and population data stratified by group. One could imagine having counts of influenza cases per region, together with population data for the regions.
The third level is "incidence rate", that is, counts per population-time. So we have the above data but also the specific dates when each individual enters or leaves the population. As above, we would need all of this information for every group that we want to measure. That would be the example given above with the different centers for foreigners. For every individual we know when they entered the center and when they left. There is another complication: some exit dates would be permanent (think of death), others could be temporary (moving out of the center for a while).
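The three levels can be illustrated on a tiny made-up cohort. This is only a sketch: the data are invented, and it assumes that a case stops contributing person-time at onset and that onset always falls before the exit date.

```r
# Toy cohort: one row per person, with entry/exit dates and an optional onset
cohort <- data.frame(
  entry = as.Date(c("2019-01-01", "2019-01-01", "2019-01-15", "2019-02-01")),
  exit  = as.Date(c("2019-01-31", "2019-03-02", "2019-02-14", "2019-03-03")),
  onset = as.Date(c("2019-01-20", NA, "2019-02-10", NA))
)
is_case <- !is.na(cohort$onset)

# Level 1: counts per time (here: cases per calendar month)
counts_per_month <- table(format(cohort$onset, "%Y-%m"))

# Level 2: cumulative incidence = cases / population over the whole period
cumulative_incidence <- sum(is_case) / nrow(cohort)

# Level 3: incidence rate = cases / person-time at risk
# (assumption: time at risk ends at onset for cases, at exit otherwise)
end_of_risk <- cohort$exit
end_of_risk[is_case] <- cohort$onset[is_case]
person_days <- as.numeric(sum(end_of_risk - cohort$entry))
incidence_rate <- sum(is_case) / person_days
```

Each level needs strictly more input than the previous one: dates only, then a population count, then individual entry and exit dates.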

From my perspective, the best way would be to extend the incidence function, giving users the possibility to add further information to the incidence object. If you just give the incidence object counts, then you can only get counts out (status quo). If you give it population data as well, it can also compute cumulative incidence. And if you also feed it the dates on which each individual enters and leaves the groups, then you can compute the incidence rate.

But I have to say that, to me, this sounds very complicated to implement while also keeping it user-friendly.

caijun commented 5 years ago

I am also confused by terms such as incidence vs. prevalence. Moreover, when we fit epidemic models, such as SIR, to epidemic curves, we also need to be careful about whether the epicurve shows incidence (count or rate) or prevalence. I think if the incidence package could figure out a way to clearly distinguish these terms, it would make life easier for the epidemiological community.

As explained on Wikipedia (https://en.wikipedia.org/wiki/Incidence_(epidemiology)), when the population at risk varies with time, a person-time incidence rate should be used. If I understand correctly, this is the third situation that @jakobschumacher discussed.

thibautjombart commented 5 years ago

I think there is a really different use of incidence which needs clarifying in the package. I don't think incidence vs prevalence is that common, so wouldn't put too much emphasis on this, though it could be mentioned in the doc.

jakobschumacher commented 5 years ago

There are indeed some definitions that don't include the denominator "population", but most definitions do (there is an overview here: https://medical-dictionary.thefreedictionary.com/incidence). For me, one credible source is the FEM-Wiki by the ECDC (https://wiki.ecdc.europa.eu/fem/w/wiki/incidence-rate), which includes a denominator. Prevalence would be "counts over population". I see prevalence being used for simple studies, for example when looking at the disease burden of MRSA in nursing homes: each year they go around, test all residents and report, for example, 3 cases of MRSA among 320 residents.

zkamvar commented 5 years ago

As a bit of bookkeeping, Bertrand Sudre shares this concern with Jakob:

> ... the current name can be misleading for a certain number of epidemiologists. The reason is that in epidemiology the term incidence is traditionally associated with a measure of morbidity, the so-called 'Incidence proportion' (or attack rate or risk; for more information see: https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson3/section2.html). The latter comprises the numerator (= count of cases used for a raw epi-curve) but also a denominator representing the population at risk during the selected time interval. At first glance, the target audience reading the package name might believe that the package is dedicated to incidence calculation rather than to the graphical representation of epidemiological curves and some basic modelling utilities. Indeed, the package is presented as being able to compute, handle and visualize time-related count data through epi-curves and additional derived features which are not, strictly speaking, related to a measure of incidence (proportion).

aspina7 commented 4 years ago

> I think there is a really different use of incidence which needs clarifying in the package. I don't think incidence vs prevalence is that common, so wouldn't put too much emphasis on this, though it could be mentioned in the doc.

Agree with Jakob: in the FETP world there is a strict and clear difference between the two!

reconhub / incidence

"Count over time" vs "incidence rate" #102
