kausaltech / reina-model

Agent-based simulation model for COVID-19 spread in society and patient outcomes
https://reina.kausal.tech/
GNU Affero General Public License v3.0

POLYMOD data #24

Open mikkokotila opened 4 years ago

mikkokotila commented 4 years ago

I had a look at it, and it seems that the values are very small and not at all reflective of the number of actual contacts an average person has. So it would be really important not to treat it as a source of truth for absolute values, but only as a comparison across ages/countries, and even then very carefully. The total sample size was 10k (1k for Finland), which is impossible to meaningfully stratify to reflect what's actually happening in a larger population. Also, it is a survey based on self-disclosure of a single day. In honest research language, "garbage data".

juyrjola commented 4 years ago

For Finland there are about 11k contact entries in the data and 1006 participants. This should be enough to get a rough idea how contacts are distributed. I would hesitate to label it garbage quite so quickly, but maybe @jampekka can comment more on this?

mikkokotila commented 4 years ago

> For Finland there are about 11k contact entries in the data and 1006 participants. This should be enough to get a rough idea how contacts are distributed. I would hesitate to label it garbage quite so quickly, but maybe @jampekka can comment more on this?

By those criteria, it can definitely be concluded that this is a very poor quality data source. This can also be observed visually from the disconnect between the data and what we can demonstrate to be true.

It's not really a matter of opinion whether n=1,000 is meaningful for explaining something as complex and personal as contacts with other people in N=5,500,000. It's impossible. That said, the data might be useful as a comparison between countries and between age groups, particularly between countries. It's not so much that the data can't be useful downstream, but that we should know there is an important quality/representation problem upstream.

juyrjola commented 4 years ago

Do you have alternative data sources in mind for contacts? Could we enrich this data somehow?

mikkokotila commented 4 years ago

> Do you have alternative data sources in mind for contacts? Could we enrich this data somehow?

I think the question is whether, for the sake of our model, this is better than having nothing, and I think it definitely is. We just have to know that there is an important data integrity issue :)

The solution going forward seems to be an application that people voluntarily install, where their age, municipality, type of dwelling (house etc.), and other factors are keyed in by those who have the application. Because the POLYMOD research establishes that this kind of data is important, and we know it here as well, it seems safe to say that Finland desperately needs such data, and such an application must be developed in collaboration with the authorities. I think that might happen naturally in the coming months. I guess even 100k people using such an application would give Finland a very important advantage in modeling these things. I will have a discussion with Telia's relevant people to find out if they might be interested in doing something like this with the data they already have (because they have demographics and location for their entire customer base).

For our purpose, I think it makes a lot of sense to use the POLYMOD data because it gives additional credibility to our approach, and it also makes intuitive sense in terms of the value it adds compared to how we are doing it now. We just have to know that there is an issue.

jampekka commented 4 years ago

There seems to be a lot of misunderstanding about the relationship between sample sizes and estimator quality. In general, a sample of a thousand usually yields very good estimates of distributions, especially for continuous or ordinal values. The population size doesn't really matter when it is something huge like the population of a country.

There's some point in the dataset being a bit small for stratification, but stratification is probably quite a bad idea anyway. E.g. ages lend themselves well to regression, which makes much more efficient use of the data.

In general, the sampling bias is a lot bigger problem than sample size.

There are plenty of potential problems in applying the POLYMOD to epidemic models, but the sampling size is not a major one. In honest research language, calling the POLYMOD garbage due to its sample size is garbage.

mikkokotila commented 4 years ago

> There seems to be a lot of misunderstanding about the relationship between sample sizes and estimator quality.

The thread was initiated on the premise that the data, in terms of absolute values, can be verified to have an issue: it poorly reflects what is intuitively and demonstrably happening in the world. That's the only thing we should be concerned with; the meta-discussion on sampling is merely speculation about what might cause the poor quality. To avoid doubt:

> There's some point in the dataset being a bit small for stratification, but stratification is probably quite a bad idea anyway. E.g., ages lend themselves well to regression, which is a lot more efficient with data.

Unfortunately, that's only true if the dataset size warrants it, which is not the case here. Generally, when we speak about data quality, these things do matter. Given the chance, all else being equal, every researcher will choose a bigger sample, more structure, and more representation.

Consider the two statements:

a) "let's structure the dataset so that it is well representative of the population"
b) "let's rely on statistical ideas to overcome shortcomings in the way the data represents the population."

Researchers tend to think and work based on the latter because representative data is historically more of a rarity than a norm.

> In general, the sampling bias is a lot bigger problem than sample size.

Yes, well-planned and well-executed data collection tends to be a much bigger factor in good research than the total footprint of the data collection.

> There are plenty of potential problems in applying the POLYMOD to epidemic models, but the sampling size is not a major one. In honest research language, calling the POLYMOD garbage due to its sample size is garbage.

When you observe quality issues in the data, you can't dismiss them by telling stories. In terms of its absolute values, it well qualifies as "garbage". I understand that it can be a provocative word, so a more polite way would be to say "of poor quality".

It's ok to have issues. To a large extent, research is about being familiar with caveats, having a good grasp of how they might affect outcomes and limit the work, and then clearly communicating those as part of the research. That is kind of a key point.

jampekka commented 4 years ago

I think there may be some disconnect between what the POLYMOD was gathered for and what someone might want from a dataset. If one wants to "model human life" in general, it is surely a very coarse picture. If OTOH one wants to get an estimate of how many "major interaction" contacts there are in different kinds of places and between different age groups, it is quite sufficient.

> The thread is initiated on the premise that the data, in terms of absolute values can be verified to have an issue. It poorly reflects what is intuitively and demonstrably happening in the world.

I have a different intuition and haven't seen such demonstration. "A contact" in the POLYMOD was instructed as: "Participants were instructed to record contacted individuals only once in the diary. A contact was defined as either skin-to-skin contact such as a kiss or handshake (a physical contact), or a two-way conversation with three or more words in the physical presence of another person but no skin-to-skin contact (a nonphysical contact)."

For these kinds of contacts I don't find averages of around 10 to be very unintuitive.

> Researchers tend to think and work based on the latter because representative data is historically more of a rarity than a norm.

What is "representative" depends on what the data is used for. The use of statistical methods is also not (only) for overcoming shortcomings, but to get generalizable descriptions of aspects of complex phenomena.

> When you observe quality issues in the data, you can't ignore it by telling stories. In terms of its absolute values, it well qualifies for "garbage". I understand that it can be a provocative word, so a more polite way would be to say "of poor quality".

I'm still not sure what these quality issues are. I don't see any clear indication that the POLYMOD is poorly measuring what it set out to measure. Just because this measure may not be what you'd want to measure doesn't make the data poor quality.

mikkokotila commented 4 years ago

> I think there may be some disconnect in what the POLYMOD was gathered for and what someone might want from a dataset. If one wants to "model human life" in general, it is surely a very coarse picture. If OTOH one wants to get an estimate of how many "major interaction" contacts there are in different kinds of places and between different age groups, it is quite sufficient.

You can't establish sufficiency by saying "it is quite sufficient". You can try to establish it, if you like, by directly overcoming the challenges that have been posited.

> I have a different intuition and haven't seen such demonstration. "A contact" in the POLYMOD was instructed as: "Participants were instructed to record contacted individuals only once in the diary. A contact was defined as either skin-to-skin contact such as a kiss or handshake (a physical contact), or a two-way conversation with three or more words in the physical presence of another person but no skin-to-skin contact (a nonphysical contact)."

This definition of contact partially explains why the data has issues. It basically shows why it can't be considered good-quality data for "actual contacts".

For example, while you might not perceive yourself as having many direct two-way conversations of three or more words, you continuously find yourself in such situations throughout the day: meetings, buying something, public transport, and so on.

> For these kinds of contacts I don't find averages of around 10 to be very unintuitive.

Yes, I agree, for the kind of contacts as per the POLYMOD description, 10 might be high as a year-round average for the whole population. For the POLYMOD definition of contact, the distribution does not look too bad.

[histogram: daily_contact_data]

Note the premise of this thread from the opening statement "I had a look at it, and it seems that the values are very small and not at all reflective of the number of actual contacts an average person has." This debate is about "actual contacts" and how well this data represents actual contacts.

> What is "representative" depends on what the data is used for. The use of statistical methods is also not (only) for overcoming shortcomings,

It seems to me that you are trying to deny the causal relationship that results have with data quality. One uses statistical methods to overcome issues in data quality only because one has to, not because they can replace data quality. Data quality issues are "overcome" by awareness and communication.

> but to get generalizable descriptions of aspects of complex phenomena.

Yes, that's what statistical methods are actually for: making the best of the data that you actually have.


I would be delighted if I were wrong and this dataset is indeed a good source for actual contacts. As it stands, it is in fact very useful for many things, and the deliberation that led to the dataset is very compelling. This type of sampling is very costly, which explains the limitations. The key point is that in research, doubts are more important than beliefs, and communicating shortcomings is as important as communicating findings.

mikkokotila commented 4 years ago

Just FYI, the original post was prompted by seeing this graphic shared by THL:

[screenshot from 2020-04-22: graphic shared by THL]

When you run the tallies for each column, you end up with a very small number.

Also note, the histogram I shared in the above comment uses the frequency field to calculate how many times the same contact will happen in a year, and then divides that by 365 to get a daily figure. Interestingly, this gives a completely different result than just counting the number of contacts in a day directly:

[histogram: daily contacts computed from the frequency field]

Which makes sense; it's a much harder problem cognitively to say how often something will happen, compared to just saying it happened once.
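The frequency-based calculation described above can be sketched roughly as follows. The frequency labels and their per-year mappings here are hypothetical stand-ins; the actual coding in the POLYMOD data differs:

```python
# Hypothetical mapping from POLYMOD-style frequency labels to
# approximate occurrences per year (illustrative values, not the
# dataset's real coding).
PER_YEAR = {
    "daily": 365,
    "weekly": 52,
    "monthly": 12,
    "few_times_a_year": 3,
    "first_time": 1,
}

def daily_contacts_from_frequency(frequencies):
    """Estimate a participant's average daily contacts from the
    frequency labels of their recorded contacts: convert each label
    to a yearly count, sum, and divide by 365."""
    yearly = sum(PER_YEAR[f] for f in frequencies)
    return yearly / 365.0

# A participant who recorded two daily contacts and one weekly one:
estimate = daily_contacts_from_frequency(["daily", "daily", "weekly"])
# -> (365 + 365 + 52) / 365, roughly 2.14 contacts per day
```

Counting the diary entries directly would instead give this participant 3 contacts for the day, which illustrates why the two calculations diverge.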

jampekka commented 4 years ago

I'm not sure what calculation you used in these plots. The latter histogram looks very close to just computing the number of contacts per participant over the whole POLYMOD dataset.

The total tally of the tables will be (approximately) equal to the mean lines here. Due to how POLYMOD was designed (only one day per participant), this can be used directly as an estimate of average daily contacts. Selecting out only certain frequencies will cause underestimation.

The frequency data could perhaps be used to improve the estimates somewhat, but more importantly it could be used as an indicator of how clustered the connections are.

[plot: polymod]

jampekka commented 4 years ago

I don't know where that contact matrix is from, but it is probably for physical contacts only. Including non-physical contacts obviously makes the numbers higher, and brings the overall means into alignment.

mikkokotila commented 4 years ago

After going through many options, I believe we get the most reasonable result simply by randomly picking an actual record based on age (or multiple records from the age segment of the population) and then using that to establish the daily contacts of the given agent(s). Any addition to this seems to simply introduce more bias without adding anything of value.

I think randomness, if we want it, is better applied "one-shot" in a way that is easy to understand and control.

The simplest approach is obviously also the most performant approach for meeting the needs of an agent-based system. With unoptimized Python/NumPy, daily contacts for 1 million people can be computed in about half a second.

The whole approach can be summarized in three steps:

I've intentionally kept the age groups as broad as possible at the moment. We could of course later make it 5-year buckets, or any other grouping where we can readily get national distribution data.

I've created a version-controlled data repository that gives us the transparency and trust we need as we apply ETL to it and when this data is used as an input for actual predictions.
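A minimal sketch of the record-sampling idea described above, with toy stand-in arrays in place of the real POLYMOD records. The function name and the age-tolerance parameter are illustrative assumptions, not from the codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the survey records: per-participant age and
# daily contact count (in reality these come from the POLYMOD data).
participant_age = np.array([5, 12, 25, 31, 44, 58, 71])
participant_contacts = np.array([14, 18, 12, 10, 9, 7, 4])

def sample_daily_contacts(agent_ages, age_tolerance=5):
    """For each agent, pick a random participant record whose age is
    within age_tolerance years, and use its contact count directly."""
    out = np.empty(len(agent_ages), dtype=int)
    for i, age in enumerate(agent_ages):
        candidates = np.flatnonzero(np.abs(participant_age - age) <= age_tolerance)
        if len(candidates) == 0:
            # No record close enough: fall back to the nearest record.
            candidates = np.array([np.abs(participant_age - age).argmin()])
        out[i] = participant_contacts[rng.choice(candidates)]
    return out

contacts = sample_daily_contacts(np.array([10, 30, 65]))
```

The per-agent Python loop here is for clarity; a vectorized lookup over pre-sorted age buckets would be needed to reach the speeds mentioned above.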

juyrjola commented 4 years ago

Just to let you know, I've just committed the code to take the contact age and place distribution from POLYMOD into use. It was a bit of a hassle to make it performant, but in the end I don't think the overall speed took a big hit.

The algorithm works like this:

  1. Input data is the average number of contacts for each contact age group and each place per participant age group. This original data is saved in the ContactMatrix class.
  2. We calculate the overall average number of contacts for each age (ContactMatrix.nr_contacts_by_age) and the relative, cumulative probabilities for each type of contact for that age (ContactMatrix.p_by_age). The last contact type entry for each age should have the cumulative probability of 1.0.
  3. When generating the daily contacts for an agent, we first pick the number of contacts (using a lognormal distribution based on nr_contacts_by_age).
  4. For each contact, we pick a uniform random number between 0 and 1, and iterate over the list of contact types based on the agent's age. This can be visualized as each contact type occupying a segment of a line stretching from 0 to 1.
  5. We maintain a list of agents ordered by age and indices to the start of each of the age blocks. We choose a random agent from the age block as the contact and assign the contact place according to step 4.
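A rough sketch of steps 2-4 for a single participant age, using toy numbers in place of the real ContactMatrix data. The type weights and the lognormal sigma are illustrative assumptions, not values from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one participant age: mean daily contacts, and
# weights for each "contact type" (contact age group x place).
nr_contacts = 9.0
type_weights = np.array([3.0, 2.0, 2.5, 1.5])

# Step 2: relative, cumulative probabilities per contact type.
# The last entry is 1.0 by construction.
p_cumulative = np.cumsum(type_weights) / type_weights.sum()

def generate_daily_contacts(mean_contacts, sigma=0.5):
    """Steps 3-4: draw the number of contacts from a lognormal whose
    mean equals mean_contacts, then assign each contact a type by
    inverting the cumulative probabilities with a uniform draw."""
    mu = np.log(mean_contacts) - sigma**2 / 2  # lognormal mean correction
    n = int(round(rng.lognormal(mu, sigma)))
    # Each uniform draw lands in one contact-type segment of [0, 1].
    return np.searchsorted(p_cumulative, rng.random(n))

types = generate_daily_contacts(nr_contacts)
```

Step 5 would then map each drawn contact type back to an age block of agents and pick a uniform random agent from that block.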
juyrjola commented 4 years ago

Using this more refined contact matrix changes the R value and the epidemic spread in several ways. The new code is live at: https://reinatest.kausal.tech/

juyrjola commented 4 years ago

Thanks to @jampekka for producing the smoothed contact matrix data!

mikkokotila commented 4 years ago

Great work guys :)

Can we add the controls to the settings next?

mikkokotila commented 4 years ago

> Thanks to @jampekka for producing the smoothed contact matrix data!

What do you mean by "smoothed"? Could you share a link to the data and the ETL scripts?

jampekka commented 4 years ago

The resulting matrix is smoothed by Gaussian convolution (the logic being that interactions close by in the matrix should be correlated) to try to improve the estimate (reduce noise caused by the sampling). Not ideal, but probably better than no smoothing. The code is here: https://github.com/jampekka/contactest It uses data from the POLYMOD Zenodo repo: https://zenodo.org/record/1059920
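A minimal sketch of that smoothing step, assuming SciPy is available and using a toy noisy matrix in place of the contact matrix estimated from POLYMOD. The sigma value is an illustrative choice, not necessarily what the linked code uses:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)

# Toy noisy contact matrix (rows/cols = age groups); the real input
# is the matrix estimated from the POLYMOD data.
raw = rng.poisson(5.0, size=(15, 15)).astype(float)

# Gaussian convolution pools nearby age-group pairs, reducing the
# sampling noise of sparsely populated bins at the cost of some
# blurring. sigma controls the smoothing radius.
smoothed = gaussian_filter(raw, sigma=1.0)
```

Because the filter averages each bin with its neighbours, bins that happened to get zero samples borrow mass from adjacent age-group pairs instead of staying at exactly zero.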

mikkokotila commented 4 years ago

> logic being that interactions close by in the matrix should be correlated

Do you mean that if person-1 and person-2 both have a high number of contacts, they are more likely to be connected with each other?

> reduce noise caused by the sampling

Which noise and/or noise caused by what specifically in the sampling?

> but probably better than with no smoothing

What is the rationale suggesting such smoothing is better than no smoothing?

jampekka commented 4 years ago

> Do you mean that if person-1 and person-2 both have high number of contacts, they are more likely to be connected with each other?

No; it means that if age groups 1 and 3 have a high number of contacts, then age groups 1 and 2, and 2 and 3, probably also have relatively high numbers of contacts. The smoothing is done after computing the contact matrix.

> Which noise and/or noise caused by what specifically in the sampling?

We have a limited number of samples in the age group × age group bins, which increases the noise of the estimator.

> What is the rationale that suggest such smoothing is better than no smoothing?

At least one is that without smoothing there are many age-group pairs which would never have contacts.