jriou / covid_adjusted_cfr

MIT License
88 stars 29 forks source link

Wrong way to calculate age distribution #4

Closed bukosabino closed 4 years ago

bukosabino commented 4 years ago

Hi @jriou,

I think there is an error calculating age distribution in your R experiment:

> age_dist = read_excel("data/age_structure.xlsx") %>%
+   filter(country=="Italy") %>%
+   gather("age","n",3:23) %>%
+   mutate(n=as.numeric(n),age2=c(0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80)) %>%
+   group_by(age2) %>%
+   summarise(n=sum(n)) %>%
+   pull(n)

> age_dist = age_dist/sum(age_dist)
> pop_t = sum(c(10.04e6,4.453e6,4.905e6,1.522e6,4.356e6))
> age_dist
[1] 0.08595641 0.09456890 0.10178792 0.11909651 0.15515993 0.15356926 0.12112699 0.09653392 0.07220017
> print(age_dist)
[1] 0.08595641 0.09456890 0.10178792 0.11909651 0.15515993 0.15356926 0.12112699 0.09653392 0.07220017

The problem is related to the format on data/age_structure.xlsx file. If you take a look to the xlsx file:

Italy 60627.291 2442.494 2768.81 2869.745 2863.711 2973.472 3197.654 3417.097 3803.402 4469.335 4937.591 4879.297 4431.191 3779.643 3563.958 3213.162 2639.428 2220.209 1327.127 631.952 181.855 16.158

I think you need to remove the dot of the values on the dataset to get the real age distribution results, because in R, as.numeric doesn't transform these numbers to integers:

> as.numeric(2442.494)
[1] 2442.494

I can share my Python code to do the same:

First version, with the line commented to get the same results:

>>> import os
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_excel(os.path.join("data", "age_structure.xlsx"))
>>> df['country'] = df['country'].str.strip()
>>> df = df[df['country'] == 'Italy']
>>> # df = df.apply(lambda x : x.astype(str).str.replace(r'\.', ''))  # Posible bug in R experiment
>>> df.drop(['country', 'all_classes'], axis=1, inplace=True)
>>> df = df.T
>>> df.rename(columns={df.columns[0]: "values"}, inplace=True)
>>> df['values'] = df['values'].astype(int)
>>> df['age'] = np.array([0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80])
>>> age_dist = df.groupby("age").sum()['values'].values
>>> age_dist = age_dist/sum(age_dist)
>>> print(age_dist)
[0.08594949 0.09456093 0.10178663 0.1191085  0.15517099 0.15358728
 0.12112114 0.09654057 0.07217447]

Second version, using the line to delete the dot and get the real age distribution:

>>> import os
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_excel(os.path.join("data", "age_structure.xlsx"))
>>> df['country'] = df['country'].str.strip()
>>> df = df[df['country'] == 'Italy']
>>> df = df.apply(lambda x : x.astype(str).str.replace(r'\.', ''))  # Posible bug in R experiment
>>> df.drop(['country', 'all_classes'], axis=1, inplace=True)
>>> df = df.T
>>> df.rename(columns={df.columns[0]: "values"}, inplace=True)
>>> df['values'] = df['values'].astype(int)
>>> df['age'] = np.array([0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80])
>>> age_dist = df.groupby("age").sum()['values'].values
>>> age_dist = age_dist/sum(age_dist)
>>> print(age_dist)
[0.04677661 0.09862252 0.10615099 0.1242015  0.16181074 0.16015189
 0.126319   0.10067177 0.07529498]

You can find the error here:

https://github.com/jriou/covid_adjusted_cfr/blob/9e041b8beac0c6a07b626a57e128f128f4e2872c/data/italy/data_management_italy.R#L13 https://github.com/jriou/covid_adjusted_cfr/blob/9e041b8beac0c6a07b626a57e128f128f4e2872c/data/south-korea/data_management_south_korea.R#L19

Let me know if you agree.

Best, Dario

jriou commented 4 years ago

Hi Dario,

Thanks for raising this issue, I'm sorry but in that case I think that you are mistaken.

Since we compute the proportion of population in each age group, dealing with thousands or units doesn't matter. You don't find the same result in your calculation because you made a mistake: by just removing the dot you ignore that in some cases the 0 was ignored if it was the last digit. E.g. in Italy, by removing the dot the value for ages 5-9 goes from 2768.81 to 276881 instead of 2768810.

I think this is the source of the discrepancy but it would be certainly good to check by yourself!

Thanks again for your interest and please do not hesitate to communicate any bug or issue. Best, Julien

bukosabino commented 4 years ago

:O

I didn't see this last digit, and you are right, the proportion should be equal when we are working with thousands or millions.

Best, Dario