Wrong way to calculate age distribution

bukosabino commented 4 years ago

Hi @jriou,

I think there is an error calculating age distribution in your R experiment:

> age_dist = read_excel("data/age_structure.xlsx") %>%
+   filter(country=="Italy") %>%
+   gather("age","n",3:23) %>%
+   mutate(n=as.numeric(n),age2=c(0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80)) %>%
+   group_by(age2) %>%
+   summarise(n=sum(n)) %>%
+   pull(n)

> age_dist = age_dist/sum(age_dist)
> pop_t = sum(c(10.04e6,4.453e6,4.905e6,1.522e6,4.356e6))
> age_dist
[1] 0.08595641 0.09456890 0.10178792 0.11909651 0.15515993 0.15356926 0.12112699 0.09653392 0.07220017
> print(age_dist)
[1] 0.08595641 0.09456890 0.10178792 0.11909651 0.15515993 0.15356926 0.12112699 0.09653392 0.07220017

The problem is related to the format on data/age_structure.xlsx file. If you take a look to the xlsx file:

Italy	60627.291	2442.494	2768.81	2869.745	2863.711	2973.472	3197.654	3417.097	3803.402	4469.335	4937.591	4879.297	4431.191	3779.643	3563.958	3213.162	2639.428	2220.209	1327.127	631.952	181.855	16.158

I think you need to remove the dot of the values on the dataset to get the real age distribution results, because in R, as.numeric doesn't transform these numbers to integers:

> as.numeric(2442.494)
[1] 2442.494

I can share my Python code to do the same:

First version, with the line commented to get the same results:

>>> import os
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_excel(os.path.join("data", "age_structure.xlsx"))
>>> df['country'] = df['country'].str.strip()
>>> df = df[df['country'] == 'Italy']
>>> # df = df.apply(lambda x : x.astype(str).str.replace(r'\.', ''))  # Posible bug in R experiment
>>> df.drop(['country', 'all_classes'], axis=1, inplace=True)
>>> df = df.T
>>> df.rename(columns={df.columns[0]: "values"}, inplace=True)
>>> df['values'] = df['values'].astype(int)
>>> df['age'] = np.array([0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80])
>>> age_dist = df.groupby("age").sum()['values'].values
>>> age_dist = age_dist/sum(age_dist)
>>> print(age_dist)
[0.08594949 0.09456093 0.10178663 0.1191085  0.15517099 0.15358728
 0.12112114 0.09654057 0.07217447]

Second version, using the line to delete the dot and get the real age distribution:

>>> import os
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_excel(os.path.join("data", "age_structure.xlsx"))
>>> df['country'] = df['country'].str.strip()
>>> df = df[df['country'] == 'Italy']
>>> df = df.apply(lambda x : x.astype(str).str.replace(r'\.', ''))  # Posible bug in R experiment
>>> df.drop(['country', 'all_classes'], axis=1, inplace=True)
>>> df = df.T
>>> df.rename(columns={df.columns[0]: "values"}, inplace=True)
>>> df['values'] = df['values'].astype(int)
>>> df['age'] = np.array([0,0,10,10,20,20,30,30,40,40,50,50,60,60,70,70,80,80,80,80,80])
>>> age_dist = df.groupby("age").sum()['values'].values
>>> age_dist = age_dist/sum(age_dist)
>>> print(age_dist)
[0.04677661 0.09862252 0.10615099 0.1242015  0.16181074 0.16015189
 0.126319   0.10067177 0.07529498]

You can find the error here:

https://github.com/jriou/covid_adjusted_cfr/blob/9e041b8beac0c6a07b626a57e128f128f4e2872c/data/italy/data_management_italy.R#L13 https://github.com/jriou/covid_adjusted_cfr/blob/9e041b8beac0c6a07b626a57e128f128f4e2872c/data/south-korea/data_management_south_korea.R#L19

Let me know if you agree.

Best, Dario

jriou commented 4 years ago

Hi Dario,

Thanks for raising this issue, I'm sorry but in that case I think that you are mistaken.

Since we compute the proportion of population in each age group, dealing with thousands or units doesn't matter. You don't find the same result in your calculation because you made a mistake: by just removing the dot you ignore that in some cases the 0 was ignored if it was the last digit. E.g. in Italy, by removing the dot the value for ages 5-9 goes from 2768.81 to 276881 instead of 2768810.

I think this is the source of the discrepancy but it would be certainly good to check by yourself!

Thanks again for your interest and please do not hesitate to communicate any bug or issue. Best, Julien

bukosabino commented 4 years ago

:O

I didn't see this last digit, and you are right, the proportion should be equal when we are working with thousands or millions.

Best, Dario

jriou / covid_adjusted_cfr

Wrong way to calculate age distribution #4