owid / covid-19-data

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
https://ourworldindata.org/coronavirus

Sum of new_cases is not equal to total_cases for Asia #1272

Closed vyaduvanshi closed 3 years ago

vyaduvanshi commented 3 years ago
```python
>>> df[df.continent == 'Asia'].new_cases.sum()
40623659.0
>>> df[df.iso_code == 'OWID_ASI'].total_cases.iloc[-1]
41447440.0
```

As is visible, a big chunk (823,781) is missing. What's the cause for this discrepancy?

lucasrodes commented 3 years ago

Hi @vyaduvanshi, Thanks for reporting.

Indeed there is a substantial difference when we compute the following:

```python
>>> df[df.continent == 'Asia'].new_cases.sum()
41666800.0
>>> df[df.location == 'Asia'].total_cases.iloc[-1]
42490581.0
```

(The numbers differ from yours because they were computed on a later date.)

I debugged this for a while (details below). I constructed a DataFrame with some debugging metrics per country, to see where the difference was coming from.

```python
# Build df for Asian territories
df_asia = df.loc[
    df.continent == 'Asia', ["date", "location", "new_cases", "total_cases"]
].sort_values("date")
df_asia[["new_cases", "total_cases"]] = df_asia[["new_cases", "total_cases"]].fillna(0)

# Compute debugging metrics for each location
locations = df_asia.location.unique()
records = []
for c in locations:
    print(c, end=", ")
    df_c = df_asia.loc[df_asia.location == c]
    # Correction term: the first non-zero total_cases value, i.e. the
    # first-day count that may be missing from new_cases
    first_total = df_c[df_c.total_cases != 0].total_cases.min()
    correction = df_c.total_cases.apply(lambda a: min(a, first_total))
    df_c = df_c.assign(
        debug=df_c.new_cases.cumsum() + correction,
        debug_diff=df_c.total_cases - df_c.new_cases.cumsum() - correction,
        debug2=df_c.new_cases.cumsum(),
        debug_diff2=df_c.total_cases - df_c.new_cases.cumsum(),
    )
    records.append({
        "location": c,
        "debug_diff": df_c.debug_diff2.sum(),
        "debug_diff_with_correction": df_c.debug_diff.sum(),
    })
df_debug = pd.DataFrame(records)
```

Results were:

```
    location              debug_diff   debug_diff_with_correction
0   Thailand                  1884.0           0.0
1   Taiwan                     471.0           0.0
2   South Korea                471.0           0.0
3   China                   258108.0           0.0
4   Japan                      942.0           0.0
5   Singapore                    0.0        -470.0
6   Vietnam                      0.0        -940.0
7   Nepal                        0.0        -468.0
8   Malaysia                     0.0       -1404.0
9   Cambodia                     0.0        -466.0
10  Sri Lanka                    0.0        -466.0
11  United Arab Emirates         0.0       -1856.0
12  India                        0.0        -463.0
13  Philippines                  0.0        -463.0
14  Hong Kong                    0.0           0.0
15  Iran                         0.0        -886.0
16  Israel                       0.0        -441.0
17  Lebanon                      0.0        -441.0
18  Oman                         0.0        -876.0
19  Bahrain                      0.0        -438.0
20  Iraq                         0.0        -438.0
21  Afghanistan                  0.0        -438.0
22  Kuwait                       0.0        -438.0
23  Pakistan                     0.0        -874.0
24  Georgia                      0.0        -436.0
25  Qatar                        0.0        -433.0
26  Azerbaijan                   0.0       -1296.0
27  Armenia                      0.0        -432.0
28  Indonesia                    0.0        -862.0
29  Saudi Arabia                 0.0        -431.0
30  Bangladesh                   0.0       -1275.0
31  Jordan                       0.0        -430.0
32  Palestine                    0.0       -1712.0
33  Bhutan                       0.0        -427.0
34  Maldives                     0.0       -1700.0
35  Brunei                       0.0        -424.0
36  Mongolia                     0.0        -423.0
37  Turkey               121837300.0   121836878.0
38  Kazakhstan                   0.0       -1680.0
39  Uzbekistan                   0.0        -418.0
40  Kyrgyzstan                   0.0       -1245.0
41  Syria                        0.0        -411.0
42  Timor                        0.0        -411.0
43  Laos                         0.0        -818.0
44  Myanmar                      0.0       -3248.0
45  Yemen                        0.0        -392.0
46  Tajikistan                   0.0       -5580.0
47  Northern Cyprus              0.0           0.0
48  Macao                        0.0           0.0
```

- If `debug_diff` is zero, the check you did should work for that country, i.e. `df[df.location == country].new_cases.sum()` equals `df[df.location == country].total_cases.iloc[-1]`.
- If `debug_diff` is not zero but `debug_diff_with_correction` is, the problem is that the `new_cases` column does not track the country's first value.
- If both `debug_diff` and `debug_diff_with_correction` are non-zero, something else is going on. This only occurs for Turkey.
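To illustrate the correction term with hypothetical numbers: the inner expression caps each `total_cases` value at the first non-zero total, which reconstructs the first-day count that `new_cases` never tracked.

```python
import pandas as pd

# Hypothetical totals: zero before the first case, then 4, 6, 9
df_c = pd.DataFrame({"total_cases": [0, 0, 4, 6, 9]})

# First non-zero total = the first-day count new_cases never tracked
first_total = df_c[df_c.total_cases != 0].total_cases.min()  # 4

# The expression caps every row at that value: 0 before the first
# case, then a constant 4 afterwards (the missing first entry)
correction = df_c.total_cases.apply(lambda a: min(a, first_total))
print(correction.tolist())  # [0, 0, 4, 4, 4]
```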

In general, I found that some countries do not register the first entry for `new_cases` (see Thailand, for instance). This leads to a mismatch if you apply the cumulative sum and compare it with `total_cases`.
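A toy illustration of this (hypothetical numbers, assuming a Thailand-like series whose first `new_cases` entry is missing):

```python
import pandas as pd

# Hypothetical country series: the first day's 4 cases appear in
# total_cases, but new_cases only starts reporting on the second day.
df_c = pd.DataFrame({
    "total_cases": [4, 6, 9, 14],
    "new_cases":   [0, 2, 3, 5],  # first entry missing (should be 4)
})

mismatch = df_c.total_cases.iloc[-1] - df_c.new_cases.cumsum().iloc[-1]
print(mismatch)  # 4, exactly the untracked first-day value
```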

However, most of the mismatch comes from Turkey. Apparently, on 2020-12-10, no entry for `new_cases` was registered. This has its origin here:

https://github.com/owid/covid-19-data/blob/4d1c0f4c19fe491dd85b904a4c7262303e6f33a6/scripts/scripts/jhu.py#L149-L155

I'll investigate whether this bug has been fixed upstream and whether that line is still needed.

My recommendation: if you need to work with cumulative values, use the `total_cases` column instead.
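For example, reading the last reported `total_cases` value is more robust than re-summing `new_cases` (a sketch with hypothetical numbers, in the same layout as the dataset):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the owid-covid-data layout
df = pd.DataFrame({
    "iso_code":    ["OWID_ASI", "OWID_ASI", "IND", "IND"],
    "date":        ["2021-05-01", "2021-05-02", "2021-05-01", "2021-05-02"],
    "total_cases": [100.0, 110.0, 60.0, 66.0],
})

# Take the last reported cumulative value instead of summing deltas
asia_total = (
    df[df.iso_code == "OWID_ASI"].sort_values("date").total_cases.iloc[-1]
)
print(asia_total)  # 110.0
```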

vyaduvanshi commented 3 years ago

@lucasrodes Thanks for the extensive look into this. I will resort to using total_cases for cumulative values.

I wrote some code to check the other similar columns for discrepancies too.

```python
iso_codes_to_check = ['OWID_AFR', 'OWID_ASI', 'OWID_EUR', 'OWID_NAM',
                      'OWID_OCE', 'OWID_SAM', 'OWID_WRL']
continent_list = ['Asia', 'Europe', 'Africa', 'North America',
                  'South America', 'Oceania']
total_features = ['total_cases', 'total_deaths', 'total_vaccinations']
new_features = ['new_cases', 'new_deaths', 'new_vaccinations']

records = []

# Continent and World sums of new_features
for feature in new_features:
    # World sums
    records.append({
        'continent': 'World',
        feature + '_sum': df.groupby('iso_code').sum().loc['OWID_WRL'][feature],
    })
    a = df.groupby('continent')[feature].sum()
    for x in range(len(continent_list)):
        records.append({feature + '_sum': a.iloc[x], 'continent': a.index[x]})

# Continent and World last value of total_features
for iso in iso_codes_to_check:
    for feature in total_features:
        temp_df = df[df.iso_code == iso]
        records.append({
            'continent': temp_df.location.iloc[0],
            feature: temp_df[feature].iloc[-1],
        })

result_df = pd.DataFrame(records).groupby('continent').last()

# Creating diff features (total - sum_of_new)
result_df['diff_cases'] = result_df.iloc[:, 3] - result_df.iloc[:, 0]
result_df['diff_deaths'] = result_df.iloc[:, 4] - result_df.iloc[:, 1]
result_df['diff_vaccinations'] = result_df.iloc[:, 5] - result_df.iloc[:, 2]
result_df.reset_index()
```
continent | new_cases_sum | new_deaths_sum | new_vaccinations_sum | total_cases | total_deaths | total_vaccinations | diff_cases | diff_deaths | diff_vaccinations | continent_duplicate
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Africa | 4616997.0 | 123905.0 | 1.112864e+07 | 4616997.0 | 123905.0 | 1.974450e+07 | 0.0 | 0.0 | 8615860.0 | Africa
Asia | 42171896.0 | 557642.0 | 4.787374e+08 | 42995677.0 | 557659.0 | 6.073388e+08 | 823781.0 | 17.0 | 128601370.0 | Asia
Europe | 45614114.0 | 1033871.0 | 2.156050e+08 | 45614114.0 | 1033871.0 | 2.533579e+08 | 0.0 | 0.0 | 37752902.0 | Europe
North America | 37948751.0 | 855503.0 | 2.792794e+08 | 37948752.0 | 855503.0 | 2.969502e+08 | 1.0 | 0.0 | 17670747.0 | North America
Oceania | 44331.0 | 1061.0 | 2.887893e+06 | 44331.0 | 1061.0 | 2.936723e+06 | 0.0 | 0.0 | 48830.0 | Oceania
South America | 25681088.0 | 697842.0 | 7.720187e+07 | 25681088.0 | 697842.0 | 8.305298e+07 | 0.0 | 0.0 | 5851110.0 | South America
World | 156077898.0 | 3269839.0 | 1.263381e+09 | 156901680.0 | 3269856.0 | 1.263381e+09 | 823782.0 | 17.0 | 5.0 | World


As you can see, there are big discrepancies in the vaccination figures as well (unless I did something terribly wrong): the continents' `total_vaccinations` add up to more than the World's `total_vaccinations`. Worth bringing to your attention.



P.S. I read your code, and I don't quite understand how this line helps: `df_c.total_cases.apply(lambda a: min(a, df_c[df_c.total_cases!=0].total_cases.min()))`

edomt commented 3 years ago

Most of this discrepancy indeed comes from this very large data correction for Turkey in December:

[Chart: coronavirus-data-explorer, showing the Turkey data correction]

The change is so large that it was distorting the 7-day average of new cases not just for Turkey, but also for Asia and even for the World aggregate. So we decided to remove that day's difference from `new_cases`.
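A sketch of the distortion, with hypothetical numbers: a one-off backlog correction dwarfs the surrounding days, so the 7-day rolling mean stays inflated for a full week afterwards.

```python
import pandas as pd

# Hypothetical series: steady 30k cases/day, plus one ~824k backlog
# correction like Turkey's December revision
new_cases = pd.Series([30_000.0] * 14)
new_cases.iloc[7] += 823_781

rolling = new_cases.rolling(7).mean()
print(rolling.iloc[6])   # 30000.0 -- before the correction
print(rolling.iloc[13])  # 147683.0 -- still inflated a week later
```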

For `total_vaccinations`, I don't think there's currently an issue. Our latest update shows: