positive_rate in owid-covid-data.csv outdated for several countries

czka commented 2 years ago

As of current master c7b9b886e8, there are several countries whose positive_rate field in owid-covid-data.csv seems not to have been updated for some time.

On https://github.com/czka/covid_toll_tool/blob/positive_test_rate_for_report_to_owid/CHARTS.md I've just uploaded charts of the following 2 series, for the 108 countries I'm interested in:

OWID's positive_rate * 100 (dashed orange)
my new_cases_smoothed / new_tests_smoothed * 100 (solid blue)

I'd expect those 2 lines to be identical, or both missing. For most countries they are - with the following exceptions:

lines off significantly: Argentina, Belgium, Colombia, Costa Rica, Mexico, Peru, Spain
lines slightly off: Czechia, Finland, France, Germany, Hong Kong, Liechtenstein, Lithuania, Slovakia, Slovenia, Spain, Switzerland
lines off a bit at some dates: Luxembourg, Palestine
positive_rate missing: Iceland, Lebanon, Qatar
special cases:
Austria - positive_rate missing in 2020 and in the beginning of 2021
Brazil - positive_rate missing completely, maybe due to weird test or case count?
Ecuador - positive_rate missing for the 1st half of 2020, maybe a similar issue as with Brazil?
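
For readers who want to reproduce this kind of check without the charts, the comparison can be sketched roughly as below. This is a minimal illustration, not the code in covid_toll_tool.py: the function name, the 0.5 percentage point tolerance and the sample rows are all hypothetical, and the column names location and date are assumed to match owid-covid-data.csv.

```python
import csv
import io

# Hypothetical excerpt in the shape of owid-covid-data.csv. The values are
# invented for illustration and are not real OWID figures.
SAMPLE = """location,date,new_cases_smoothed,new_tests_smoothed,positive_rate
Armenia,2021-11-01,200,2000,0.1
Armenia,2021-11-02,210,2000,0.105
Argentina,2021-11-01,1000,10000,0.25
Argentina,2021-11-02,,10000,0.24
"""


def mismatched_locations(rows, tolerance_pct=0.5):
    """Return {location: largest gap in percentage points} for gaps above tolerance."""
    worst = {}
    for row in rows:
        try:
            cases = float(row["new_cases_smoothed"])
            tests = float(row["new_tests_smoothed"])
            published = float(row["positive_rate"])
        except ValueError:
            continue  # skip rows where any of the three fields is empty
        if tests <= 0:
            continue  # a rate cannot be derived without tests
        derived_pct = cases / tests * 100
        gap = abs(derived_pct - published * 100)
        if gap > tolerance_pct:
            location = row["location"]
            worst[location] = max(gap, worst.get(location, 0.0))
    return worst


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = mismatched_locations(rows)
print(result)
```

With the real file, passing `csv.DictReader(open("owid-covid-data.csv", newline=""))` instead of the sample rows should give a per-country list of the largest gaps, which is easier to triage than eyeballing 108 charts.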

edomt commented 2 years ago

Hi @czka

@camappel will be able to investigate this further, but the two main reasons should be:

We remove a few positive rates because the data at our disposal wouldn't produce an accurate measure. They're listed in https://github.com/owid/covid-19-data/blob/master/scripts/scripts/testing/testing_data_corrections.R
Some positive rates we collect from the source, rather than calculating them as new_cases_smoothed / new_tests_smoothed * 100. You can find the list in our documentation, or by looking in our country folder for files where Positive rate is already present.

The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".

camappel commented 2 years ago

Hi @czka ,

I'm glad you're using our data in your project! As @edomt explained, in scenarios where the case definition differs significantly from the test definition (resulting in more cases than reported positive tests), we either remove the positive rate completely, or calculate it directly.

We have deliberately removed positive rate estimates for the following countries (reasons are listed in the script):

Austria
Ecuador
Lebanon
Iceland
Qatar
Brazil (in this script for some reason; I'll move it to the correct script).

We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:

Argentina
Belgium
Colombia
Costa Rica
Czechia
Finland
France
Germany
Hong Kong
Lithuania
Liechtenstein
Mexico
Peru
Slovakia
Slovenia
Spain
Switzerland

To get your graphs to match ours for these countries, use the positive rate column from these CSVs.

This only leaves Luxembourg and Palestine, which you say have 'lines off a bit at some dates'; I took a look at your charts and the differences are negligible, but it's still important to determine the cause. Could it be due to a rounding difference? We round to 3 d.p.s (on line 185 of the generate_dataset script). Try rounding your figures to 3 d.p.s and let me know if the discrepancy persists.
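
A quick way to test the rounding hypothesis is sketched below. It assumes that positive_rate is stored as a fraction and that the 3 d.p. rounding is applied to that fraction; the helper name and the sample values are made up for illustration. Checking the gap against half of the last digit avoids having to reproduce the exact rounding mode of the generating script.

```python
def explained_by_rounding(derived, published, decimals=3):
    """True if `published` could be `derived` rounded to `decimals` places."""
    half_step = 0.5 * 10 ** -decimals
    # The small epsilon absorbs floating-point noise at the boundary.
    return abs(derived - published) <= half_step + 1e-12


# Made-up values for illustration only.
print(explained_by_rounding(0.04567, 0.046))  # gap of 0.00033, within half a step
print(explained_by_rounding(0.04567, 0.048))  # gap of 0.00233, beyond half a step
```

Dates on which this returns False would point to a cause other than rounding.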

czka commented 2 years ago

We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:

Argentina
Belgium
Colombia
Costa Rica
Czechia
Finland
France
Germany
Hong Kong
Lithuania
Liechtenstein
Mexico
Peru
Slovakia
Slovenia
Spain
Switzerland

@camappel So, if positive_rate = new_cases_smoothed / new_tests_smoothed, how come my new_cases_smoothed / new_tests_smoothed * 100 line and the positive_rate * 100 line don't match for these 17 countries, although they do match for the majority of other countries?

Please check my CHARTS.md. The 2 lines, although supposed to match closely, are clearly off for Argentina, Belgium, Colombia etc., while they match perfectly for the remaining majority of countries (e.g. Armenia, Australia, Bolivia, Bulgaria and several dozen others).

I swear that all those 3*108 charts in my CHARTS.md were generated from a fresh owid-covid-data.csv in a uniform manner - namely using my covid_toll_tool.py, function plot_weekly.

If new_cases_smoothed / new_tests_smoothed * 100 and positive_rate * 100 match closely for around 80 of 108 countries, but they don't match for 17 countries although they should, I suppose it's fair to assume there's something fishy going on, right?

I appreciate that you have tried to address all my concerns in one batch. However, to have clear communication, let's get these 17 countries sorted out first.

edomt commented 2 years ago

Hi @czka

What @camappel meant to say is that for these 17 countries, the positive rate is recalculated on the basis of the positive tests reported by the national source, rather than the cases reported by Johns Hopkins University. Therefore, new_cases_smoothed is not the right variable to compare with. You may have missed this paragraph in my initial reply above:

Some positive rates we collect from the source, rather than calculating them as new_cases_smoothed / new_tests_smoothed * 100. You can find the list in our documentation, or by looking in our country folder for files where Positive rate is already present.

The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".

If you'd like to know exactly how each of these positive rates is calculated, you can look inside each country script in the automations folder. Two examples:

Argentina: the positive rate is calculated on line 27.
France: the positive rate is imported from a separate file where it is precalculated by the French government.
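
To make the distinction concrete, here is a small sketch with invented numbers for one day in a hypothetical country. It is not taken from any country script; the variable names are illustrative only. It shows how the same test count produces two different rates depending on whether the numerator is JHU cases or the positive tests reported by the national source.

```python
def positive_rate(positives, tests):
    """Share of tests that were positive, or None when tests is zero or missing."""
    if not tests:
        return None
    return positives / tests


# Invented numbers for a single day in one hypothetical country.
jhu_new_cases_smoothed = 1200.0   # cases as aggregated by JHU
source_positive_tests = 900.0     # positive tests reported by the national source
new_tests_smoothed = 10000.0      # tests in the OWID testing series

derived = positive_rate(jhu_new_cases_smoothed, new_tests_smoothed)
from_source = positive_rate(source_positive_tests, new_tests_smoothed)

print(f"JHU cases / tests:      {derived:.3f}")
print(f"positive tests / tests: {from_source:.3f}")
```

When the case definition and the test definition differ like this, the two lines on a chart will not overlap even though both are computed correctly.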

On a more "meta" note, we greatly appreciate user feedback as it ensures our work is checked by different pairs of eyes, and corrections can be made when necessary. This being said:

Feedback is more useful for us when it's targeted at a specific file or country, rather than at many files at once. Your previous issue #2153 also included a list of many countries and many time periods, asking our team to "verify if really neither of these could be due to OWID's processing rather than any other issues". This type of wide-ranging report is less useful to us, as we simply don't have the resources to tell a member of our team to stop their work to check so many countries, without more precise information.
While I appreciate your concern that the data could be wrong, I believe that expressions such as "something fishy" or "let's get this sorted out first" aren't useful to this discussion.

czka commented 2 years ago

@edomt Your clarification on how positive rate is calculated for these 17 countries helps. Thank you. The information in https://github.com/owid/covid-19-data/tree/master/public/data/README.md on how positive_rate is derived doesn't cover this, however. Based on that information, I assumed it was derived in an identical fashion for all countries. It might help if that README was extended accordingly. That would probably have spared you my annoyingly exhaustive report ;).

As to your "meta note":

I put everything together in a single ticket rather than flooding you with a dozen of them. Sorry about the hassle, but I thought it would be more efficient to have it all in one place. I'll think twice next time.
I used those words with no ill intent. When I wrote "let's get this sorted out 1st" - that's all I meant. Namely, not to get distracted by the minor discrepancies while we are (or at least - I am) trying to understand the biggest one.

The final conclusion for me is that positive_rate means a different thing for different countries, although the field is named the same for all of them. Confusing as it is, now I get it. Or do I?...

owid / covid-19-data

positive_rate in owid-covid-data.csv outdated for several countries #2333