owid / covid-19-data

Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
https://ourworldindata.org/coronavirus

positive_rate in owid-covid-data.csv outdated for several countries #2333

Closed by czka 2 years ago

czka commented 2 years ago

As of current master (c7b9b886e8), several countries appear to have a positive_rate field in owid-covid-data.csv that has not been updated for some time.

On https://github.com/czka/covid_toll_tool/blob/positive_test_rate_for_report_to_owid/CHARTS.md I've just uploaded charts of the following 2 series, for the 108 countries I'm interested in:

I'd expect those 2 lines to be identical, or both missing. For most countries they are - with the following exceptions:

edomt commented 2 years ago

Hi @czka

@camappel will be able to investigate this further, but the two main reasons should be:

The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".

camappel commented 2 years ago

Hi @czka ,

I'm glad you're using our data in your project! As @edomt explained, in scenarios where the case definition differs significantly from the test definition (resulting in more cases than reported positive tests), we either remove the positive rate completely or calculate it directly.

We have deliberately removed positive rate estimates for the following countries (reasons are listed in the script):

We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:

To get your graphs to match ours for these countries, use the positive rate column from these CSVs.

This only leaves Luxembourg and Palestine, which you say have 'lines off a bit at some dates'. I took a look at your charts and the differences are negligible, but it's still important to determine the cause. Could it be due to a rounding difference? We round to 3 decimal places (on line 185 of the generate_dataset script). Try rounding your figures to 3 decimal places and let me know if the discrepancy persists.
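One way to test the rounding hypothesis is to round the locally computed ratio to 3 decimal places before comparing it with the published value. The sketch below is only an illustration: the helper name `matches_after_rounding` and the numbers are made up, and it is not the code from the generate_dataset script.

```python
def matches_after_rounding(cases_smoothed, tests_smoothed, published_rate, places=3):
    """Return True if the locally computed ratio equals the published
    value once both are rounded to the same number of decimal places."""
    if not tests_smoothed:
        return False  # no tests reported, so there is nothing to compare
    computed = cases_smoothed / tests_smoothed
    return round(computed, places) == round(published_rate, places)


# Illustrative numbers only, not taken from owid-covid-data.csv.
print(matches_after_rounding(123.429, 2456.714, 0.05))   # ratio is about 0.05024
print(matches_after_rounding(123.429, 2456.714, 0.051))  # differs at the third decimal
```

If the comparison still fails after rounding, the discrepancy probably has another cause.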

czka commented 2 years ago

We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:

  • Argentina
  • Belgium
  • Colombia
  • Costa Rica
  • Czechia
  • Finland
  • France
  • Germany
  • Hong Kong
  • Lithuania
  • Liechtenstein
  • Mexico
  • Peru
  • Slovakia
  • Slovenia
  • Spain
  • Switzerland

@camappel So, if positive_rate = new_cases_smoothed / new_tests_smoothed, why don't my line for new_cases_smoothed / new_tests_smoothed * 100 and my line for positive_rate * 100 match for these 17 countries, when they do match for the majority of other countries?

Please check my CHARTS.md. The 2 lines, although supposed to match closely, are clearly off for Argentina, Belgium, Colombia etc., while they match perfectly for the remaining majority of countries (e.g. Armenia, Australia, Bolivia, Bulgaria and several dozen others).

I swear that all those 3*108 charts in my CHARTS.md were generated from a fresh owid-covid-data.csv in a uniform manner - namely using my covid_toll_tool.py, function plot_weekly.

If new_cases_smoothed / new_tests_smoothed * 100 and positive_rate * 100 match closely for around 80 of 108 countries, but they don't match for 17 countries although they should, I suppose it's fair to assume there's something fishy going on, right?
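The comparison described above can also be done numerically rather than by eye. The following is a minimal sketch, not the plot_weekly function from covid_toll_tool.py: the helper name `mismatched_locations`, the tolerance of 0.002 and the sample rows are all assumptions made for illustration, and the `location` and `date` columns are assumed to be laid out as in owid-covid-data.csv.

```python
import csv
import io

# Tiny made-up sample using the column names of owid-covid-data.csv.
SAMPLE = """location,date,new_cases_smoothed,new_tests_smoothed,positive_rate
Armenia,2022-01-10,200,4000,0.05
Armenia,2022-01-11,210,4000,0.053
Belgium,2022-01-10,9000,60000,0.21
Belgium,2022-01-11,9300,60000,0.22
"""


def mismatched_locations(csv_file, tolerance=0.002):
    """Return the sorted locations where new_cases_smoothed / new_tests_smoothed
    differs from positive_rate by more than `tolerance` on at least one date."""
    flagged = set()
    for row in csv.DictReader(csv_file):
        try:
            cases = float(row["new_cases_smoothed"])
            tests = float(row["new_tests_smoothed"])
            published = float(row["positive_rate"])
        except ValueError:
            continue  # skip rows where one of the fields is empty
        if tests == 0:
            continue  # avoid division by zero
        if abs(cases / tests - published) > tolerance:
            flagged.add(row["location"])
    return sorted(flagged)


print(mismatched_locations(io.StringIO(SAMPLE)))  # ['Belgium']
```

To run it against the real file, pass an open file handle instead of the `io.StringIO` sample.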

I appreciate that you have tried to address all my concerns in one batch. However, to have clear communication, let's get these 17 countries sorted out first.

edomt commented 2 years ago

Hi @czka

What @camappel meant to say is that for these 17 countries, the positive rate is recalculated on the basis of the positive tests reported by the national source, rather than the cases reported by Johns Hopkins University. Therefore, new_cases_smoothed is not the right variable to compare with. You may have missed this paragraph in my initial reply above:

We collect some positive rates from the source, rather than calculating them as new_cases_smoothed / new_tests_smoothed * 100. You can find the list in our documentation, or by looking in our country folder for files where Positive rate is already present.

The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".
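To make the distinction concrete, here is a small sketch with hypothetical numbers showing how the same test count gives two different positive rates depending on the numerator. The function `positive_rate` and all values are invented for illustration and are not taken from any country script.

```python
def positive_rate(positives, tests, places=3):
    """Share of tests that came back positive, rounded to `places` decimals.
    Returns None when no tests were reported."""
    if not tests:
        return None
    return round(positives / tests, places)


# Hypothetical 7-day averages for one country on one day.
jhu_new_cases_smoothed = 9000   # confirmed cases as reported by JHU
source_positive_tests = 12600   # positive tests as reported by the national source
new_tests_smoothed = 60000

print(positive_rate(jhu_new_cases_smoothed, new_tests_smoothed))  # 0.15
print(positive_rate(source_positive_tests, new_tests_smoothed))   # 0.21
```

A chart built from the first ratio will not overlap the published series when the dataset uses the second one.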

If you'd like to know exactly how each of these positive rates is calculated, you can look inside each country script in the automations folder. Two examples:


On a more "meta" note, we greatly appreciate user feedback as it ensures our work is checked by different pairs of eyes, and corrections can be made when necessary. This being said:

czka commented 2 years ago

@edomt Your clarification on how the positive rate is calculated for these 17 countries helps. Thank you. However, the information in https://github.com/owid/covid-19-data/tree/master/public/data/README.md on how positive_rate is derived doesn't cover this. Based on that information, I assumed it was derived in an identical fashion for all countries. It might help if that README was extended accordingly. It would probably have spared you my annoyingly exhaustive report ;).

As to your "meta note":

  1. I put everything together in a single ticket rather than flooding you with a dozen of them. Sorry about the hassle, but I thought it would be more efficient to have it all in one place. I'll think twice next time.
  2. I used those words with no ill intent. When I wrote "let's get these 17 countries sorted out first", that's all I meant: not to get distracted by the minor discrepancies while we are (or at least I am) trying to understand the biggest one.

My final conclusion is that positive_rate means different things for different countries, although the field is named the same for all of them. Confusing as it is, now I get it. Or do I?...