Hi @czka
@camappel will be able to investigate this further, but the two main reasons should be:
- We remove a few positive rates because the data at our disposal wouldn't produce an accurate measure. They're listed in https://github.com/owid/covid-19-data/blob/master/scripts/scripts/testing/testing_data_corrections.R
- Some positive rates we collect from the source, rather than calculate it as new_cases_smoothed / new_tests_smoothed * 100. You can find the list in our documentation, or by looking in our country folder for files where Positive rate is already present.
The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".
Hi @czka,
I'm glad you're using our data in your project! As @edomt explained, in scenarios where the case definition differs significantly from the test definition (resulting in more cases than reported positive tests), we either remove the positive rate completely, or calculate it directly.
We have deliberately removed positive rate estimates for the following countries (reasons are listed in the script):
We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:
To get your graphs to match ours for these countries, use the positive rate column from these CSVs.
This only leaves Luxembourg and Palestine, which you say have 'lines off a bit at some dates'; I took a look at your charts and the differences are negligible, but it's still important to determine the cause. Could it be due to a rounding difference? We round to 3 decimal places (on line 185 of the generate_dataset script). Try rounding your figures to 3 decimal places and let me know if the discrepancy persists.
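A minimal sketch of that rounding check in Python, for anyone who wants to try it. It assumes positive_rate is stored as a fraction (which plotting positive_rate * 100 against a percentage suggests) and that both inputs are already the smoothed values. The function names are made up for illustration and are not part of any OWID script.

```python
def computed_positive_rate(new_cases_smoothed, new_tests_smoothed, ndigits=3):
    """Return cases / tests rounded to `ndigits`, or None when it is undefined."""
    if new_cases_smoothed is None or not new_tests_smoothed:
        return None  # missing cases, or missing/zero tests
    return round(new_cases_smoothed / new_tests_smoothed, ndigits)


def matches_published(published, new_cases_smoothed, new_tests_smoothed, ndigits=3):
    """True when the published rate agrees with the recomputed one after rounding."""
    computed = computed_positive_rate(new_cases_smoothed, new_tests_smoothed, ndigits)
    if published is None or computed is None:
        # Treat "both missing" as agreement, "one missing" as a mismatch.
        return published is None and computed is None
    # Allow one unit in the last rounded place to absorb rounding-mode differences.
    return abs(published - computed) <= 10 ** -ndigits + 1e-12
```

If the remaining differences for Luxembourg and Palestine disappear under a check like this, rounding is the likely cause; if they persist, the inputs themselves differ.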
We calculate the positive rate directly - by dividing the number of positive tests by the number of tests - for the following countries:
- Argentina
- Belgium
- Colombia
- Costa Rica
- Czechia
- Finland
- France
- Germany
- Hong Kong
- Lithuania
- Liechtenstein
- Mexico
- Peru
- Slovakia
- Slovenia
- Spain
- Switzerland
@camappel So, if positive_rate = new_cases_smoothed / new_tests_smoothed, how come that my line of new_cases_smoothed / new_tests_smoothed * 100 and line of positive_rate * 100 don't match for these 17 countries although they do match for the majority of other countries?
Please check my CHARTS.md. The 2 lines, although supposed to match closely, are clearly off for Argentina, Belgium, Colombia etc., while they match perfectly for the remaining majority of countries (e.g. Armenia, Australia, Bolivia, Bulgaria and several dozen others).
I swear that all those 3*108 charts in my CHARTS.md were generated from a fresh owid-covid-data.csv in a uniform manner - namely using my covid_toll_tool.py, function plot_weekly.
If new_cases_smoothed / new_tests_smoothed * 100 and positive_rate * 100 match closely for around 80 of 108 countries, but they don't match for 17 countries although they should, I suppose it's fair to assume there's something fishy going on, right?
I appreciate that you have tried to address all my concerns in one batch. However, to have clear communication, let's get these 17 countries sorted out first.
Hi @czka
What @camappel meant to say is that for these 17 countries, the positive rate is recalculated on the basis of the positive tests reported by the national source, rather than the cases reported by Johns Hopkins University. Therefore, new_cases_smoothed is not the right variable to compare with. You may have missed this paragraph in my initial reply above:
Some positive rates we collect from the source, rather than calculate it as new_cases_smoothed / new_tests_smoothed * 100. You can find the list in our documentation, or by looking in our country folder for files where Positive rate is already present.
The latter situation applies mainly to countries where a positive rate is reported by the authorities and differs substantially from the one obtained with "JHU cases divided by OWID tests".
If you'd like to know exactly how each of these positive rates is calculated, you can look inside each country script in the automations folder. Two examples:
On a more "meta" note, we greatly appreciate user feedback as it ensures our work is checked by different pairs of eyes, and corrections can be made when necessary. This being said:
Feedback is more useful for us when it's targeted at a specific file or country, rather than at many files at once. Your previous issue #2153 also included a list of many countries and many time periods, asking our team to "verify if really neither of these could be due to OWID's processing rather than any other issues". This type of wide-ranging report is less useful to us, as we simply don't have the resources to tell a member of our team to stop their work to check so many countries, without more precise information.
While I appreciate your concern that the data could be wrong, I believe that expressions such as "something fishy" or "let's get this sorted out first" aren't useful to this discussion.
@edomt Your clarification on how positive rate is calculated for these 17 countries helps. Thank you. The information in https://github.com/owid/covid-19-data/tree/master/public/data/README.md on how positive_rate is derived doesn't cover this, however. Based on that information I assumed it was derived in an identical fashion for all countries. It might help if that README was extended accordingly. It could probably have spared you my annoyingly exhaustive report ;).
As to your "meta note":
The final conclusion for me is that positive_rate means different things for different countries, although the field is named the same for all of them. Confusing as it is, now I get it. Or do I?...
As of current master c7b9b886e8 there are several countries which seem to have their positive_rate field in owid-covid-data.csv not updated for some time.
On https://github.com/czka/covid_toll_tool/blob/positive_test_rate_for_report_to_owid/CHARTS.md I've just uploaded charts of the following 2 series, for the 108 countries I'm interested in:
- positive_rate * 100 (dashed orange)
- new_cases_smoothed / new_tests_smoothed * 100 (solid blue)
I'd expect those 2 lines to be identical, or both missing. For most countries they are - with the following exceptions:
- lines off significantly: Argentina, Belgium, Colombia, Costa Rica, Mexico, Peru, Spain
- lines slightly off: Czechia, Finland, France, Germany, Hong Kong, Liechtenstein, Lithuania, Slovakia, Slovenia, Spain, Switzerland
- lines off a bit at some dates: Luxembourg, Palestine
- positive_rate missing: Iceland, Lebanon, Qatar
- special cases:
  - Austria - positive_rate missing in 2020 and in the beginning of 2021
  - Brazil - positive_rate missing completely, maybe due to weird test or case count?
  - Ecuador - positive_rate missing for the 1st half of 2020, maybe a similar issue as with Brazil?
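The per-country comparison described above can be sketched in a few lines of Python. This is not how covid_toll_tool.py does it; the function name is hypothetical, the `location` column name is an assumption about the CSV layout, and the sample rows are invented placeholders rather than real figures. It treats positive_rate as a fraction, so nothing is multiplied by 100.

```python
import csv
import io


def max_rate_gap_by_location(rows):
    """Map each location to the largest |positive_rate - cases / tests| seen."""
    gaps = {}
    for row in rows:
        try:
            location = row["location"]
            published = float(row["positive_rate"])
            cases = float(row["new_cases_smoothed"])
            tests = float(row["new_tests_smoothed"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows where any needed field is blank or absent
        if tests == 0:
            continue  # ratio undefined
        gap = abs(published - cases / tests)
        gaps[location] = max(gaps.get(location, 0.0), gap)
    return gaps


# Invented placeholder rows, only to show the expected input shape.
SAMPLE = """location,date,new_cases_smoothed,new_tests_smoothed,positive_rate
Country A,2021-01-01,50,1000,0.05
Country A,2021-01-02,60,1000,0.06
Country B,2021-01-01,50,1000,0.08
Country B,2021-01-02,,1000,
"""

gaps = max_rate_gap_by_location(csv.DictReader(io.StringIO(SAMPLE)))
# Flag locations whose gap exceeds one unit in the third decimal place.
flagged = sorted(loc for loc, gap in gaps.items() if gap > 0.001)
```

To run it on the real file, pass `csv.DictReader(open("owid-covid-data.csv", newline=""))` in place of the sample. Locations that end up in `flagged` are candidates for a positive rate taken from the national source rather than derived from JHU cases.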