Streamline parsers and manual .tsv, Re issue #12

neherlab / covid19_scenarios_data

Data preprocessing scripts and preprocessed data storage for COVID-19 Scenarios project

https://github.com/neherlab/covid19_scenarios

Streamline parsers and manual .tsv, Re issue #12 #23

Closed noleti closed 4 years ago

noleti commented 4 years ago

Re: https://github.com/neherlab/covid19_scenarios_data/issues/12

This set of patches changes the existing parsers to also update case-counts/case_counts.json. Parsers somewhat intelligently merge their data with existing data for a region or country (only adding missing data for each day, never overwriting or removing). The order of the parsers determines precedence. I also added a tsv.py parser to parse all .tsv files in the case-counts/ folder directly and update the big .json with those values.
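The merge rule described above (fill in only what is missing for each day, never overwrite or remove) could look roughly like the sketch below. This is not the code from the patch: the function name `merge_series` and the record layout (a list of dicts keyed by `"time"`) are assumptions made for illustration.

```python
def merge_series(existing, incoming):
    """Merge incoming daily records into existing ones without overwriting.

    Hypothetical helper: both arguments are lists of dicts that carry a
    "time" key in YYYY-MM-DD form. A field is only filled in when the
    existing record lacks it (key missing or value None).
    """
    by_day = {rec["time"]: dict(rec) for rec in existing}
    for rec in incoming:
        # Create an empty record for days that are not known yet.
        target = by_day.setdefault(rec["time"], {"time": rec["time"]})
        for key, value in rec.items():
            if target.get(key) is None and value is not None:
                target[key] = value
    # YYYY-MM-DD strings sort chronologically.
    return [by_day[day] for day in sorted(by_day)]


# The parser that runs first wins; later parsers only fill gaps.
first = [{"time": "2020-01-26", "cases": 0, "deaths": 0}]
second = [
    {"time": "2020-01-26", "cases": 5, "recovered": 0},
    {"time": "2020-01-27", "cases": 1, "deaths": 0},
]
merged = merge_series(first, second)
```

Because existing values are never replaced, running the parsers in a fixed order gives the earlier parser precedence for any day both of them cover.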

I still don't have the covid19_scenarios frontend running, so I could only do limited tests. This patch should not break anything, as old .tsv files are still produced. They are currently actually parsed again by tsv.py, but as we have a merge algorithm now this should not be an issue (runtime is wasted, but on my machine everything is fast enough).

rneher commented 4 years ago

Thanks a lot. This looks super useful and I agree we should work off the case-counts.json directly. I just tried it and ran into one problem though. There are duplicated entries in the json and a mix of strings and numbers. These are values for Germany.

    {
      "time": "2020-01-26",
      "deaths": 0,
      "cases": 0
    },
    {
      "time": "2020-1-26",
      "cases": "0",
      "deaths": "0",
      "recovered": "0"
    },
    {
      "time": "2020-01-27",
      "deaths": 0,
      "cases": 0
    },

Also note that the date format is not 100% consistent: it should be 'YYYY-MM-DD', which ensures sortability. I think this might be the output of the cds.py parser.
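One way to avoid both problems (unpadded dates and counts stored as strings) is to normalize every record before it is merged. The sketch below is only an illustration, not the fix that went into the parsers; `normalize_record`, `to_count`, and `COUNT_FIELDS` are hypothetical names.

```python
from datetime import datetime

# Hypothetical list of numeric fields, taken from the example entries above.
COUNT_FIELDS = ("cases", "deaths", "recovered")


def normalize_date(text):
    """Return the date as zero-padded YYYY-MM-DD (accepts e.g. '2020-1-26')."""
    return datetime.strptime(text, "%Y-%m-%d").strftime("%Y-%m-%d")


def to_count(value):
    """Convert '0' or 0 to the integer 0; keep missing values as None."""
    if value is None or value == "":
        return None
    return int(value)


def normalize_record(rec):
    out = {"time": normalize_date(rec["time"])}
    for field in COUNT_FIELDS:
        if field in rec:
            out[field] = to_count(rec[field])
    return out


fixed = normalize_record(
    {"time": "2020-1-26", "cases": "0", "deaths": "0", "recovered": "0"}
)
```

With zero-padded dates, the two Germany entries for 26 January share the same key, so a merge step can recognize them as the same day.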

rneher commented 4 years ago

Otherwise it looks great and merges well with the fixes to the india parser I pushed.

noleti commented 4 years ago

Can you point me to a duplicated entry? I thought I caught that.

rneher commented 4 years ago

I should have been more precise. There are entries in the case-counts.json for Germany (just an example) with dates 2020-1-26 and 2020-01-26. One of them has its data as strings, the other as numbers.

noleti commented 4 years ago

What should be the default behaviour for empty values? None, or 0?

rneher commented 4 years ago

covid19_scenarios expects something that evaluates to false. The current parser puts None. I think None is better than 0 because 0 could be interpreted as data. We can think about something like nan or NA, but for now None/null is good.
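As a small illustration of the point about None versus 0: Python's standard `json` module writes None as `null`, and both values are falsy, so only an explicit `is None` check tells "no data" apart from "zero reported". The record and the helper `has_data` below are made up for this example.

```python
import json

# Hypothetical record: zero cases were reported, deaths are unknown.
record = {"time": "2020-01-26", "cases": 0, "deaths": None}

encoded = json.dumps(record)   # None is serialized as JSON null
decoded = json.loads(encoded)  # and comes back as None


def has_data(value):
    """True for any reported value, including 0; False only for None."""
    return value is not None


both_falsy = (not decoded["cases"]) and (not decoded["deaths"])
```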

noleti commented 4 years ago

Ok, I think I fixed everything now. Can you have a look again?

rneher commented 4 years ago

This looks very good now. One quick thing I just noticed: in the ECDC parser, the cumulative counts should be assigned even when there is no update.

--- a/parsers/ecdc.py
+++ b/parsers/ecdc.py
@@ -69,10 +69,10 @@ def retrieve_case_data():
         for d in data:
             if d['cases']:
                 total_cases += d['cases']
-                d['cases'] = total_cases
+            d['cases'] = total_cases
             if d['deaths']:
                 total_deaths += d['deaths']
-                d['deaths'] = total_deaths
+            d['deaths'] = total_deaths

ping me when you are done and I'll merge.

noleti commented 4 years ago

Done. I checked, and there are a lot of cases where deaths or cases are None although the total is >0. So I assume the dataset uses None, instead of 0, if no case is reported.
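The effect of the change suggested in the diff can be sketched in isolation as follows. This is a simplified stand-in for the loop in parsers/ecdc.py, not a copy of it; the function name `accumulate` and the sample data are assumptions.

```python
def accumulate(daily):
    """Turn daily counts into cumulative counts in place.

    Days whose daily value is None (nothing reported) still receive the
    running total, so the cumulative series has no gaps.
    """
    total_cases = 0
    total_deaths = 0
    for d in daily:
        if d["cases"]:
            total_cases += d["cases"]
        d["cases"] = total_cases      # assigned even without an update
        if d["deaths"]:
            total_deaths += d["deaths"]
        d["deaths"] = total_deaths    # assigned even without an update
    return daily


# Made-up daily values with gaps, as described in the comment above.
days = accumulate([
    {"cases": 2, "deaths": None},
    {"cases": None, "deaths": 1},
    {"cases": 3, "deaths": None},
])
```

With the assignment inside the `if`, the second day would have kept `cases` as None even though two cases had already been counted.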

rneher commented 4 years ago

This all works as far as I can tell. Thanks so much and sorry about the fiddliness... I'll merge and test drive it a little more.

noleti commented 4 years ago

Great, thanks a lot for all your work. I think interactive sites like covid_scenarios are super helpful for understanding the current situation and for helping the general population make informed decisions.