@rneher I can work on separating the json generation out as a first step. Looks like that would involve moving https://github.com/neherlab/covid19_scenarios_data/blob/master/generate_data.py#L28-L63 (except for the --parsers debug part) into a separate script.
Does that sound good?
Each parser has a line like this: https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/canada.py#L98
The jsons are currently generated straight from the fetched output:
https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/utils.py#L199
and the keys of sub-country levels (e.g. states, provinces) are prefixed with USA-, FRA-, etc. to avoid Georgia (country) and Georgia (state) clashes:
https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/utils.py#L188
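For illustration, a minimal sketch of that disambiguation (the helper name is hypothetical; the real logic lives around parsers/utils.py#L188):

```python
# Hypothetical sketch of the key prefixing; the actual code is in parsers/utils.py.
def prefix_region_key(country_code: str, region: str) -> str:
    """Prefix a sub-country region with its country code."""
    return f"{country_code}-{region}"

prefix_region_key("USA", "Georgia")  # -> 'USA-Georgia' (the US state)
# The country keeps the bare key 'Georgia', so the two no longer clash.
```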
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
One thing one could do is get rid of the deep folder structure and simply have a directory for each parser:
ecdc
|-- France.tsv
|-- Spain.tsv
...
unitedstates
|-- USA-Alaska.tsv
|-- USA-New Jersey.tsv
...
manuals
|-- USA-CO-myFavoriteVillage.tsv
...
this would disambiguate the json parsing (keys already fixed), maintain human-readable intermediates, and allow for manually added .tsv files in case this becomes necessary.
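A rough sketch of what json generation from such a layout could look like (the root directory name, file layout, and column handling here are assumptions for illustration, not current code):

```python
import csv
import json
from pathlib import Path

def collect_case_counts(root: str = "case-counts") -> dict:
    """Walk one directory per parser and build {key: rows} from the .tsv files.

    Assumes each file is named after its already-disambiguated key,
    e.g. ecdc/France.tsv or unitedstates/USA-Alaska.tsv.
    """
    cases = {}
    for tsv_path in sorted(Path(root).glob("*/*.tsv")):
        key = tsv_path.stem  # file name already carries the USA-/FRA- prefix
        with tsv_path.open() as fh:
            rows = list(csv.DictReader(fh, delimiter="\t"))
        cases.setdefault(key, []).extend(rows)  # later parsers append, not overwrite
    return cases

if __name__ == "__main__":
    with open("case_counts.json", "w") as fh:
        json.dump(collect_case_counts(), fh, indent=2)
```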
Just an idea. @nnoll and @noleti -- open for suggestions here.
As the json and tsv generation are mostly driven by the same data in utils.py, the main reasons one could succeed while the other fails are a) a bug (like mine I fixed this morning, sorry!), or b) the fact that everything is put into one json: if the script fails while updating the .json, it might become corrupted. That being said, some of the code in utils.py is more complex than it should be, because we support both json and tsv output (and two forms of input: dict of lists of dicts, and dict of lists of lists).
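As an aside, a common way to rule out that corruption mode (not what the current code does; just a standard pattern) is to write to a temporary file and atomically swap it in:

```python
import json
import os
import tempfile

def write_json_atomically(data, path="case_counts.json"):
    """Write JSON so that a crash mid-write cannot corrupt the existing file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```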
@rneher merging multiple entries for the same country is currently done during the update of case_counts.json (in merge_cases in utils.py). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
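Roughly, the behavior is (a simplified sketch with a nested-dict structure; the real merge_cases in utils.py works on the actual case-count records):

```python
def merge_cases(old: dict, new: dict) -> dict:
    """Fold `new` into `old`, adding missing keys but never overwriting values.

    Simplified structure for illustration: {country: {date: {field: value}}}.
    """
    for country, days in new.items():
        if country not in old:
            old[country] = days
            continue
        for date, fields in days.items():
            existing = old[country].setdefault(date, {})
            for field, value in fields.items():
                existing.setdefault(field, value)  # keep the first source's value
    return old
```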
@tryggvigy if you want to split .tsv and .json generation, you will need to modify store_json in utils.py, as it is used by most parsers to store both .tsv and .json.
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
Can you point me to an example of this?
merging multiple entries for the same country is currently done during the update of the case_counts.json (in utils.py merge_cases). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
Would this still be needed if I am generating the json from the tsv files on disk?
Seems like a few improvements can be made separately:
Please correct me if any of these do not make sense. I would like to work on the first bullet tomorrow if you agree this is what we should do.
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
Can you point me to an example of this?
Currently, each parser stores data to .tsv and .json using store_data() in parsers/utils.py. In that function, we catch some corner cases (due to the mix of world.tsv from ecdc.py and the different folder structure used by most parsers) and eventually call store_json(). In that function, we import the current case_counts.json and pass it, together with the new data, to merge_cases(). That function updates/adds data to the old dict, and the result is stored as case_counts.json again (until the next parser adds its content). Multiple data sources come from ecdc.py and cds.py, which query worldwide data APIs.
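In outline, that read-merge-write cycle looks something like this (a paraphrase of the flow just described, with the merge reduced to a stub; not the literal utils.py code):

```python
import json

def store_json(new_cases: dict, path: str = "case_counts.json") -> None:
    """Each parser runs this cycle in turn against the shared case_counts.json."""
    try:
        with open(path) as fh:  # load the aggregate written by earlier parsers
            case_counts = json.load(fh)
    except FileNotFoundError:
        case_counts = {}
    # fold in this parser's data; stub for the real merge_cases, which also
    # merges day-by-day without overwriting existing values
    for key, value in new_cases.items():
        case_counts.setdefault(key, value)
    with open(path, "w") as fh:  # write back; a crash here risks a corrupt file
        json.dump(case_counts, fh)
```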
merging multiple entries for the same country is currently done during the update of the case_counts.json (in utils.py merge_cases). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
Would this still be needed if I am generating the json from the tsv files on disk?
What you are describing was the original setup, which assumes that parsers never provide overlapping results. Given that there are meta-APIs like ecdc and coronadatascraper.com, a solid mechanism to merge data sources was one of the things I found most important. This is independent of the .tsv vs .json discussion.
BTW: parsers/tsv.py is a first parser I wrote to go over the .tsv files and parse them into .json again. It's not really required at the moment, so there might be some bugs we didn't notice.
One thing one could do is get rid of the deep folder structure and simply have a directory for each parser
ecdc
|-- France.tsv
|-- Spain.tsv
...
unitedstates
|-- USA-Alaska.tsv
|-- USA-New Jersey.tsv
...
manuals
|-- USA-CO-myFavoriteVillage.tsv
...
this would disambiguate the json parsing (keys already fixed), maintain human-readable intermediates, and allow for manually added tsv in case this becomes necessary.
Just an idea. @nnoll and @noleti -- open for suggestions here.
That folder structure looks great. It would simplify the code in utils.py, because then file names would match the keys in the main dict we use. The world.tsv special case would go away as well. An explicit folder for manual .tsvs would be great.
@tryggvigy Do you want to have a stab at that, or should I do it? As some of the related code in utils.py is from me, it could be easier for me. But if you want to dive into this, please go ahead!
I created https://github.com/neherlab/covid19_scenarios_data/issues/67 with explicit ideas on how to update utils.py and the parser interaction. I made it a separate issue to keep the discussion in this thread higher level. If you disagree, feel free to merge both issues.
The first part of this is done and merged now. The second part, tsv validation and more robust .json generation, remains. .tsv parsing is now done in scripts/tsv.py, but no real validation is done yet.
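For that remaining validation step, something along these lines could be a starting point (the required columns are an assumption about the .tsv schema, purely illustrative):

```python
import csv

REQUIRED_COLUMNS = ["time", "cases", "deaths"]  # assumed schema, for illustration

def validate_tsv(path: str) -> list:
    """Collect basic sanity errors for one per-region .tsv file."""
    errors = []
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"{path}: missing columns {missing}"]
        for lineno, row in enumerate(reader, start=2):
            for col in REQUIRED_COLUMNS[1:]:
                value = row.get(col, "")
                if value and not value.lstrip("-").isdigit():
                    errors.append(f"{path}:{lineno}: non-numeric {col}={value!r}")
                elif value and int(value) < 0:
                    errors.append(f"{path}:{lineno}: negative {col}={value!r}")
    return errors
```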
I'm transferring this to the main repo as we are no longer going to keep this as a submodule but rather as a directory within covid19_scenarios.
The .json validation is being addressed now in https://github.com/neherlab/covid19_scenarios/pull/390. In general, this is related to https://github.com/neherlab/covid19_scenarios/issues/283 (references to keep things synchronized).
This was mostly done in #390 and #434.
The primary purpose of this repo is to aggregate parsers and sources to be used in covid19_scenarios, where it is included as a submodule. In PR https://github.com/neherlab/covid19_scenarios/pull/267, automated updating of the app is discussed, and one issue is robustness. Suggestions are: