@rneher I can work on separating the json generation out as a first step. Looks like that would involve moving https://github.com/neherlab/covid19_scenarios_data/blob/master/generate_data.py#L28-L63 (except for the --parsers debug part) into a separate script.
Does that sound good?
Each parser has a line like this: https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/canada.py#L98
The jsons are currently generated straight from the fetched output:
https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/utils.py#L199
and the keys of sub-country levels (e.g. states, provinces) are prefixed with USA-, FRA-, etc. to avoid Georgia (country) and Georgia (state) clashes:
https://github.com/neherlab/covid19_scenarios_data/blob/master/parsers/utils.py#L188
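For illustration, a minimal sketch of that disambiguation (the helper name is hypothetical; the real logic lives around parsers/utils.py#L188):

```python
# Hypothetical sketch of the key prefixing; the actual code is in parsers/utils.py.
def prefix_region_key(country_code: str, region: str) -> str:
    """Prefix a sub-country region with its country code."""
    return f"{country_code}-{region}"

prefix_region_key("USA", "Georgia")  # -> 'USA-Georgia' (the US state)
# The country keeps the bare key 'Georgia', so the two no longer clash.
```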
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
One thing one could do is get rid of the deep folder structure and simply have a directory for each parser:
ecdc
|-- France.tsv
|-- Spain.tsv
...
unitedstates
|-- USA-Alaska.tsv
|-- USA-New Jersey.tsv
...
manuals
|-- USA-CO-myFavoriteVillage.tsv
...
this would disambiguate the json parsing (keys already fixed), maintain human-readable intermediates, and allow for manually added .tsv files in case this becomes necessary.
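A rough sketch of what json generation from such a layout could look like (the root directory name, file layout, and column handling here are assumptions for illustration, not current code):

```python
import csv
import json
from pathlib import Path

def collect_case_counts(root: str = "case-counts") -> dict:
    """Walk one directory per parser and build {key: rows} from the .tsv files.

    Assumes each file is named after its already-disambiguated key,
    e.g. ecdc/France.tsv or unitedstates/USA-Alaska.tsv.
    """
    cases = {}
    for tsv_path in sorted(Path(root).glob("*/*.tsv")):
        key = tsv_path.stem  # file name already carries the USA-/FRA- prefix
        with tsv_path.open() as fh:
            rows = list(csv.DictReader(fh, delimiter="\t"))
        cases.setdefault(key, []).extend(rows)  # later parsers append, not overwrite
    return cases

if __name__ == "__main__":
    with open("case_counts.json", "w") as fh:
        json.dump(collect_case_counts(), fh, indent=2)
```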
Just an idea. @nnoll and @noleti -- open for suggestions here.
As the json and tsv generation are mostly driven by the same data in utils.py, the main reasons one could succeed while the other fails are a) a bug (like mine I fixed this morning, sorry!), or b) the fact that everything is put into one json: if the script fails while updating the .json, it might become corrupted. That being said, some of the code in utils.py is more complex than it should be, because we support both json and tsv output (and two forms of input: dict of lists of dicts, and dict of lists of lists).
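As an aside, a common way to rule out that corruption mode (not what the current code does; just a standard pattern) is to write to a temporary file and atomically swap it in:

```python
import json
import os
import tempfile

def write_json_atomically(data, path="case_counts.json"):
    """Write JSON so that a crash mid-write cannot corrupt the existing file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```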
@rneher merging multiple entries for the same country is currently done during the update of case_counts.json (in merge_cases in utils.py). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
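Roughly, the behavior is (a simplified sketch with a nested-dict structure; the real merge_cases in utils.py works on the actual case-count records):

```python
def merge_cases(old: dict, new: dict) -> dict:
    """Fold `new` into `old`, adding missing keys but never overwriting values.

    Simplified structure for illustration: {country: {date: {field: value}}}.
    """
    for country, days in new.items():
        if country not in old:
            old[country] = days
            continue
        for date, fields in days.items():
            existing = old[country].setdefault(date, {})
            for field, value in fields.items():
                existing.setdefault(field, value)  # keep the first source's value
    return old
```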
@tryggvigy if you want to split .tsv and .json generation, you will need to modify store_json in utils.py, as it is used by most parsers to store both .tsv and .json.
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
Can you point me to an example of this?
merging multiple entries for the same country is currently done during the update of the case_counts.json (in utils.py merge_cases). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
Would this still be needed if I am generating the json from the tsv files on disk?
Seems like a few improvements can be made separately:
Please correct me if any of these do not make sense. I would like to work on the first bullet tomorrow if you agree this is what we should do.
One subtlety is that some countries are covered by multiple sources. @noleti has thought more about this than I have.
Can you point me to an example of this?
Currently, each parser stores data to .tsv and .json using store_data() in parsers/utils.py. In that function, we catch some corner cases (due to the mix of world.tsv from ecdc.py and the different folder structure used by most parsers) and eventually call store_json(). In that function, we import the current case_counts.json and pass it, together with the new data, to merge_cases(). That function updates/adds data to the old dict, and the result is stored as case_counts.json again (until the next parser adds its content). Multiple data sources come from ecdc.py and cds.py, which query worldwide data APIs.
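In outline, that read-merge-write cycle looks something like this (a paraphrase of the flow just described, with the merge reduced to a stub; not the literal utils.py code):

```python
import json

def store_json(new_cases: dict, path: str = "case_counts.json") -> None:
    """Each parser runs this cycle in turn against the shared case_counts.json."""
    try:
        with open(path) as fh:  # load the aggregate written by earlier parsers
            case_counts = json.load(fh)
    except FileNotFoundError:
        case_counts = {}
    # fold in this parser's data; stub for the real merge_cases, which also
    # merges day-by-day without overwriting existing values
    for key, value in new_cases.items():
        case_counts.setdefault(key, value)
    with open(path, "w") as fh:  # write back; a crash here risks a corrupt file
        json.dump(case_counts, fh)
```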
merging multiple entries for the same country is currently done during the update of the case_counts.json (in utils.py merge_cases). If we have two entries with the same country name, for each day a check is done whether additional keys exist, and they are added to the old entries. No data is overwritten by the second dataset.
Would this still be needed if I am generating the json from the tsv files on disk?
What you are describing was the original setup, which assumes that parsers never provide overlapping results. Given that there are meta-APIs like ecdc and coronadatascraper.com, a solid mechanism to merge data sources was one of the things I found most important. This is independent of the .tsv vs .json discussion.
BTW: parsers/tsv.py is a first parser I wrote to go over the .tsv files and parse them into .json again. It's not really required at the moment, so there might be some bugs we didn't notice.
One thing one could do is get rid of the deep folder structure and simply have a directory for each parser
ecdc
|-- France.tsv
|-- Spain.tsv
...
unitedstates
|-- USA-Alaska.tsv
|-- USA-New Jersey.tsv
...
manuals
|-- USA-CO-myFavoriteVillage.tsv
...
this would disambiguate the json parsing (keys already fixed), maintain human-readable intermediates, and allow for manually added tsv in case this becomes necessary.
Just an idea. @nnoll and @noleti -- open for suggestions here.
That folder structure looks great. It would simplify the code in utils.py, because then file names would match the keys in the main dict we use. The world.tsv special case would go away as well. An explicit folder for manual .tsvs would be great.
@tryggvigy Do you want to have a stab at that, or should I do it? As some of the related code in utils.py is from me, it could be easier for me. But if you want to dive into this, please go ahead!
I created https://github.com/neherlab/covid19_scenarios_data/issues/67 with explicit ideas on how to update utils.py and the parser interaction. I made it a separate issue to keep the discussion in this thread higher level. If you disagree, feel free to merge both issues.
The first part of this is done and merged now. The second part, tsv validation and more robust .json generation, remains. .tsv parsing is now done in scripts/tsv.py, but no real validation is done yet.
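For that remaining validation step, something along these lines could be a starting point (the required columns are an assumption about the .tsv schema, purely illustrative):

```python
import csv

REQUIRED_COLUMNS = ["time", "cases", "deaths"]  # assumed schema, for illustration

def validate_tsv(path: str) -> list:
    """Collect basic sanity errors for one per-region .tsv file."""
    errors = []
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"{path}: missing columns {missing}"]
        for lineno, row in enumerate(reader, start=2):
            for col in REQUIRED_COLUMNS[1:]:
                value = row.get(col, "")
                if value and not value.lstrip("-").isdigit():
                    errors.append(f"{path}:{lineno}: non-numeric {col}={value!r}")
                elif value and int(value) < 0:
                    errors.append(f"{path}:{lineno}: negative {col}={value!r}")
    return errors
```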
I'm transferring this to the main repo as we are no longer going to keep this as a submodule but rather as a directory within covid19_scenarios.
The .json validation is being addressed now in https://github.com/neherlab/covid19_scenarios/pull/390. In general, this is related to https://github.com/neherlab/covid19_scenarios/issues/283 (references to keep things synchronized).
This was mostly done in #390 and #434.
The primary purpose of this repo is to aggregate parsers and sources to be used in covid19_scenarios, where it is included as a submodule. In PR https://github.com/neherlab/covid19_scenarios/pull/267, automated updating of the app is discussed, and one issue is robustness. Suggestions are: