neherlab / covid19_scenarios_data

Data preprocessing scripts and preprocessed data storage for COVID-19 Scenarios project
https://github.com/neherlab/covid19_scenarios

Structure for parsed data files #12

Closed noleti closed 4 years ago

noleti commented 4 years ago

The current approach with both a World.tsv and individual .tsv files for subcountries/cities is confusing (at least to me). I see that in covid19_scenarios/tools/collect_case_data_to_json.py the files are aggregated into one big JSON again, with the individual .tsv files taking precedence over World.tsv. The country/region info is pulled from the file path (and apparently not used further).
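The precedence rule described above can be sketched roughly like this (function and variable names are my own, not the actual code in collect_case_data_to_json.py):

```python
# Hypothetical sketch: data from per-region .tsv files overrides
# World.tsv data for the same region key during aggregation.
import csv

def read_tsv(path):
    """Read a parsed case-count .tsv into a list of row dicts."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))

def merge_with_precedence(world, regions):
    """Entries from specialized parsers replace the World.tsv entries."""
    merged = dict(world)   # region name -> rows parsed from World.tsv
    merged.update(regions) # region-specific .tsv files win
    return merged
```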

Why not just have a single .tsv (or JSON) file in this repo, with the country/region data included in it? That would also get rid of the parsing in covid19_scenarios. Having everything in JSON would make it easy for each parser to read the file and then add more entries. JSON is of course not as easily editable by hand, but at the scale of data we are talking about, manual editing is probably no longer feasible anyway.

noleti commented 4 years ago

I noted that if we call json.dump(cases, fn) before flatten(cases) in the parsers, we get the same JSON structure that is reparsed later in covid19_scenarios. So going full JSON seems really easy: each parser would create/update a global case_counts.json. We save ourselves the headache of a file-system hierarchy carrying metadata, and of converting to/from .tsv. The only thing lost is human-readable data files. In my fork, I already have a cds.py that produces such a case_counts.json from coronadatascraper.com, including data on regions such as USA-OK-Love County.
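The create/update step could look something like this minimal sketch (the helper name and merge semantics are assumptions, not the code from my fork):

```python
import json

def store_json(cases, path="case_counts.json"):
    """Merge this parser's nested {region: [records]} dict into the
    shared case_counts.json, creating the file if it does not exist."""
    try:
        with open(path) as fh:
            merged = json.load(fh)
    except FileNotFoundError:
        merged = {}
    merged.update(cases)  # this parser's regions replace stale entries
    with open(path, "w") as fh:
        json.dump(merged, fh, indent=1, sort_keys=True)
```

Each parser would then call store_json() with its own regions, so the global file accumulates data across parsers.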

nnoll commented 4 years ago

I agree that the data flow is suboptimal right now. As you state, the original intent behind having both a World.tsv and individual .tsv files is that data found in individual location directories takes precedence over World.tsv in the final JSON; i.e., World.tsv is used only where we don't have a specialized parser for a region, since the world data aggregators seem to provide only case/death numbers. Individual regions should strive for the full dataset.

I would be open to condensing the structure, as I'm not fully happy with it either, but we need a bulletproof method to ensure country names are correct (which the folder structure provides, albeit in a bulky way). I'm also not sure about getting rid of a user-editable input format, as I'd like hospitals/clinicians to keep the option of entering their own data directly.

noleti commented 4 years ago

We can enforce correct country names by allowing only names from country_codes.csv, no? It would likely get messier with states/counties, as we don't have an authoritative list for them. I still don't see the advantage of encoding metadata in file/folder names, but I do see the advantage of optional .csv support. How about the following:
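Validation against country_codes.csv could be as simple as the following sketch (the column name and the region-key format "Country-State-County" are assumptions based on the USA-OK-Love County example above):

```python
import csv

def load_valid_countries(path="country_codes.csv"):
    """Assumed layout: country_codes.csv has a 'name' column."""
    with open(path, newline="") as fh:
        return {row["name"] for row in csv.DictReader(fh)}

def check_region(region_key, valid_countries):
    """Reject region keys whose leading country part is unknown."""
    country = region_key.split("-")[0]
    if country not in valid_countries:
        raise ValueError(f"unknown country in region key: {region_key!r}")
```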

nnoll commented 4 years ago

This seems like a good solution. If you want to have a go at this restructuring, it would be greatly appreciated. I'll then run the tests on my end with a full build to make sure covid19_scenarios can't tell the difference.

noleti commented 4 years ago

Sure, I will take a shot at this later tonight or tomorrow. Thanks for your great work on covid19_scenarios!

noleti commented 4 years ago

I have now created https://github.com/neherlab/covid19_scenarios_data/pull/23, which essentially implements what we discussed without breaking legacy compatibility for now. If you are fine with it in general, I can also add a README for .tsv contributors, and we would then get rid of the subdirectories of case-counts/. As all data is now merged semi-intelligently, I no longer see a reason for this overhead.

remaining steps for complete switch to .json:

open issues from discussion above:

rneher commented 4 years ago

I kind of like the .tsv files. Since each one only grows by a line per day, they don't generate much overhead. I have been using case-counts.json directly in covid19_scenarios, and we should be able to remove the script producing the JSON there. I updated the Germany parser to produce the JSON output.

noleti commented 4 years ago

Sure, it's your project after all. If both the JSON and .tsv outputs of the parsers are to be kept, then the README of this repo should probably mention how to add the JSON output, as tsv.py currently does not pick up files in subdirectories of case-counts/. So the output of a new parser that only generates .tsv in a subdirectory will not end up in case_counts.json. Alternatively, tsv.py could be extended to scan subdirectories.
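The subdirectory extension would be a one-liner with pathlib; a sketch (the function name is hypothetical, not from tsv.py):

```python
from pathlib import Path

def find_case_tsvs(root="case-counts"):
    """Collect .tsv files at any depth below root, so a new parser
    that only writes into a subdirectory is still picked up."""
    return sorted(Path(root).rglob("*.tsv"))
```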

noleti commented 4 years ago

I think we can close this issue now: store_data() takes care of both the .tsv and .json export, and the backend uses the .json directly.
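A minimal sketch of what such a dual export might look like (the signature, field names, and file-naming scheme here are assumptions, not the actual store_data() code):

```python
import csv
import json

def store_data(cases, basename):
    """Hypothetical dual export: one .json for the backend, plus one
    human-editable .tsv per region. Field names are assumptions."""
    with open(basename + ".json", "w") as fh:
        json.dump(cases, fh, indent=1, sort_keys=True)
    fields = ["time", "cases", "deaths"]
    for region, rows in cases.items():
        with open(f"{basename}-{region}.tsv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, delimiter="\t")
            writer.writeheader()
            writer.writerows({k: r.get(k, "") for k in fields} for r in rows)
```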