Update cait datasets

pabloarosado commented 2 years ago

Create a script to download and prepare CAIT datasets.
Add a generic module for sanity checks.
Add script to run sanity checks on CAIT datasets.
Create a script to upload CO2 dataset files to S3.
Update the main script to generate CO2 dataset files.
Add CAIT dataset files to the repo (in a new "grapher" folder).
Update the CO2 dataset CSV file and codebook, and remove the JSON and XLSX files (they will be hosted only in S3).
Update README (to have links to the CO2 dataset files in S3).
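For illustration, a generic sanity check of the kind listed above might look roughly like the sketch below. This is not the module added in this PR: the function name `check_rows`, its parameters, and the sample column names (`country`, `year`, `value`) are all assumptions made for the example.

```python
import csv
import io


def check_rows(rows, key_columns, numeric_columns=()):
    """Return a list of human-readable problems found in `rows`.

    `rows` is a list of dicts (for example from csv.DictReader).
    All names here are hypothetical; this is only a sketch.
    """
    if not rows:
        return ["dataset is empty"]

    # Every key and numeric column must be present in the header.
    required = list(key_columns) + list(numeric_columns)
    missing = [col for col in required if col not in rows[0]]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Each combination of key columns should appear only once.
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            problems.append(f"row {i}: duplicate key {key}")
        seen.add(key)

        # Non-empty values in numeric columns must parse as numbers.
        for col in numeric_columns:
            value = row[col]
            if value == "":
                continue  # empty cells are tolerated in this sketch
            try:
                float(value)
            except ValueError:
                problems.append(f"row {i}: {col} is not numeric: {value!r}")
    return problems


# Made-up sample data, only to show how the check is called.
SAMPLE = """country,year,value
A,2000,1.5
A,2000,2.0
B,2001,abc
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = check_rows(rows, key_columns=("country", "year"), numeric_columns=("value",))
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a script report everything that is wrong with a dataset in a single run.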

pabloarosado commented 2 years ago

Thanks @bnjmacdonald, I was trying to follow the philosophy of the energy-data repo. But I agree with you that it would be better to use importers or etl to upload the CAIT datasets to grapher. I could either do it now and cancel this PR, or leave it for a future improvement. I don't have a strong opinion on which option is better, @edomt?

edomt commented 2 years ago

Indeed, the general pipeline isn't optimal, but of course it was already like that before. There are several things to take into account, including that:

importers will be deprecated in the coming months
etl isn't fully ready yet, so even an implementation of this pipeline in etl would need to be revisited at some point
there are a lot of outstanding & high-priority datasets to deal with.

What I would suggest to get rid of the confusion mentioned by @bnjmacdonald is that:

we move prepare_cait_datasets.py & its output into a cait folder in importers in its current state, i.e. without refactoring it as a "true" importer that upserts data into Grapher (because that part will be deprecated soon). So something that looks like the population folder.
The output of that script is then manually uploaded to Grapher
main.py in this co2 repo then fetches that Grapher dataset to use it
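As a rough sketch of the last step, main.py could download a CSV export of the dataset and parse it before use. The thread does not say how the Grapher dataset would actually be fetched, so the URL below is a placeholder and the function names `parse_dataset` and `fetch_dataset` are hypothetical.

```python
import csv
import io
import urllib.request

# Placeholder only: the real dataset location is not given in this thread.
CAIT_DATASET_URL = "https://example.com/cait_dataset.csv"


def parse_dataset(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_dataset(url=CAIT_DATASET_URL):
    """Download a CSV export from `url` and parse it (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return parse_dataset(response.read().decode("utf-8"))
```

Keeping the parsing separate from the download means the parsing can be tested on a small in-memory string without network access.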

And at the next update of CAIT (presumably next year?), we'll transform the whole thing into a proper etl pipeline.

Let me know what you think :)

owid / co2-data

Update cait datasets #24