:hammer: Turn on parquet co-generation

owid / etl

A compute graph for loading and transforming OWID's data

https://docs.owid.io/projects/etl

MIT License


:hammer: Turn on parquet co-generation #3525

Closed Marigold closed 2 weeks ago

Marigold commented 2 weeks ago

Implements https://github.com/owid/etl/issues/3490

Allow loading DEFAULT_FORMATS from an environment variable. This will let us generate both feather and parquet in production and push them into our data catalog in R2. Local development would still use only feather by default. @danyx23 do you see any value in generating parquet on staging servers?
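A minimal sketch of how such an environment lookup could work, assuming a comma-separated value and a feather-only fallback. The function name `parse_default_formats` and the `SUPPORTED` set are hypothetical and only for illustration; they are not taken from the actual change in this PR.

```python
import os

# Hypothetical set of allowed formats; the real ETL may support others.
SUPPORTED = {"feather", "parquet"}


def parse_default_formats(env=None):
    """Return the formats listed in DEFAULT_FORMATS, falling back to feather."""
    env = os.environ if env is None else env
    raw = env.get("DEFAULT_FORMATS", "feather")
    # Split on commas, trim whitespace, and drop empty entries.
    formats = [f.strip().lower() for f in raw.split(",") if f.strip()]
    unknown = [f for f in formats if f not in SUPPORTED]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    # An empty or blank value behaves like an unset variable.
    return formats or ["feather"]
```

With `DEFAULT_FORMATS=feather,parquet` set in production this would yield both formats, while an unset variable keeps local development on feather only.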

We used to embed all our metadata directly in the parquet files, but that made them inefficient and no one was using the embedded metadata, so we removed it. Metadata is still available in a sidecar [table].meta.json in the same folder as [table].parquet.
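For anyone reading the parquet files directly, a rough sketch of locating the sidecar metadata, assuming only the naming convention described above. The helper name `load_sidecar_metadata` is hypothetical, and the structure of the JSON is not assumed here.

```python
import json
from pathlib import Path


def load_sidecar_metadata(parquet_path):
    """Load [table].meta.json stored next to [table].parquet."""
    path = Path(parquet_path)
    # Swap the ".parquet" suffix for ".meta.json" in the same folder.
    meta_path = path.with_name(path.stem + ".meta.json")
    with meta_path.open() as f:
        return json.load(f)
```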

TODO after merging

[x] Add DEFAULT_FORMATS=feather,parquet env to production
[ ] Rebuild entire ETL (either by incrementing ETL_EPOCH or pandas version)

owidbot commented 2 weeks ago

Quick links (staging server): Site | Admin | Wizard | Docs

Login: ssh owid@staging-site-generate-parquet

chart-diff: ✅

No charts for review.

data-diff: ✅ No differences found

```diff
Legend: +New  ~Modified  -Removed  =Identical
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
```

Automatically updated datasets matching _weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk_ are not included.

Edited: 2024-11-12 05:12:27 UTC Execution time: 12.96 seconds

danyx23 commented 2 weeks ago

This is very nice, @Marigold, thanks! Now I can finally just

```
duckdb
> from 'data/garden/energy/2024-06-20/primary_energy_consumption/primary_energy_consumption.parquet' limit 10
```

I don't think we need this on the staging servers for now; we can switch it on once someone asks for it or a need arises.

If we switch this on in production, will this cause any issues with anything in the existing catalog that might rely on feather files?

Marigold commented 2 weeks ago

> If we switch this on in production, will this cause any issues with anything in the existing catalog that might rely on feather files?

That's very unlikely. We used to publish both for a long time and never ran into any issues.

I'm not going to rebuild the ETL catalog yet; I'll wait for nullable types, which should be ready soon.