owid / data-api

API for accessing data from our data catalog.

Replicate data to local DuckDB or fetch dynamically from S3? #6

Open · Marigold opened this issue 2 years ago

Marigold commented 2 years ago

Right now we replicate all data into a local DuckDB database. That has some advantages, such as query performance and no traffic between S3 and our server. On the other hand, a few datasets like faostat or SDG are huge, and replicating them takes a long time (though we might only have to do it once). I haven't synced all datasets yet, but I estimate our entire database would be ~10GB (there's a lot of room for optimisation though!), which isn't small.
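
For context, a minimal sketch of roughly what the replication step amounts to, assuming a hypothetical parquet export of one catalog table (the real ingestion code and paths differ):

```python
import duckdb

# Hypothetical local DuckDB file holding the replicated catalog.
con = duckdb.connect("owid_catalog.duckdb")

# Copy the table into DuckDB's own storage format. Repeating this for
# every dataset in the catalog is the slow/large part discussed above.
con.execute("""
    CREATE OR REPLACE TABLE faostat_qcl AS
    SELECT * FROM read_parquet('catalog/garden/faostat/latest/faostat_qcl.parquet')
""")

print(con.execute("SELECT count(*) FROM faostat_qcl").fetchone())
```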

A lot could be optimised, though I'm wondering, from a more philosophical perspective, whether we should invest time in that or consider fetching the data directly from S3. Perhaps fetching from S3/R2 is the future and we should go that way?

Might be worth checking how fast and feasible this is.
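
One way to check feasibility would be to query the parquet files in place with DuckDB's httpfs extension; a rough sketch, with bucket name, endpoint and paths made up for illustration:

```python
import duckdb

con = duckdb.connect()  # in-memory, nothing is replicated locally

# httpfs lets DuckDB read parquet directly from S3-compatible storage (including R2).
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Hypothetical R2 endpoint and credentials -- adjust to the real bucket.
con.execute("SET s3_endpoint='<account-id>.r2.cloudflarestorage.com'")
con.execute("SET s3_access_key_id='...'")
con.execute("SET s3_secret_access_key='...'")

# Query one table without downloading the whole file first; DuckDB issues
# range requests and only fetches the columns/row groups it needs.
df = con.execute("""
    SELECT country, year, value
    FROM read_parquet('s3://owid-catalog/garden/faostat/latest/faostat_qcl.parquet')
    WHERE country = 'France'
""").df()
print(df.head())
```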

Marigold commented 2 years ago

We don't necessarily have to fetch from S3; we could instead use the local ETL catalog with prebuilt parquet files. (I wonder how that would perform against S3/R2 with a colocated server.) cc @danyx23
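
A sketch of that prebuilt-parquet idea, assuming the ETL catalog is checked out locally with parquet files alongside the feather ones (layout is illustrative): DuckDB can expose them as views with no ingestion step, so there is nothing to replicate and queries still run against columnar data.

```python
import duckdb

con = duckdb.connect()  # in-memory; the parquet files stay the only copy of the data

# Expose a locally checked-out catalog table (hypothetical path) as a view.
con.execute("""
    CREATE VIEW faostat_qcl AS
    SELECT * FROM read_parquet('data/garden/faostat/latest/faostat_qcl.parquet')
""")

# Same query as in the replicated setup, but with zero upfront load time,
# which makes it easy to benchmark against S3/R2 fetching.
print(con.execute("SELECT count(*) FROM faostat_qcl").fetchone())
```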

larsyencken commented 2 years ago

Yeah, I've been thinking about this too. The DuckDB ingestion is suuuper slow, considering everything is already in a columnar format. Part of that seems to be the network requests to fetch the data (we could use local files), and part is the format conversion.

When I did early experiments with DuckDB, I batch-converted feather files to parquet and registered them with an in-memory DuckDB connection, rather than consolidating them into a new DuckDB database.
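
For reference, that experiment looks roughly like this with pyarrow and the DuckDB Python API (file names are hypothetical):

```python
import duckdb
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Batch-convert a feather table to parquet (hypothetical paths).
table = feather.read_table("data/garden/faostat/latest/faostat_qcl.feather")
pq.write_table(table, "/tmp/faostat_qcl.parquet")

# Register the Arrow table directly with an in-memory DuckDB connection;
# no consolidated .duckdb database is ever written to disk.
con = duckdb.connect()
con.register("faostat_qcl", table)
print(con.execute("SELECT count(*) FROM faostat_qcl").fetchone())
```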

I think we already want to be co-publishing data in multiple formats from the ETL (feather, CSV, parquet). If we had the parquet files available this way, consuming them here would be much easier.
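
A trivial sketch of what co-publishing at the end of an ETL step could look like (columns and file names are made up):

```python
import pandas as pd

# Hypothetical output of one ETL step.
df = pd.DataFrame({"country": ["France"], "year": [2020], "value": [1.0]})

# Write the same table in all three formats side by side, so downstream
# consumers (including this API) can pick whichever suits them.
df.to_feather("faostat_qcl.feather")
df.to_csv("faostat_qcl.csv", index=False)
df.to_parquet("faostat_qcl.parquet")
```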