owid / data-api

API for accessing data from our data catalog.
MIT License

feature: switch from duckdb to parquet files #15

Closed. Marigold closed this 2 years ago.

Marigold commented 2 years ago

Instead of loading data from the remote catalog into DuckDB and storing it locally, use parquet files from the local catalog. This results in a much smaller database and less load.

The reason it uses the local catalog instead of the remote one is that DuckDB 0.4.0 fails when reading remote parquet files, e.g.

import duckdb

con = duckdb.connect("../duck.db")
path = "https://owid-catalog.nyc3.digitaloceanspaces.com/garden/ggdc/2020-10-01/ggdc_maddison/maddison_gdp.parquet"

# enable reading over HTTP(S)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# this read fails in DuckDB 0.4.0 when the parquet file is remote
df = con.execute(f"select * from read_parquet('{path}')").fetch_df()
df

so until 0.4.1 is released we'll have to use the local catalog.
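As a rough sketch of the local approach (the path below is hypothetical, not the actual catalog layout), the same query works fine when pointed at a parquet file on disk, with no persistent duck.db needed:

import duckdb

# hypothetical path to a parquet file in the locally built catalog
path = "../catalog/garden/ggdc/2020-10-01/ggdc_maddison/maddison_gdp.parquet"

# in-memory connection; query the file in place instead of copying it into a database
con = duckdb.connect()
df = con.execute(f"select * from read_parquet('{path}')").fetch_df()
df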

Marigold commented 2 years ago

I did a quick benchmark on the covid dataset. Fetching the entire dataset previously took 0.5s; it now takes 1s when read from the local parquet file. That's not bad and certainly worth the smaller database. Moreover, fetching data could be improved later, e.g. by redirecting to the parquet file instead of loading the data through the API and sending it back in the response in the same shape.
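A minimal sketch of that redirect idea, assuming a FastAPI-style endpoint (the route and URL pattern below are illustrative, not the actual API):

from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# hypothetical public base URL where published parquet files live
CATALOG_BASE = "https://owid-catalog.nyc3.digitaloceanspaces.com"

@app.get("/v1/dataset/{channel}/{namespace}/{version}/{dataset}/{table}.parquet")
def redirect_to_parquet(channel: str, namespace: str, version: str, dataset: str, table: str):
    # instead of reading the parquet file and re-serializing it in the response,
    # send the client straight to the file in the catalog
    url = f"{CATALOG_BASE}/{channel}/{namespace}/{version}/{dataset}/{table}.parquet"
    return RedirectResponse(url=url, status_code=307)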

The new DB also comes out at 1.4GB (covering all garden & backport datasets), and crawling backport and garden takes 8 minutes.

Marigold commented 2 years ago

@danyx23 did you have a chance to look at this? Do you think it is good to go? (it's going to change considerably in the future anyway once we start fetching metadata from MySQL)

danyx23 commented 2 years ago

@Marigold sorry, I forgot to reply here! Yes, I wrote in Slack last week that both your change and Lars' GitHub repo with the catalog work well together. I think we could switch to this. Crawl time should be much shorter and we already have the parquet files around on the analytics server. So yes, good to go IMHO!