Marigold closed this 2 years ago
I did a quick benchmark on the covid dataset. Fetching the entire dataset previously took 0.5s; it now takes 1s when fetched from the local parquet file. That's not bad, and certainly worth having the smaller database. Moreover, fetching data could later be improved (e.g. by redirecting to the parquet file instead of loading the data via the API and sending it back in the response in the same shape).
The new DB is also 1.4GB in size (built from all garden & backport datasets), and crawling backport and garden takes 8 mins.
@danyx23 did you have a chance to look at this? Do you think it is good to go? (it's going to change considerably in the future anyway once we start fetching metadata from MySQL)
@Marigold sorry, forgot to reply here! Yes, I wrote in Slack last week that both your change and Lars's GitHub repo with the catalog work well together. I think we could switch to this. Crawl time should be much shorter, and we already have the parquet files around on the analytics server. So yes, good to go IMHO!
Instead of loading data from the remote catalog into DuckDB and storing it locally, use the parquet files from the local catalog. This results in a much smaller database and less load.
The reason it uses the local catalog instead of the remote one is that DuckDB in version 0.4.0 fails when reading remote parquet files, e.g.
So until 0.4.1 is released we'll have to use the local catalog.