opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
73 stars 16 forks source link

Investigate using duckdb for rolling window queries #2379

Open ravenac95 opened 1 month ago

ravenac95 commented 1 month ago

What is it?

Mostly retroactive as this was a random hunch that seems to be working. Basically the thought was that for any rolling window query we could actually have duckdb load the dependent tables into memory and then run the queries. We could even get fairly smart with this if we are able to target specific partitions as well if things got large enough. The reason I wanted to try this is because it seemed that because of the way SQLMesh was scheduling runs the cache for trino never really got fully warm as it was running things so things ended up being slow.

ravenac95 commented 1 month ago

At the current rate of completion for the sqlmesh processing this is set to finish within ~20 hours. Compared to running this with trino which had issues with completing at all due to some issues with iceberg and, my hunch, with the delete/insert process that occurs with sqlmesh on our rolling window queries. I will need to look into what batch writing into iceberg looks like with this duckdb process.