polarsignals / frostdb

❄️ Coolest database around 🧊 Embeddable column database written in Go.
Apache License 2.0

Iceberg storage #824

Closed thorfour closed 4 months ago

thorfour commented 4 months ago

This adds a new storage implementation based on Apache Iceberg.

The default storage for FrostDB performs a linear scan on all data files. This means we perform at least O(N) reads just for the query planning step.

Iceberg allows us to perform query planning down to the data file with O(1) reads. It extracts and groups metadata into a set of manifest files that allow us to perform fewer reads to determine which data files may contain useful data.

This implementation currently performs data file pruning only at the manifest layer and does not yet support pruning at the manifest list layer. That will be a future addition once table partitioning is supported in the iceberg-go library. However, being able to prune at the manifest layer should already be a marked improvement.
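To illustrate the idea of pruning at the manifest layer, here is a minimal sketch. It is not the code in this PR and not the iceberg-go API: `DataFileEntry`, `Manifest`, and `PruneDataFiles` are hypothetical names, and the manifest is assumed to carry min/max bounds for a single int64 column per data file.

```go
package main

import "fmt"

// DataFileEntry is a hypothetical, simplified manifest entry: the path of one
// data file plus the min/max of a single int64 column (e.g. a timestamp).
type DataFileEntry struct {
	Path     string
	Min, Max int64
}

// Manifest is a hypothetical stand-in for a manifest file: a single read
// yields the column bounds of every data file it tracks.
type Manifest struct {
	Entries []DataFileEntry
}

// PruneDataFiles returns the paths of the data files whose [Min, Max] range
// overlaps the query range [lo, hi]. Files outside the range are skipped
// without ever being opened.
func PruneDataFiles(m Manifest, lo, hi int64) []string {
	var paths []string
	for _, e := range m.Entries {
		if e.Max < lo || e.Min > hi {
			continue // no overlap: this file cannot contain matching rows
		}
		paths = append(paths, e.Path)
	}
	return paths
}

func main() {
	m := Manifest{Entries: []DataFileEntry{
		{Path: "data/a.parquet", Min: 0, Max: 99},
		{Path: "data/b.parquet", Min: 100, Max: 199},
		{Path: "data/c.parquet", Min: 200, Max: 299},
	}}
	// Only files overlapping [150, 250] need to be read.
	fmt.Println(PruneDataFiles(m, 150, 250)) // [data/b.parquet data/c.parquet]
}
```

The point of the sketch is that the planner reads one manifest instead of opening every data file to inspect its bounds.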

In addition to adding support for pruning at the manifest list level, future support could be added for utilities to prune/cleanup tables. As new snapshots are created, old snapshots and their metadata files are left around.
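A cleanup utility of that kind might start by deciding which snapshots to retain. The sketch below is one assumed policy (keep the newest `keep` snapshots); the `Snapshot` type and `ExpireSnapshots` function are hypothetical and are not part of this PR or of iceberg-go.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical, simplified view of a table snapshot.
type Snapshot struct {
	ID           int64
	TimestampMs  int64
	ManifestList string // path of the snapshot's manifest list file
}

// ExpireSnapshots splits snaps into the newest `keep` snapshots and the rest.
// The input slice is not modified.
func ExpireSnapshots(snaps []Snapshot, keep int) (kept, expired []Snapshot) {
	sorted := append([]Snapshot(nil), snaps...)
	// Newest first.
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].TimestampMs > sorted[j].TimestampMs
	})
	if keep < 0 {
		keep = 0
	}
	if keep > len(sorted) {
		keep = len(sorted)
	}
	return sorted[:keep], sorted[keep:]
}

func main() {
	snaps := []Snapshot{
		{ID: 1, TimestampMs: 100, ManifestList: "metadata/snap-1.avro"},
		{ID: 2, TimestampMs: 200, ManifestList: "metadata/snap-2.avro"},
		{ID: 3, TimestampMs: 300, ManifestList: "metadata/snap-3.avro"},
	}
	_, expired := ExpireSnapshots(snaps, 1)
	for _, s := range expired {
		fmt.Println("candidate for cleanup:", s.ManifestList)
	}
}
```

A real utility would also have to confirm that a metadata or data file is not referenced by any retained snapshot before deleting it; the sketch only selects candidates.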

asubiotto commented 4 months ago

Also, what is the long-term plan for our iceberg-go fork? Are we eventually going to upstream the changes? From a cursory view, it seems like there's some stuff on the roadmap that might be interesting for us (e.g. scan planning).

thorfour commented 4 months ago

> Also, what is the long-term plan for our iceberg-go fork? Are we eventually going to upstream the changes? From a cursory view, it seems like there's some stuff on the roadmap that might be interesting for us (e.g. scan planning).

In the medium term we'll definitely continue our fork, as right now the Apache repo doesn't seem to be under active development (the most recent commit was 2 months ago), and ours has diverged quite a bit already.

Long term, yes, it would be nice if the Apache repo implemented a lot of the things we need, but I have no idea what that timeline looks like.