Several Issues on Iceberg

jicanghaixb commented 4 months ago

Previously, I used a bucket component mechanism similar to Thanos to synchronize all block directories. Before querying, I filtered the block directories based on the timestamp column, and I used the same mechanism to periodically age out blocks. I have now found that frostdb supports the more advanced Iceberg format. I want to switch to frostdb's Iceberg mechanism, but I have run into several issues:

Issue 1: Can we support "double"? float64 columns are used quite frequently, but the schema conversion currently handles only "long" and "binary":


func icebergSchemaToParquetSchema(schema *iceberg.Schema) *parquet.Schema {
    g := parquet.Group{}
    for _, f := range schema.Fields() {
        switch f.Type.Type() {
        case "long":
            g[f.Name] = parquet.Int(64)
        case "binary":
            g[f.Name] = parquet.String()
        }
    }
    return parquet.NewSchema("iceberg", g)
}
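To make the request concrete, here is a minimal, standard-library-only sketch of the type mapping with a "double" case added. It is not frostdb's code: `parquetTypeFor` is a hypothetical helper, and it returns Parquet physical type names as plain strings instead of building parquet-go nodes. Unlike the function quoted above, which silently drops any field whose type is not listed, this sketch reports unsupported types to the caller.

```go
package main

import "fmt"

// parquetTypeFor is a hypothetical helper that maps an Iceberg primitive
// type name to the name of a Parquet physical type. The second return
// value is false when the Iceberg type has no mapping in this sketch.
func parquetTypeFor(icebergType string) (string, bool) {
	switch icebergType {
	case "long":
		return "INT64", true
	case "double":
		// The case requested in issue 1: float64 columns.
		return "DOUBLE", true
	case "binary":
		// The quoted code maps this to a string column, which Parquet
		// stores as BYTE_ARRAY.
		return "BYTE_ARRAY", true
	default:
		return "", false
	}
}

func main() {
	for _, t := range []string{"long", "double", "binary", "date"} {
		if p, ok := parquetTypeFor(t); ok {
			fmt.Printf("%s -> %s\n", t, p)
		} else {
			fmt.Printf("%s -> unsupported\n", t)
		}
	}
}
```

In the real conversion function the new case would presumably assign a parquet-go double leaf node to `g[f.Name]`; the exact call is left out here because it depends on the parquet-go version in use.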

Issue 2: Can aging of Parquet data files be supported? Currently, every time a Parquet data file is created, a new snapshot is generated. Although snapshots support the expireSnapshotsOlderThan parameter, as far as I can tell it only makes newly written snapshot files stop including the old snapshots; it does not delete the old snapshots themselves or their data files.

thorfour commented 4 months ago

Hey! Glad you're excited to use Iceberg too! This is still my active project, so there's still work to be done before I'd recommend it for production use. That said, as for the issues you've raised:

1) This is supported now with the latest changes I've made in https://github.com/polarsignals/frostdb/pull/839

2) Yes! Iceberg maintenance is what I'm currently working on. I have expiry of metadata and snapshots working, but I also need to implement cleaning of orphaned files as well as a way to age out data. So this is coming!

jicanghaixb commented 4 months ago

Thank you for your reply, and excellent work! frostdb's Iceberg support now filters on manifests and manifest entries; it would be even better if row-group filtering could be enhanced. For example, consider an unordered, high-cardinality column. If filtering is done only through upper and lower bounds, many useless row groups will still be read. frostdb supports column bloom filters, but only when the column is also set as a sorting column, and sorting a high-cardinality column may consume a lot of resources. Could sorting and bloom filters be decoupled in the schema, so that one does not require the other?
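The difference between bounds-based and bloom-based row-group pruning can be illustrated with a small standard-library sketch. Everything here is hypothetical (`bloom`, `rowGroup`, and their methods are made up for the example) and is far simpler than Parquet's actual split-block bloom filters. The point is only that the filter is built from unsorted values.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny illustrative Bloom filter, not frostdb's or Parquet's.
type bloom struct {
	bits []bool
	k    uint32 // number of bit positions set per value
}

func newBloom(m int, k uint32) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// positions derives k bit positions from two FNV hashes (double hashing).
func (b *bloom) positions(v string) []uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(v))
	h2 := fnv.New32()
	h2.Write([]byte(v))
	a, c := h1.Sum32(), h2.Sum32()|1
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (a + i*c) % uint32(len(b.bits))
	}
	return out
}

func (b *bloom) add(v string) {
	for _, p := range b.positions(v) {
		b.bits[p] = true
	}
}

// mayContain never returns false for a value that was added, but it can
// return true for a value that was not (a false positive).
func (b *bloom) mayContain(v string) bool {
	for _, p := range b.positions(v) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

// rowGroup keeps min/max bounds plus a bloom filter for one column.
type rowGroup struct {
	min, max string
	filter   *bloom
}

// newRowGroup builds the statistics from values in arbitrary order.
func newRowGroup(values []string) rowGroup {
	rg := rowGroup{filter: newBloom(1024, 3)}
	for i, v := range values {
		if i == 0 || v < rg.min {
			rg.min = v
		}
		if i == 0 || v > rg.max {
			rg.max = v
		}
		rg.filter.add(v)
	}
	return rg
}

// boundsMayMatch is the bounds-only check: true whenever v lies in [min, max].
func (rg rowGroup) boundsMayMatch(v string) bool { return rg.min <= v && v <= rg.max }

func main() {
	rg := newRowGroup([]string{"user-9", "user-1", "user-5"}) // unsorted input
	q := "user-3"                                             // inside the bounds, but absent
	fmt.Println("bounds say read:", rg.boundsMayMatch(q))
	fmt.Println("bloom says read:", rg.filter.mayContain(q))
}
```

For a value that falls inside the bounds but is not in the row group, the bounds check always says the row group must be read, while the bloom filter will usually rule it out. Nothing in the filter's construction depends on sort order, which is why decoupling the two settings seems feasible in principle.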

thorfour commented 4 months ago

Block expiry has been implemented with the above PR to add a maintenance ability to the Iceberg storage engine.

polarsignals / frostdb

Several Issues on Iceberg #838