polarsignals / frostdb

❄️ Coolest database around 🧊 Embeddable column database written in Go.
Apache License 2.0

L1 arrow compaction #433

Open thorfour opened 1 year ago

thorfour commented 1 year ago

It may be useful to have the option to compact L0 arrow records into L1 arrow records instead of Parquet.

thorfour commented 1 year ago

This may only be worth pursuing once run-end encoding (REE) support has landed in FrostDB and the record sorting implementation (https://github.com/apache/arrow/pull/34719) is complete.

asubiotto commented 1 year ago

Agreed. I think moving to arrow-only in-memory storage would be the last step this quarter.

gernest commented 8 months ago

I have been thinking about this. Is it the same as arrowutils.MergeRecords(arrow_parts...) |> arrowutils.SortRecord |> parts.NewArrowPart?

asubiotto commented 8 months ago

Yes, although given the arrow parts should be merged on input, there probably isn't a need for the downstream sort. I'd also be interested in getting some L0 to L1 stats on how much memory we save through arrow compaction vs parquet compaction.
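
To illustrate why the downstream sort may be unnecessary, here is a minimal sketch of a k-way merge, assuming each input part is already sorted on the sort key. It works on plain int64 slices with the standard library only; `cursor`, `minHeap`, and `mergeSorted` are hypothetical names for illustration and are not FrostDB's `arrowutils` API.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position within one already-sorted input run.
type cursor struct {
	run []int64
	pos int
}

// minHeap orders cursors by the value each one currently points at.
type minHeap []*cursor

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].run[h[i].pos] < h[j].run[h[j].pos] }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	c := old[n-1]
	*h = old[:n-1]
	return c
}

// mergeSorted performs a k-way merge of runs that are each sorted ascending.
// The output is sorted by construction, so no separate sort pass is needed.
func mergeSorted(runs ...[]int64) []int64 {
	total := 0
	h := &minHeap{}
	for _, r := range runs {
		total += len(r)
		if len(r) > 0 {
			*h = append(*h, &cursor{run: r})
		}
	}
	heap.Init(h)

	out := make([]int64, 0, total)
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current value
		out = append(out, c.run[c.pos])
		c.pos++
		if c.pos == len(c.run) {
			heap.Pop(h) // this run is exhausted
		} else {
			heap.Fix(h, 0) // restore heap order after advancing
		}
	}
	return out
}

func main() {
	merged := mergeSorted([]int64{1, 4, 9}, []int64{2, 3, 10}, []int64{5})
	fmt.Println(merged) // prints [1 2 3 4 5 9 10]
}
```

If the inputs were not sorted, the merged output would not be sorted either, and a sort step would still be required.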

gernest commented 8 months ago

@asubiotto can you expand a bit on the expected memory difference between arrow and parquet compaction?

I was always under the impression that parquet+compression gives better memory savings than arrow.

asubiotto commented 8 months ago

Yes, this is why I'd be interested in getting some numbers so we are informed about the tradeoffs. Intuitively, dictionary encoding should go a long way. We've also been thinking about experimenting with run end encoding in arrow.
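
As a rough illustration of the two encodings mentioned here, the sketch below applies dictionary encoding and run-end encoding to a small string column. It is a toy on plain Go slices and does not use the Arrow Go API or measure real memory usage; `dictEncode` and `runEndEncode` are hypothetical helper names, and the sample column is made up.

```go
package main

import "fmt"

// dictEncode replaces each value with an index into a table of distinct values.
func dictEncode(col []string) (dict []string, indices []int32) {
	seen := make(map[string]int32)
	indices = make([]int32, 0, len(col))
	for _, v := range col {
		idx, ok := seen[v]
		if !ok {
			idx = int32(len(dict))
			seen[v] = idx
			dict = append(dict, v)
		}
		indices = append(indices, idx)
	}
	return dict, indices
}

// runEndEncode collapses each run of consecutive equal values into a single
// value plus the exclusive logical index at which that run ends.
func runEndEncode(col []string) (values []string, runEnds []int32) {
	for i, v := range col {
		if i > 0 && v == col[i-1] {
			runEnds[len(runEnds)-1] = int32(i + 1) // extend the current run
			continue
		}
		values = append(values, v)
		runEnds = append(runEnds, int32(i+1))
	}
	return values, runEnds
}

func main() {
	col := []string{"cpu", "cpu", "cpu", "memory", "memory", "cpu"}

	dict, indices := dictEncode(col)
	fmt.Println(dict, indices) // prints [cpu memory] [0 0 0 1 1 0]

	values, runEnds := runEndEncode(col)
	fmt.Println(values, runEnds) // prints [cpu memory cpu] [3 5 6]
}
```

Dictionary encoding stores each distinct string once regardless of order, while run-end encoding only pays off when equal values are adjacent, which is more likely once records are sorted.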

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

asubiotto commented 4 months ago

I think it's still useful to keep this open.
