risingwavelabs / risingwave


Design efficient in-memory row format #3150

Closed TennyZhuang closed 2 years ago

TennyZhuang commented 2 years ago

Memory usage in the state is not cost-efficient. Currently, we store Row in the state, which is an alias of Vec<Datum>.

Vec<Datum> is wasteful:

  1. A Datum costs 32 bytes, while an INT needs only 4 bytes.
  2. We can use a bitmap to store the nulls, so that one byte can represent 8 fields. What's more, non-nullable fields don't need any space.
  3. For var-length types or nested types, there will be many allocations, which can be reduced to only one.
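A rough sketch of the arithmetic behind points 1 and 2. The ScalarImpl below is a hypothetical three-variant stand-in, not the real definition, so its exact size will differ from the 32 bytes quoted above; the point is only that every slot pays for the largest variant.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the real scalar enum (assumption: the real one
// has many more variants). Every value is as large as the biggest variant.
#[allow(dead_code)]
pub enum ScalarImpl {
    Int32(i32),
    Int64(i64),
    Utf8(String),
}

pub type Datum = Option<ScalarImpl>;
pub type Row = Vec<Datum>;

/// Bytes taken by the inline `Datum` slots of a row. This ignores the `Vec`
/// header and any extra heap allocations made by var-length values.
pub fn datum_row_bytes(row: &Row) -> usize {
    row.len() * size_of::<Datum>()
}

/// Bytes needed for `num_fields` nullable INT fields in a packed layout:
/// one null bit per field (rounded up to whole bytes) plus 4 bytes each.
pub fn packed_int_row_bytes(num_fields: usize) -> usize {
    let bitmap = (num_fields + 7) / 8;
    bitmap + num_fields * size_of::<i32>()
}
```

For a row of 8 nullable INT fields, the packed layout needs 1 bitmap byte plus 32 data bytes, while the Vec<Datum> version pays size_of::<Datum>() for each of the 8 slots.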

In fact, the rows in one state will always have the same format, so we can significantly reduce the memory cost by introducing some schemaless memory format.

There are some requirements for the format:

  1. Doesn't need to be decoded before use.
  2. Can be referenced by field without copying the field's data.
  3. Friendly to schema changes (we can reserve a simple version header and leave the problem for later).
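One possible shape for a format meeting these requirements, as a minimal sketch only. All names here (FieldType, Value, Schema, CompactRow) and the byte layout are assumptions made for illustration, not the design that was adopted. The row lives in a single buffer with a version byte, a null bitmap, fixed-width slots, and a trailing var-length area; fields are read in place by offset.

```rust
/// Hypothetical field types; a real type system would be much richer.
#[derive(Clone, Copy)]
pub enum FieldType {
    Int32,
    Varchar,
}

impl FieldType {
    /// Fixed slot width. A Varchar slot holds (offset: u32, len: u32).
    fn slot_width(self) -> usize {
        match self {
            FieldType::Int32 => 4,
            FieldType::Varchar => 8,
        }
    }
}

/// Hypothetical input value used only when building a row.
pub enum Value<'a> {
    Null,
    Int32(i32),
    Varchar(&'a str),
}

/// Per-state schema: stored once, not once per row.
pub struct Schema {
    types: Vec<FieldType>,
    slot_offsets: Vec<usize>,
    fixed_end: usize,
}

/// Layout: [version: 1 byte][null bitmap][fixed slots][var-length bytes]
pub struct CompactRow {
    buf: Vec<u8>,
}

fn read_u32(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([b[off], b[off + 1], b[off + 2], b[off + 3]])
}

impl Schema {
    pub fn new(types: Vec<FieldType>) -> Self {
        let mut off = 1 + (types.len() + 7) / 8; // version byte + bitmap
        let mut slot_offsets = Vec::with_capacity(types.len());
        for t in &types {
            slot_offsets.push(off);
            off += t.slot_width();
        }
        Schema { types, slot_offsets, fixed_end: off }
    }

    /// Builds the whole row in one buffer.
    pub fn encode(&self, values: &[Value]) -> CompactRow {
        assert_eq!(values.len(), self.types.len());
        let mut buf = vec![0u8; self.fixed_end];
        buf[0] = 1; // format version, reserved for future schema changes
        for (i, v) in values.iter().enumerate() {
            let off = self.slot_offsets[i];
            match (self.types[i], v) {
                // A set bit marks the field as null; its slot stays zeroed.
                (_, Value::Null) => buf[1 + i / 8] |= 1 << (i % 8),
                (FieldType::Int32, Value::Int32(x)) => {
                    buf[off..off + 4].copy_from_slice(&x.to_le_bytes())
                }
                (FieldType::Varchar, Value::Varchar(s)) => {
                    let start = buf.len() as u32;
                    buf[off..off + 4].copy_from_slice(&start.to_le_bytes());
                    buf[off + 4..off + 8].copy_from_slice(&(s.len() as u32).to_le_bytes());
                    buf.extend_from_slice(s.as_bytes());
                }
                _ => panic!("value does not match schema"),
            }
        }
        CompactRow { buf }
    }
}

impl CompactRow {
    pub fn byte_len(&self) -> usize {
        self.buf.len()
    }

    fn is_null(&self, i: usize) -> bool {
        (self.buf[1 + i / 8] & (1 << (i % 8))) != 0
    }

    /// Reads an INT in place. The caller must pass the index of an Int32 field.
    pub fn get_i32(&self, schema: &Schema, i: usize) -> Option<i32> {
        if self.is_null(i) {
            return None;
        }
        Some(read_u32(&self.buf, schema.slot_offsets[i]) as i32)
    }

    /// Borrows the string bytes from the row buffer without copying them.
    pub fn get_str<'a>(&'a self, schema: &Schema, i: usize) -> Option<&'a str> {
        if self.is_null(i) {
            return None;
        }
        let off = schema.slot_offsets[i];
        let start = read_u32(&self.buf, off) as usize;
        let len = read_u32(&self.buf, off + 4) as usize;
        std::str::from_utf8(&self.buf[start..start + len]).ok()
    }
}
```

With a schema of (Int32, Varchar, Int32), encoding (7, "hello", NULL) takes 23 bytes in one allocation, and get_str returns a slice pointing into the row buffer. This sketch still spends a bitmap bit on every field; skipping non-nullable fields would need nullability in the schema.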

I guess the value after the refactor will be large enough that indexmap may be helpful, but that needs a benchmark.

We can refer to FlatBuffers or the formats used in other databases; that needs investigation.

The value's encoding may be the same as #396, not sure.

fuyufjh commented 2 years ago

The word "schemaless" in the title scares me 🤪. Let me make it less ambiguous.

TennyZhuang commented 2 years ago

We have decided to use value encoding for now, so I'm closing this.