pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

perf: Reduce the size of row encoding UTF-8 #19911

Open coastalwhite opened 3 days ago

coastalwhite commented 3 days ago

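Previously, UTF-8 strings were row encoded and decoded with the variable row encoding. This PR instead relies on the fact that the byte 0xFF never occurs in valid UTF-8. A string with bytes b1, ..., bn is encoded as 0x02, b1 + 1, ..., bn + 1, 0x00. Because no input byte is 0xFF, adding 1 never overflows, and because every shifted byte is at least 0x01, the only 0x00 after the leading byte is the terminator, so decoding can simply scan for 0x00 to find the end of the value. Empty strings are encoded as the single byte 0x01 and nulls as the single byte 0x00. For descending order, every byte is bitwise inverted.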
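The scheme described above can be sketched in a few lines of Python. This is only an illustration of the byte layout as stated in this description, not the Rust code in this PR; the function names `encode_utf8_row` and `decode_utf8_row` are hypothetical, and the placement of nulls before other values is simply what falls out of the 0x00 encoding here.

```python
from typing import Optional, Tuple


def encode_utf8_row(value: Optional[str], descending: bool = False) -> bytes:
    """Encode one string value using the layout described in this PR (sketch)."""
    if value is None:
        out = b"\x00"  # null
    elif value == "":
        out = b"\x01"  # empty string
    else:
        raw = value.encode("utf-8")
        # 0xFF never occurs in valid UTF-8, so b + 1 always fits in one byte
        # and is never 0x00, which keeps the 0x00 terminator unambiguous.
        out = b"\x02" + bytes(b + 1 for b in raw) + b"\x00"
    if descending:
        out = bytes(b ^ 0xFF for b in out)  # bitwise inversion of every byte
    return out


def decode_utf8_row(buf: bytes, descending: bool = False) -> Tuple[Optional[str], int]:
    """Decode one value from the start of buf; return (value, bytes consumed)."""
    mask = 0xFF if descending else 0x00
    tag = buf[0] ^ mask
    if tag == 0x00:
        return None, 1
    if tag == 0x01:
        return "", 1
    if tag != 0x02:
        raise ValueError("unexpected leading byte")
    # The terminator is 0x00 when ascending and 0xFF when descending.
    end = buf.index(bytes([mask]), 1)
    raw = bytes((b ^ mask) - 1 for b in buf[1:end])
    return raw.decode("utf-8"), end + 1
```

In this sketch a non-empty string of n UTF-8 bytes occupies n + 2 bytes, so "a" encodes to the 3 bytes 0x02, 0x62, 0x00, and comparing the encoded bytes lexicographically gives the same order as comparing the original strings byte by byte.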

This is always a size improvement, and the savings are largest for short strings: a non-empty string of n bytes now takes n + 2 bytes. For example, encoding "a" went from 33 bytes to 3 bytes.

This is a continuation of #19874.