wanghenshui / wanghenshui.github.io

my blog, please do not fork
https://wanghenshui.github.io

german string #150

Open wanghenshui opened 2 hours ago

wanghenshui commented 2 hours ago

Apache Arrow (variable-size binary view layout):

* Short strings, length <= 12
  | Bytes 0-3  | Bytes 4-15                            |
  |------------|---------------------------------------|
  | length     | data (padded with 0)                  |

* Long strings, length > 12
  | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
  |------------|------------|------------|-------------|
  | length     | prefix     | buf. index | offset      |

`buf. index` says which data buffer holds the string; within that buffer, the bytes in [offset, offset + length) are the string.
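To make the two tables concrete, here is a minimal C++ sketch of building and resolving such a 16-byte view. The names `StringView16`, `make_view`, and `resolve` are made up for illustration and are not Arrow's API. It also assumes the data buffers are plain `std::string`s and that new long strings always go into the last buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view mirroring the tables above (not Arrow's own type).
struct StringView16 {
    uint32_t length;     // bytes 0-3
    char     bytes[12];  // bytes 4-15: inline data, or prefix + buf. index + offset
};
static_assert(sizeof(StringView16) == 16, "view must be 16 bytes");

// Build a view. Long strings are appended to the last buffer (an assumption
// made here to keep the sketch short).
inline StringView16 make_view(std::string_view s, std::vector<std::string>& buffers) {
    StringView16 v{};  // zero-initialised, so inline data is padded with 0
    v.length = static_cast<uint32_t>(s.size());
    if (s.size() <= 12) {
        if (!s.empty()) std::memcpy(v.bytes, s.data(), s.size());
        return v;
    }
    if (buffers.empty()) buffers.emplace_back();
    uint32_t buf_index = static_cast<uint32_t>(buffers.size() - 1);
    uint32_t offset    = static_cast<uint32_t>(buffers.back().size());
    buffers.back().append(s);
    std::memcpy(v.bytes,     s.data(),   4);  // prefix
    std::memcpy(v.bytes + 4, &buf_index, 4);  // buf. index
    std::memcpy(v.bytes + 8, &offset,    4);  // offset
    return v;
}

// Resolve a view to its bytes. For short strings the result points into `v`
// itself, so `v` must outlive the returned string_view.
inline std::string_view resolve(const StringView16& v,
                                const std::vector<std::string>& buffers) {
    if (v.length <= 12) return std::string_view(v.bytes, v.length);
    uint32_t buf_index = 0, offset = 0;
    std::memcpy(&buf_index, v.bytes + 4, 4);
    std::memcpy(&offset,    v.bytes + 8, 4);
    assert(buf_index < buffers.size());
    const std::string& buf = buffers[buf_index];
    assert(static_cast<std::size_t>(offset) + v.length <= buf.size());
    return std::string_view(buf).substr(offset, v.length);  // [offset, offset + length)
}
```

Because a long view stores an index and an offset rather than a raw pointer, the views stay valid if the buffers are moved or reallocated, as long as the buffer order is unchanged.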

https://pola.rs/posts/polars-string-type/

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

wanghenshui commented 2 hours ago

https://15721.courses.cs.cmu.edu/spring2023/slides/23-velox.pdf

wanghenshui commented 1 hour ago

There is also a more conventional design:

* Short strings, length <= 12
  | Bytes 0-3  | Bytes 4-15                            |
  |------------|---------------------------------------|
  | length     | data (padded with 0)                  |

* Long strings, length > 12
  | Bytes 0-3  | Bytes 4-7  | Bytes 8-15 |
  |------------|------------|------------|
  | length     | prefix     | ptr        |

This one is fairly intuitive. An equality check, for example:

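// Assumes data128_t holds the 16 bytes as two 64-bit words, v[0] and v[1],
// on a little-endian machine (length in the low 32 bits of v[0]).
bool isEqual(data128_t a, data128_t b) {
    // v[0] = length + prefix: one compare rejects most mismatches.
    if (a.v[0] != b.v[0]) return false;
    auto len = (uint32_t) a.v[0];
    // Short string: the remaining 8 bytes are inline and zero padded.
    if (len <= 12) return a.v[1] == b.v[1];
    // Long string: v[1] is a pointer to the full data.
    return memcmp((char*) a.v[1], (char*) b.v[1], len) == 0;
}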

For short strings this saves a lot: the whole comparison is two 64-bit compares, with no pointer dereference.
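A sketch of how such a value could be built so that a check like `isEqual` runs end to end. The shape of `data128_t` (two 64-bit words) and the helper `make_german` are assumptions, not taken from any of the linked implementations. `isEqual` is repeated only so the block compiles on its own. Long strings are not owned here; the caller keeps their storage alive.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Assumed shape: v[0] = length (low 32 bits) | prefix (high 32 bits),
// v[1] = remaining inline bytes or a pointer to the full string.
struct data128_t { uint64_t v[2]; };
static_assert(sizeof(data128_t) == 16, "must fit in 16 bytes");
static_assert(sizeof(void*) <= sizeof(uint64_t), "pointer must fit in v[1]");

// Hypothetical constructor; it does not copy or own long-string data.
inline data128_t make_german(std::string_view s) {
    uint32_t len = static_cast<uint32_t>(s.size());
    uint32_t prefix = 0;  // zero padded when len < 4
    if (len != 0) std::memcpy(&prefix, s.data(), len < 4 ? len : 4);
    data128_t d{};        // zero-initialised, so inline bytes are padded with 0
    d.v[0] = uint64_t{len} | (uint64_t{prefix} << 32);
    if (len <= 12) {
        if (len > 4) std::memcpy(&d.v[1], s.data() + 4, len - 4);
    } else {
        d.v[1] = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(s.data()));
    }
    return d;
}

inline bool isEqual(data128_t a, data128_t b) {
    if (a.v[0] != b.v[0]) return false;       // length or prefix differs
    auto len = static_cast<uint32_t>(a.v[0]);
    if (len <= 12) return a.v[1] == b.v[1];   // inline: no dereference
    auto pa = reinterpret_cast<const char*>(static_cast<uintptr_t>(a.v[1]));
    auto pb = reinterpret_cast<const char*>(static_cast<uintptr_t>(b.v[1]));
    return std::memcmp(pa, pb, len) == 0;     // long: compare the full data
}
```

Packing with shifts keeps `(uint32_t) v[0]` equal to the length on any byte order, but the in-memory bytes match the table above only on little-endian machines.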

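That said, std::string also has SSO (small string optimization) for short strings, but it lacks the prefix advantage and the advantage of being passed around as a 16-byte value.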
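The two advantages can be illustrated with a small sketch. The struct `GermanString` and the helpers `make_german_string` and `compare` are hypothetical names. The ordering check looks at the 4-byte prefix first and only touches the rest of the data on a tie. The struct is trivially copyable and 16 bytes, which on some 64-bit ABIs (System V x86-64, for example) allows it to be passed in two registers; other ABIs may pass it differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <type_traits>

// Hypothetical pointer-based layout with explicit fields.
struct GermanString {
    uint32_t len;
    char     prefix[4];      // first 4 bytes, zero padded
    union {
        char        rest[8]; // len <= 12: bytes 4..11, zero padded
        const char* ptr;     // len > 12: full string, not owned
    };
};
static_assert(sizeof(GermanString) == 16, "must fit in 16 bytes");
static_assert(std::is_trivially_copyable_v<GermanString>, "cheap to pass by value");

inline GermanString make_german_string(std::string_view s) {
    GermanString g{};  // zero-initialised
    g.len = static_cast<uint32_t>(s.size());
    if (g.len != 0) std::memcpy(g.prefix, s.data(), std::min<std::size_t>(g.len, 4));
    if (g.len > 12) g.ptr = s.data();
    else if (g.len > 4) std::memcpy(g.rest, s.data() + 4, g.len - 4);
    return g;
}

// Three-way byte-wise comparison: negative, zero, or positive.
inline int compare(const GermanString a, const GermanString b) {
    // Prefix first: most comparisons end here without following a pointer.
    if (int c = std::memcmp(a.prefix, b.prefix, 4)) return c;
    // Tie on the prefix: compare the bytes after it.
    std::size_t ta = a.len > 4 ? a.len - 4 : 0;
    std::size_t tb = b.len > 4 ? b.len - 4 : 0;
    const char* pa = a.len > 12 ? a.ptr + 4 : a.rest;
    const char* pb = b.len > 12 ? b.ptr + 4 : b.rest;
    if (int c = std::memcmp(pa, pb, std::min(ta, tb))) return c;
    // One string is a prefix of the other: the shorter one sorts first.
    return (a.len > b.len) - (a.len < b.len);
}
```

For example, `compare(make_german_string("apple"), make_german_string("banana"))` is decided by the prefix alone.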

wanghenshui commented 1 hour ago

DuckDB implementation: duckdb/blob/main/src/include/duckdb/common/types/string_type.hpp

wanghenshui commented 1 hour ago

https://cedardb.com/blog/german_strings/

https://cedardb.com/blog/strings_deep_dive/