wanghenshui / wanghenshui.github.io

my blog, please do not fork
https://wanghenshui.github.io

Exploring data encoding formats? #107

Open wanghenshui opened 6 months ago

wanghenshui commented 6 months ago

memcomparable vs kvrocks encoding vs pika encoding: need to put together a demo.

Metadata and the actual data are stored separately, so data locality may be poor.

BlobDB makes this locality problem even worse.
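A rough way to see the locality concern, as a minimal sketch only. The `M`/`D` prefixes, the `\x00`/`\x01` separators and every function name below are hypothetical and are not the actual kvrocks or pika key formats; the point is just to measure how far a hash's meta entry lands from its first field in a single sorted keyspace.

```python
def split_layout(hashes):
    """Hypothetical layout: meta keys and field keys use different prefixes."""
    keys = []
    for name, fields in hashes.items():
        keys.append(b"M" + name)                       # meta entry
        keys.extend(b"D" + name + b"\x00" + f for f in fields)
    return sorted(keys)


def colocated_layout(hashes):
    """Hypothetical layout: the meta entry sorts right before its fields."""
    keys = []
    for name, fields in hashes.items():
        keys.append(name + b"\x00")                    # meta entry
        keys.extend(name + b"\x01" + f for f in fields)
    return sorted(keys)


def gap(keys, meta, first_field):
    """Number of sorted positions between a meta entry and a field entry."""
    return abs(keys.index(first_field) - keys.index(meta))


hashes = {
    b"user:1": [b"age", b"name"],
    b"user:2": [b"age", b"name"],
    b"user:3": [b"age", b"name"],
}
split_gap = gap(split_layout(hashes), b"Muser:1", b"Duser:1\x00age")
colocated_gap = gap(colocated_layout(hashes), b"user:1\x00", b"user:1\x01age")
```

In the split layout the gap grows with the total number of entries, while in the colocated layout it stays at one position. A real demo would still have to measure block cache hits and read latency rather than key distance.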

Testing angles

wanghenshui commented 6 months ago

memcomparable itself also comes in two designs.

1 TiDB

[group1][marker1]…[groupN][markerN]

A group is an 8-byte slice of the input, zero-padded if it is short.

marker = 0xFF - number of padding zeros

Examples:

[] -> [0, 0, 0, 0, 0, 0, 0, 0, 247]

[1, 2, 3] -> [1, 2, 3, 0, 0, 0, 0, 0, 250]

[1, 2, 3, 0] -> [1, 2, 3, 0, 0, 0, 0, 0, 251]

[1, 2, 3, 4, 5, 6, 7, 8] -> [1, 2, 3, 4, 5, 6, 7, 8, 255, 0, 0, 0, 0, 0, 0, 0, 0, 247]
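The group/marker rule above can be sketched in a few lines. This is a minimal illustration written from the description and the four examples, not TiDB's actual code; `encode_bytes`, `decode_bytes` and the constant names are made up for the sketch.

```python
GROUP_SIZE = 8     # payload bytes per group
MARKER_MAX = 0xFF  # marker of a full group (no padding)


def encode_bytes(data: bytes) -> bytes:
    """Encode data as [group][marker] pairs so byte order matches input order."""
    out = bytearray()
    # Stepping up to len(data) inclusive guarantees a final padded group,
    # even for empty input or input whose length is a multiple of 8.
    for start in range(0, len(data) + 1, GROUP_SIZE):
        chunk = data[start:start + GROUP_SIZE]
        pad = GROUP_SIZE - len(chunk)
        out += chunk + b"\x00" * pad
        out.append(MARKER_MAX - pad)
    return bytes(out)


def decode_bytes(encoded: bytes) -> bytes:
    """Inverse of encode_bytes; stops at the first group that has padding."""
    out = bytearray()
    for start in range(0, len(encoded), GROUP_SIZE + 1):
        group = encoded[start:start + GROUP_SIZE]
        pad = MARKER_MAX - encoded[start + GROUP_SIZE]
        out += group[:GROUP_SIZE - pad]
        if pad > 0:
            break
    return bytes(out)


samples = [b"", bytes([1, 2, 3]), bytes([1, 2, 3, 0]), bytes(range(1, 9))]
encoded = [encode_bytes(s) for s in samples]
```

The marker is what keeps `[1, 2, 3]` and `[1, 2, 3, 0]` apart: both groups have identical bytes, but 250 sorts before 251, which matches the order of the raw inputs.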


https://haxisnake.github.io/2020/11/06/TIDB%E6%BA%90%E7%A0%81%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0-%E5%9F%BA%E6%9C%AC%E7%B1%BB%E5%9E%8B%E7%BC%96%E8%A7%A3%E7%A0%81%E6%96%B9%E6%A1%88/

2 MyRocks

To save space, the varchar type is considerably more complicated to handle. Take the comment in the source code as an example:

const int VARCHAR_CMP_LESS_THAN_SPACES = 1;
const int VARCHAR_CMP_EQUAL_TO_SPACES = 2;
const int VARCHAR_CMP_GREATER_THAN_SPACES = 3;

Example: if fpi->m_segment_size=5, and the collation is latin1_bin:

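'abcd\0' => [ 'abcd' <VARCHAR_CMP_LESS_THAN_SPACES> ][ '\0   ' <VARCHAR_CMP_EQUAL_TO_SPACES> ]
'abcd' => [ 'abcd' <VARCHAR_CMP_EQUAL_TO_SPACES> ]
'abcd ' => [ 'abcd' <VARCHAR_CMP_EQUAL_TO_SPACES> ]
'abcdZZZZ' => [ 'abcd' <VARCHAR_CMP_GREATER_THAN_SPACES> ][ 'ZZZZ' <VARCHAR_CMP_EQUAL_TO_SPACES> ]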

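Strings are stored in segments of m_segment_size bytes. The first m_segment_size-1 bytes of each segment hold content, and the last byte records how the rest of the string compares with spaces. VARCHAR_CMP_EQUAL_TO_SPACES also marks the end of the string.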

In the example m_segment_size is 5; in the actual implementation the value is 9.
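The segment scheme can be sketched as below. This is a simplified illustration that assumes a binary collation such as latin1_bin (plain byte comparison) and the example's segment size of 5; `pack_varchar` is a made-up name and not the MyRocks function.

```python
VARCHAR_CMP_LESS_THAN_SPACES = 1
VARCHAR_CMP_EQUAL_TO_SPACES = 2
VARCHAR_CMP_GREATER_THAN_SPACES = 3


def pack_varchar(value: bytes, segment_size: int = 5) -> bytes:
    """Pack value into space-padded segments, each ending in a compare byte."""
    chunk_len = segment_size - 1
    # Trailing spaces are not significant, so 'abcd ' packs like 'abcd'.
    value = value.rstrip(b" ")
    out = bytearray()
    pos = 0
    while True:
        chunk = value[pos:pos + chunk_len]
        pos += chunk_len
        rest = value[pos:]
        out += chunk.ljust(chunk_len, b" ")  # pad a short chunk with spaces
        if not rest:
            # Nothing left: the remainder equals spaces, and the string ends.
            out.append(VARCHAR_CMP_EQUAL_TO_SPACES)
            return bytes(out)
        # Compare what is left against the same number of spaces.
        if rest < b" " * len(rest):
            out.append(VARCHAR_CMP_LESS_THAN_SPACES)
        else:
            out.append(VARCHAR_CMP_GREATER_THAN_SPACES)


packed = [pack_varchar(v) for v in (b"abcd\x00", b"abcd", b"abcdZZZZ")]
```

Because the compare byte is 1, 2 or 3, `'abcd\0'` sorts before `'abcd'`, which sorts before `'abcdZZZZ'`, and a plain memcmp on the packed bytes gives the same order as a space-padded string comparison.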

The unpack_info here gets more complicated: it differs depending on the string's collation, and it has to store the conversion mapping between collations. See the function rdb_init_collation_mapping for details.



https://developer.aliyun.com/article/62648
wanghenshui commented 6 months ago

https://xie.infoq.cn/article/e444854844038cbbf06707e89

wanghenshui commented 6 months ago

https://github.com/apache/kvrocks/wiki/Kvrocks-%E8%AE%BE%E8%AE%A1%E4%B8%8E%E5%AE%9E%E7%8E%B0