Is your feature request related to a problem? Please describe.
In some scenarios, when the flush throughput is much smaller than the source throughput, there may be a large number of requests to hummock (either get or scan) and a large number of small files in L0. Merely merging the read results of the iterators over these small files makes the system slow.
But this is not necessary. For files in the overlapping level, we cache their data in the block cache, and every time we finish a checkpoint barrier we only add the files flushed by the CN itself to the local HummockVersion. So we can merge the data of these files directly rather than merging the read results.
Describe the solution you'd like
Previously, we added a new sub-level for every checkpoint barrier. Now, we keep using the original overlapping level and do not compact it to a non-overlapping level unless it grows too large.
For example, suppose we set the size limit of the overlapping level to 256MB and each checkpoint barrier only flushes 32MB of data to S3.
For the meta node, this means we would not trigger any compaction task until 8 checkpoints have been processed. Of course, during this time there would be only one sub-level in this compaction group.
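To make the example concrete, here is a minimal Rust sketch of the trigger condition, assuming a plain size threshold; the constant and function names are illustrative and are not actual config keys or APIs.

```rust
// A minimal sketch of the trigger condition described above, assuming a simple
// size threshold; the names below are illustrative, not real config keys.
const OVERLAPPING_LEVEL_SIZE_LIMIT: u64 = 256 * 1024 * 1024; // 256MB
const FLUSH_SIZE_PER_CHECKPOINT: u64 = 32 * 1024 * 1024; // 32MB per checkpoint barrier

fn should_trigger_compaction(overlapping_level_size: u64) -> bool {
    // With 32MB flushed per checkpoint, the limit is only reached after
    // 256MB / 32MB = 8 checkpoints, so no compaction task is triggered before that.
    overlapping_level_size >= OVERLAPPING_LEVEL_SIZE_LIMIT
}

fn main() {
    assert!(!should_trigger_compaction(7 * FLUSH_SIZE_PER_CHECKPOINT));
    assert!(should_trigger_compaction(8 * FLUSH_SIZE_PER_CHECKPOINT));
}
```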
For the compute node, every CN maintains a concurrent BTreeMap (or skip-list map) in memory and adds the flushed data to it. The VersionDelta tells the CN whether it needs to add the data to the in-memory map, drop it because it has already been compacted to a non-overlapping level, or switch to a new in-memory map because the current sub-level is too large.
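A possible shape of this logic on the CN side, as a hedged sketch: the `DeltaAction` enum and its variants are invented for illustration and do not correspond to the real VersionDelta protobuf.

```rust
use std::collections::BTreeMap;

// Hypothetical in-memory map; the real implementation may use a concurrent skip list.
type Memtable = BTreeMap<Vec<u8>, Vec<u8>>;

// Made-up representation of what a version delta tells the CN to do.
enum DeltaAction {
    /// The files stay in the overlapping level: merge the flushed key-value
    /// pairs into the in-memory map so reads can skip the iterator merge.
    AddToMemtable(Vec<(Vec<u8>, Vec<u8>)>),
    /// The data was already compacted to a non-overlapping level: it will be
    /// read from the SSTs there, so nothing is kept in memory.
    DropData,
    /// The current sub-level exceeded the size limit: start a fresh map.
    SwitchMemtable,
}

fn apply_version_delta(memtable: &mut Memtable, action: DeltaAction) {
    match action {
        DeltaAction::AddToMemtable(kvs) => memtable.extend(kvs),
        DeltaAction::DropData => {}
        // A real implementation would keep the old map readable until
        // in-flight reads on the old epoch finish; that is omitted here.
        DeltaAction::SwitchMemtable => *memtable = Memtable::new(),
    }
}

fn main() {
    let mut memtable = Memtable::new();
    apply_version_delta(
        &mut memtable,
        DeltaAction::AddToMemtable(vec![(b"key".to_vec(), b"value".to_vec())]),
    );
    assert_eq!(memtable.len(), 1);
}
```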
There are two ways to implement the memtable in the CN (a rough sketch of both options follows the list):
1. Maintain an in-memory map in each local hummock store. This requires some way to notify each local store to switch to a new memtable.
2. Maintain the in-memory map in the global version, which means all state tables share the same memtable.
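The difference between the two options could look roughly like this; the struct names below are made up for the sketch, and a concurrent skip list could stand in for the locked BTreeMap.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

// Illustrative-only structs contrasting the two options above; the names are
// invented for this sketch and do not come from the actual codebase.
type Memtable = BTreeMap<Vec<u8>, Vec<u8>>;

// Option 1: each local hummock store owns its own map, so every local store
// has to be notified individually when it should switch to a new memtable.
struct LocalHummockStore {
    memtable: Memtable,
}

// Option 2: the map lives in the global version and is shared by all state
// tables on the CN, e.g. behind Arc<RwLock<..>>; a concurrent skip list could
// replace the lock + BTreeMap combination to reduce write contention.
struct GlobalVersion {
    shared_memtable: Arc<RwLock<Memtable>>,
}

fn main() {
    let _local = LocalHummockStore { memtable: Memtable::new() };
    let _global = GlobalVersion { shared_memtable: Arc::new(RwLock::new(Memtable::new())) };
}
```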
Describe alternatives you've considered
We may need to implement a complex concurrent skip list.
The CN needs a version structure different from the protobuf structure to maintain the in-memory map.
Additional context
No response