tikv / tikv

Distributed transactional key-value database, originally created to complement TiDB
https://tikv.org
Apache License 2.0

Multibatch's write efficiency is much worse than unordered write under a particular workload #9561

Open gengliqi opened 3 years ago

gengliqi commented 3 years ago

Bug Report

What version of TiKV are you using?

4.0.8

What operating system and CPU are you using?

I don't think it matters.

Steps to reproduce

raftstore.apply-max-batch-size: 256
raftstore.apply-pool-size: 4
raftstore.apply-reschedule-duration: 200ms
raftstore.apply-yield-duration: 20ms
raftstore.hibernate-regions: true
raftstore.leader-transfer-max-log-lag: 256
raftstore.store-max-batch-size: 1024
raftstore.store-pool-size: 4
raftstore.store-reschedule-duration: 100ms

Attachment: datacreate.tar.gz. I use init_schema.sql to create the table. Then change the IP and port of TiDB in tidb.sh and run it.

Then I switch to unordered write.

rocksdb.enable-multi-batch-write: false
rocksdb.enable-pipelined-write: false
rocksdb.enable-unordered-write: true

I define write efficiency as write batch size / write duration, computed from the averaged metrics as follows:

avg(tikv_engine_bytes_per_write{instance=~"$instance", db="$db",type="bytes_per_write_average"}) by (instance) / avg(tikv_engine_write_micro_seconds{instance=~"$instance", db="$db",type="write_average"}) by (instance) * 1000000
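For readers who want to check the arithmetic outside Grafana, here is a minimal sketch of the same ratio in Python. The function name and the sample numbers are hypothetical and are not measurements from this issue; the only thing taken from the query above is the formula itself (average bytes per write divided by average write duration in microseconds, scaled by 1000000 to get bytes per second).

```python
def write_efficiency(bytes_per_write_avg: float, write_micros_avg: float) -> float:
    """Return write efficiency in bytes per second of write time.

    Mirrors the query above: bytes_per_write_average / write_average * 1000000,
    where write_average is in microseconds.
    """
    if write_micros_avg <= 0:
        raise ValueError("average write duration must be positive")
    return bytes_per_write_avg / write_micros_avg * 1_000_000


# Hypothetical sample values (NOT taken from the issue's graphs):
# same batch size, but one mode takes twice as long per write.
slow_mode = write_efficiency(bytes_per_write_avg=4096, write_micros_avg=200)
fast_mode = write_efficiency(bytes_per_write_avg=4096, write_micros_avg=100)

print(f"slow mode: {slow_mode:.0f} B/s, fast mode: {fast_mode:.0f} B/s")
```

With equal batch sizes, the ratio depends only on write duration, so halving the duration doubles the efficiency.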

I find that multibatch's write efficiency is much worse than unordered write's in the kv db.

image

yiwu-arbug commented 3 years ago

@gengliqi can you post the write batch size and write duration metrics separately? I want to see more details of what affects the write efficiency you defined.

gengliqi commented 3 years ago

> @gengliqi can you post the write batch size and write duration metrics separately? I want to see more details of what affects the write efficiency you defined.

OK.

multibatch

write efficiency: avg write batch size / avg write duration

image

avg write batch size

image

avg write duration

image

unordered write

write efficiency: avg write batch size / avg write duration

image

avg write batch size

image

avg write duration

image

yiwu-arbug commented 3 years ago

@gengliqi thanks.

It may be easier to read if the metrics are averaged over a wider time window, but it looks like the write batch sizes are similar. My wild guess would be that multibatch write takes more time on synchronization, but we may need some profiling to confirm.

gengliqi commented 3 years ago

@yiwu-arbug @Connor1996 @Little-Wallace I tested pipelined write. It seems multibatch write is a little better than pipelined write.

pipelined write

write efficiency: avg write batch size / avg write duration

image

avg write batch size

image

avg write duration

image

Connor1996 commented 3 years ago

This is as expected. It seems multi-batch write works well; the gap is mainly because unordered_write has a better write duration, which is also expected.

BusyJay commented 3 years ago

From the Raft perspective, it only needs a consistent RocksDB view when creating a snapshot. With #4379, we can ensure the RocksDB snapshot is immutable even under unordered write.

From the transaction perspective, as long as two-phase commit is still used and prewrite and commit are not in the same write batch, non-atomic writes and mutable snapshots should not affect the correctness of transaction snapshots.

From the raw KV perspective, it can still produce incorrect results because a write batch is no longer atomic, though we have never promised that write requests are atomic.

At least, it seems safe for TiKV to enable unordered writes if it's for TiDB usage only?
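To make the raw KV concern concrete, here is a toy sketch in Python. It is not RocksDB's or TiKV's implementation: the store is a plain dict, and `apply_non_atomic`, `apply_atomic`, and `observed_partial` are hypothetical helpers invented for illustration. It only shows what "write batch is not atomic" means for a reader: some keys of a batch can be visible while others are not.

```python
from typing import Dict, Iterator, List, Tuple

WriteBatch = List[Tuple[str, str]]


def apply_non_atomic(store: Dict[str, str], batch: WriteBatch) -> Iterator[Dict[str, str]]:
    """Apply a batch key by key, yielding the view a reader could see after each step."""
    for key, value in batch:
        store[key] = value
        yield dict(store)


def apply_atomic(store: Dict[str, str], batch: WriteBatch) -> Iterator[Dict[str, str]]:
    """Stage the whole batch first, then publish it in one step."""
    staged = dict(store)
    for key, value in batch:
        staged[key] = value
    store.clear()
    store.update(staged)
    yield dict(store)


def observed_partial(views: List[Dict[str, str]], batch: WriteBatch) -> bool:
    """Return True if any view contains some, but not all, of the batch's writes."""
    for view in views:
        applied = sum(1 for key, value in batch if view.get(key) == value)
        if 0 < applied < len(batch):
            return True
    return False


batch: WriteBatch = [("a", "new"), ("b", "new")]

non_atomic_views = list(apply_non_atomic({"a": "old", "b": "old"}, batch))
atomic_views = list(apply_atomic({"a": "old", "b": "old"}, batch))

print("non-atomic reader saw a partial batch:", observed_partial(non_atomic_views, batch))
print("atomic reader saw a partial batch:", observed_partial(atomic_views, batch))
```

In this toy model, a raw KV reader between the two steps of the non-atomic path sees `a` updated and `b` not, which is the kind of intermediate state the comment above refers to.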

/cc @Little-Wallace @youjiali1995

yiwu-arbug commented 3 years ago

If we want to move forward with unordered write we need a GA plan.

Lily2025 commented 2 years ago

/type bug

tabokie commented 2 years ago

This is not a bug, removing the tag.

Little-Wallace commented 2 years ago

How about improving pipelined write to be more like unordered-write? CockroachDB has done some work on this. See more details in https://github.com/cockroachdb/pebble/blob/master/docs/rocksdb.md#commit-pipeline

Little-Wallace commented 2 years ago

I do not think it is a good idea to move forward with unordered-write. It may cause more problems in the future because we cannot make the write batch atomic.