influxdata / influxdb-comparisons

Code for comparison write ups of InfluxDB and other solutions
MIT License

Cassandra: do not use batch load #161

Closed Sasasu closed 4 years ago

Sasasu commented 4 years ago

Cassandra batches are not meant to improve performance; they exist only to ensure atomicity and isolation. Using batches to load data is a common design mistake.

see here and here

In my single-value use case, the benchmark results were:

batch size = 300: loaded 52358400 items in 347.260693sec with 10 workers (mean point rate 150775.486626/sec, mean value rate 150775.486626/s, 15.52MB/sec from stdin)

individual inserts: loaded 52358400 items in 78.183677sec with 10 workers (mean point rate 669684.545983/sec, mean value rate 669684.545983/s, 68.92MB/sec from stdin)
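
As a quick sanity check on the two log lines above, the mean point rates and the resulting speedup can be recomputed from the item count and elapsed times. This is only arithmetic on the reported figures; the variable and function names are mine, not part of the benchmark tool.

```python
# Figures copied from the two benchmark log lines above.
ITEMS = 52_358_400
BATCH_SECONDS = 347.260693   # run with batch size = 300
INSERT_SECONDS = 78.183677   # run with individual inserts


def rate(items: int, seconds: float) -> float:
    """Mean points written per second."""
    return items / seconds


batch_rate = rate(ITEMS, BATCH_SECONDS)
insert_rate = rate(ITEMS, INSERT_SECONDS)
speedup = insert_rate / batch_rate

print(f"batch:   {batch_rate:,.0f} points/sec")
print(f"insert:  {insert_rate:,.0f} points/sec")
print(f"speedup: {speedup:.2f}x")
```

On these numbers, individual inserts come out roughly 4.4 times faster than batches of 300.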

Tested on a 2xlarge VM with an HDD (350 IOPS, 15M/s sequential I/O).

table schema: X(tsuid TEXT, time bigint, value double, primary key(tsuid, time))

But InfluxDB still has a huge advantage in disk space usage.
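
For illustration, loading with individual inserts spread over a pool of workers might look like the sketch below. It is a minimal sketch, not the benchmark's actual loader: `load_points` and the `execute` callable are hypothetical stand-ins (in practice `execute` would wrap a driver session bound to a prepared statement), the `?` placeholders assume a prepared statement, and the column names are taken from the schema of table X above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Tuple

Point = Tuple[str, int, float]  # (tsuid, time, value), matching table X

# One statement per row; no BATCH wrapper.
INSERT_CQL = "INSERT INTO X (tsuid, time, value) VALUES (?, ?, ?)"


def load_points(
    execute: Callable[[str, Point], object],  # hypothetical: runs one statement
    points: Iterable[Point],
    workers: int = 10,  # same worker count as in the benchmark logs
) -> int:
    """Write each point as its own INSERT, using `workers` threads.

    Returns the number of points written.
    """

    def write(point: Point) -> int:
        execute(INSERT_CQL, point)
        return 1

    # Note: pool.map submits every point up front, so very large inputs
    # should be fed in chunks to bound memory use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(write, points))


# Demo with a fake `execute` that only records what it was given.
recorded = []
count = load_points(
    lambda cql, params: recorded.append((cql, params)),
    [("cpu.host1", 1, 0.5), ("cpu.host1", 2, 0.7), ("cpu.host2", 1, 0.9)],
)
print(count, "rows written")
```

Each row goes out as a separate statement, so the coordinator never has to assemble a multi-partition batch; concurrency comes from the workers rather than from grouping rows.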

Sasasu commented 4 years ago

Small batches are faster on a single C* node:

batch size = 1  > 300k/s
batch size = 30 > 600k/s
batch size = 70 > 700k/s
batch size = 90 > 700k/s