The dataset is non-representative

alexey-milovidov commented 1 year ago

It uses uniform pseudorandom number generators, for example, https://github.com/rockset/rockbench/blob/master/generator/document.go#L123

Therefore it is not suitable for testing insertion speed, compression, or database queries.

I recommend using my project: ClickHouse Obfuscator, for generating realistic but anonymized datasets: https://github.com/ClickHouse/ClickHouse/tree/master/programs/obfuscator

Also, take a look at the ClickBench methodology: https://github.com/ClickHouse/ClickBench/

Ideally, real datasets should be used, as here: https://clickhouse.com/docs/en/getting-started/example-datasets

dhruba commented 1 year ago

Thanks for referring to the Obfuscator, that is a really cool way to generate some representative samples of data. I looked at it a while ago, will look at it in more details now.

The goal for rockbench was simple: have a representative way to generate reallife event data. Most real-life event streams have these characterisrics: G1. update happen in streaming fashion and not batch uploads or file uploads G2. event-data is usually not OLAP-type-flat-records but nested objects with arrays of objects inside each record G3. event-data has updates to existing documents and not just appends of new records

Clickbench seems to be doing these: (a) does bulk uploads of date via file-uploads and not streaming updates (b) each record of the data set has the same number of fields (ie flat tables) (c) does not update an existing record in the dataset or does not delete records. Because of the above three reasons, it would not be representative of the workload that rockbench is aiming to measure.

Rockbench currently has support to test elastic, rockset and snowflake.

dhruba commented 1 year ago

Please re-open this issue if you have more thoughts on the reasoning explained above.

rockset / rockbench

The dataset is non-representative #23