
TensorBase Frontier Edition
https://tensorbase.io/

more details about the tests #2

Open sanikolaev opened 3 years ago

sanikolaev commented 3 years ago

Hi. The benchmark results look awesome. Would it be possible to provide more details about the tests?

Also, is there any paper describing the data format in TensorBase?

jinmingjian commented 3 years ago

@sanikolaev thanks for your interest. I am busy with too many things these days :)

  1. DRAM is 32*6 = 192GB (6-channel, 32GB per channel, a standard config for a single-socket Xeon-SP bare-metal server). NOTE: the size of RAM is not important here, because we run each query multiple times (to let the data settle into the various caches) and the query set is far smaller than 192GB. (But it will be much more interesting when we show a really big dataset in the future.)

  2. The data is simple: a 1.47B-row stripped-down NYC taxi dataset with 2 columns, each a 32-bit integer per value (Datetime is implemented as 32 bits in both CH and TB). NOTE: the total number of columns in a table is not important here, because we are comparing column-wise stores. (A rough size estimate is sketched after this list.)

  3. I am working on basic String support, so some initial benchmark results from TPC-H should follow soon. (The alpha website was released earlier than I had imagined.)

  4. There is no paper yet, for lack of time... The initial storage format is, in fact, quite basic. The interesting part is how the data gets into storage. It does not use an LSM tree or anything similar, as CH and most of the popular open-source peers do. The downside of an LSM tree is that you pay for fast writes over the long run. Two further questions you can ask here:

    • Does the LSM tree achieve the global optimum for servers that run 24x7?
    • How fast can we go if we discard the LSM tree? I think TensorBase gives its own innovative answer here :smile: (a rough contrast is sketched after this list).
  5. The full open-sourcing could come faster than I thought. Before that happens, I would like to invite some early users/people/partners to join the work and move it along more quickly. If you or others are interested in this, you can contact me in any way you like.
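For context, here is a back-of-the-envelope estimate (my own, not from the project) of the raw dataset size implied by point 2, assuming plain fixed-width 32-bit values with no compression or metadata:

```rust
// Rough size estimate for the benchmark dataset described above:
// 1.47B rows, 2 columns, 4 bytes (32-bit) per value.
// The flat, uncompressed layout is an illustrative assumption,
// not a statement about TensorBase's actual on-disk format.
fn main() {
    let rows: u64 = 1_470_000_000;
    let columns: u64 = 2;
    let bytes_per_value: u64 = 4; // 32-bit integer / datetime

    let raw_bytes = rows * columns * bytes_per_value;
    let raw_gib = raw_bytes as f64 / (1024.0 * 1024.0 * 1024.0);

    // Roughly 11.8GB (~11 GiB) -- far below the 192GB of DRAM, so repeated
    // runs can be served entirely from memory and the OS page cache.
    println!("raw dataset size: {} bytes (~{:.1} GiB)", raw_bytes, raw_gib);
}
```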
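To make the LSM point in item 4 concrete, here is a minimal toy sketch contrasting an LSM-style write path (memtable, flush, compaction) with a plain columnar append. All names and thresholds are invented for illustration; TensorBase's real storage design is not published, so this does not describe its actual code:

```rust
use std::collections::BTreeMap;

// Toy LSM-style writer: writes land in an in-memory memtable that is
// periodically flushed to an immutable sorted run; runs are later merged
// (compaction), so the same value may be rewritten several times.
// This rewriting is the long-run cost paid for cheap individual writes.
struct LsmWriter {
    memtable: BTreeMap<u32, u32>,
    runs: Vec<Vec<(u32, u32)>>, // flushed sorted runs (stand-ins for SST files)
    memtable_limit: usize,
}

impl LsmWriter {
    fn put(&mut self, key: u32, value: u32) {
        self.memtable.insert(key, value);
        if self.memtable.len() >= self.memtable_limit {
            // Flush: the memtable is rewritten as a sorted run.
            let run: Vec<(u32, u32)> = self.memtable.iter().map(|(k, v)| (*k, *v)).collect();
            self.runs.push(run);
            self.memtable.clear();
        }
        if self.runs.len() > 4 {
            self.compact(); // merge runs: the data is rewritten yet again
        }
    }

    fn compact(&mut self) {
        let mut merged = BTreeMap::new();
        for run in self.runs.drain(..) {
            merged.extend(run);
        }
        self.runs.push(merged.into_iter().collect());
    }
}

// Toy append-only columnar writer: each column is appended in arrival order,
// so every value is written exactly once and never rewritten.
struct ColumnAppender {
    col_a: Vec<u32>,
    col_b: Vec<u32>,
}

impl ColumnAppender {
    fn append(&mut self, a: u32, b: u32) {
        self.col_a.push(a);
        self.col_b.push(b);
    }
}

fn main() {
    let mut lsm = LsmWriter { memtable: BTreeMap::new(), runs: Vec::new(), memtable_limit: 1_000 };
    let mut cols = ColumnAppender { col_a: Vec::new(), col_b: Vec::new() };
    for i in 0..10_000u32 {
        lsm.put(i, i * 2);     // may be rewritten on flush and again on compaction
        cols.append(i, i * 2); // written once
    }
    println!("lsm runs: {}, appended rows: {}", lsm.runs.len(), cols.col_a.len());
}
```

The rewriting in the LSM path is the "pay for fast writing in the long run" cost mentioned above; discarding it only helps if ingestion stays cheap some other way, which is presumably the question TensorBase's design tries to answer.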

sanikolaev commented 3 years ago

NOTE: the size of RAM is not important here, because we run each query multiple times (to let the data settle into the various caches)

Why is that? It doesn't seem practical to me to measure only hot queries. In real analytics, with real queries, the chance of an IO operation is high.

and the query set is far smaller than 192GB. (But it will be much more interesting when we show a really big dataset in the future.)

Yes, even if you measure only hot queries, it will be interesting to see the results when the data can't fully fit into RAM. Then you'll have to read from disk, and the storage format becomes a key factor: how well the data is compressed, how many IOPS it takes to read it, what exactly you keep in the limited amount of RAM while processing the query, etc.

I'll be happy to play with the open-source version when it's available.

jinmingjian commented 3 years ago

Three points:

  1. Benchmarks should be done "apples to apples". (In fact, many benchmarks fail to achieve this.) Enabling the cache effect is exactly what makes the comparison "apples to apples", because loading data from disk can be too complex to compare fairly: the caching mechanisms (how data is loaded from disk) vary widely, and the compression you mention is just one part of the bigger picture affecting that loading. For example, I might use two layers of cache while you use only one; you may be faster on the one-shot first load, but I may be better in overall performance. This is why a modern x86 CPU has three levels of cache (L1/L2/L3). (A toy two-tier lookup is sketched after this list.)
  2. Also, as you said, hot data can be hit in the (memory) cache, so this comparison is still meaningful for a good share of real-world cases. That is why we have all these kinds of caches in the first place.
  3. You are right: IO-bound benchmarks, where the data can't fully fit into RAM, are another important scenario. This is workload-dependent, and it is in fact what TensorBase wants to solve better than other open-source peers. It should be possible to show interesting results in the next round of benchmarks, but the caveat is still "apples to apples"; let's discuss that when we get there.
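On the multi-layer cache point in item 1, here is a minimal sketch of a two-tier lookup. It is purely my own illustration (the tier names, sizes, and the `load_from_disk` stand-in are invented, not TensorBase or ClickHouse code): the first access of a cold key pays the full cost, but repeated accesses are served from the hot tier, which is the one-shot-load vs. global-performance trade-off described in item 1:

```rust
use std::collections::HashMap;

// Toy two-tier cache: a small "hot" tier in front of a larger "warm" tier.
// A miss in both tiers falls back to the slow path ("disk"), and the value
// is then promoted into both tiers for later accesses.
struct TwoTierCache {
    hot: HashMap<u64, Vec<u8>>,  // small, fastest tier
    warm: HashMap<u64, Vec<u8>>, // larger, slower tier
}

impl TwoTierCache {
    fn get(&mut self, key: u64) -> Vec<u8> {
        if let Some(v) = self.hot.get(&key) {
            return v.clone(); // hot hit: the cheapest path
        }
        if let Some(v) = self.warm.get(&key) {
            let v = v.clone();
            self.hot.insert(key, v.clone()); // promote to the hot tier
            return v;
        }
        // Double miss: pay the full "disk" cost once, then cache the result.
        let v = load_from_disk(key);
        self.warm.insert(key, v.clone());
        self.hot.insert(key, v.clone());
        v
    }
}

// Stand-in for the expensive cold read (IO, decompression, etc.).
fn load_from_disk(key: u64) -> Vec<u8> {
    key.to_le_bytes().to_vec()
}

fn main() {
    let mut cache = TwoTierCache { hot: HashMap::new(), warm: HashMap::new() };
    let _cold = cache.get(42); // first access: slow path
    let _hot = cache.get(42);  // repeated access: served from the hot tier
}
```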