Open sanikolaev opened 3 years ago
@sanikolaev thanks for your interest. I am busy with too many things these days :)
DRAM is 32*6 = 192GB (6-channel, 32GB per channel, a standard config for a single-socket Xeon-SP bare-metal server). NOTE: the size of RAM is not important here, because we run the queries multiple times (to let the data sit in the various caches) and the query set is far smaller than 192GB. (But it will be much more interesting to show a really big dataset in the future.)
The data is simple: 2 columns, one 32-bit integer per column (Datetime is implemented as 32 bits in both CH and TB), in a 1.47B-row stripped NYC taxi dataset. NOTE: the total number of columns in a table is not important here, because we are talking about column-wise stores.
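A quick back-of-the-envelope check (using only the numbers from the comment above: 1.47B rows, 2 columns, 4 bytes per value, 192GB of RAM) confirms why the RAM size is not a factor here:

```python
# Estimate the in-memory size of the benchmark dataset described above.
# All numbers are taken from the comment; this is only arithmetic.
ROWS = 1_470_000_000
COLUMNS = 2
BYTES_PER_VALUE = 4            # 32-bit integer / Datetime

dataset_bytes = ROWS * COLUMNS * BYTES_PER_VALUE
dataset_gib = dataset_bytes / 2**30

ram_gib = 6 * 32               # 6 channels x 32 GB per channel

print(f"dataset ~= {dataset_gib:.1f} GiB, RAM = {ram_gib} GiB")
# The raw columns are roughly 11 GiB, far below 192 GiB of RAM.
```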
I am working on primary String support, so some initial benchmark results from TPC-H should follow soon. (The alpha website was released earlier than I had imagined.)
There is no paper yet, for lack of time... The initial storage is, in fact, primitive. The interesting part is how the data goes into storage. It does not use the common LSM tree (or anything similar), as CH and most popular open-source peers do. The drawback of an LSM tree is that you pay for fast writes over the long run. Two further questions you could ask here:
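The LSM trade-off mentioned above can be sketched in a few lines. This is a deliberately minimal illustration, not TensorBase's or ClickHouse's actual implementation: writes land in an in-memory table that is flushed to an immutable sorted run when full, so a read may have to probe several runs; that read-side cost (and the background compaction it eventually forces) is the price paid for cheap writes.

```python
# Minimal LSM-style store sketch (illustrative only).
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []              # immutable sorted (key, value) runs
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # cheap in-memory write
        if len(self.memtable) >= self.limit:
            # flush: freeze the memtable into a new sorted run
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # read amplification: probe every run, newest first
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for k in range(10):
    db.put(k, k * k)
print(db.get(3), len(db.runs))  # the lookup had to search past run(s)
```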
The full open-source release could come faster than I thought. Before that happens, I would like to invite some early users/people/partners to join the work and move it along more quickly. If you or others are interested, you can contact me through any channel.
> NOTE: The size of RAM is not important here in that we run results multiple times (to let data in kinds of cache)
Why is that? It doesn't seem practical to me to measure only hot queries. In real analytics, while running real queries, the chance of an I/O operation is high.
> and the query set is far smaller than 192GB. (But it will be more much interesting we show big big big dataset in future.)
Yes, and even if you measure only hot queries, it would be interesting to see the results when the data can't fully fit into RAM. Then you have to read from disk, and the storage format becomes the key thing: how well the data is compressed, in how many IOPS you can read it, what exactly you keep in the limited RAM while processing the query, etc.
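The compression point above is easy to demonstrate with a toy column (hypothetical random data, not the NYC taxi set, compressed with stock zlib rather than any engine's real codec): the same integers compress much better once sorted, which is one reason the physical layout matters so much when data has to come off disk.

```python
# Compare compressed sizes of a 32-bit integer column, shuffled vs sorted.
import random
import struct
import zlib

random.seed(0)
values = [random.randrange(1_000_000) for _ in range(100_000)]

def compressed_size(ints):
    raw = struct.pack(f"<{len(ints)}i", *ints)  # 32-bit little-endian
    return len(zlib.compress(raw))

raw_bytes = 4 * len(values)
shuffled = compressed_size(values)
ordered = compressed_size(sorted(values))
print(raw_bytes, shuffled, ordered)  # the sorted column compresses best
```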
I'll be happy to play with the open-source version when it's available.
Hi. The benchmark results look awesome. Would it be possible to provide more details about the tests? The questions are three-fold:
Also, is there any paper describing the data format in TensorBase?