velvia / cassandra-gdelt

Experiments with the GDELT dataset and Cassandra schemas.

Can you show a query sample for Columnar Layout? #1

Open ajige opened 6 years ago

ajige commented 6 years ago

How can it be so fast? "This layout places values of the same column from different rows together, and also serializes multiple row values into one cell." Can you explain the quoted sentence? I don't think it's true.

velvia commented 6 years ago

Yes, it is true: columnar compression. For example, let's say you have these values:

| Name   | Age |
|--------|-----|
| Evan   | 18  |
| Ajige  | 17  |
| Sandra | 20  |
| Ted    | 25  |

Normally you store these into C* as follows: (Evan, 18), (Ajige, 17), ... What is done here is the following: we store a highly compressed version of (Evan, Ajige, Sandra, Ted) and (18, 17, 20, 25), which is a much more optimized layout with far less I/O.
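A minimal sketch in Python (not code from this repo) contrasting the two layouts; the exact packing format is an illustrative assumption:

```python
import struct

names = ["Evan", "Ajige", "Sandra", "Ted"]
ages = [18, 17, 20, 25]

# Row-oriented: one cell per value, columns interleaved row by row.
row_cells = list(zip(names, ages))  # (Evan, 18), (Ajige, 17), ...

# Column-oriented: all values of one column packed into a single blob
# stored as ONE cell. Same-typed values sit together, so they compress
# well and a scan of one column touches far fewer bytes.
age_blob = struct.pack(f"<{len(ages)}i", *ages)            # 4 ints -> 16 bytes
name_blob = b"\x00".join(n.encode("utf-8") for n in names)

print(row_cells)
print(len(age_blob), len(name_blob))
```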


ajige commented 6 years ago

To be more specific, my original table schema is as below:

CREATE TABLE columnar.tbl(
    pointdate date,
    pointtime timestamp,
    rowid int,
    ask float,
    ask_vol float,
    bid float,
    bid_vol float,
    borrow float,
    borrow_rate float,
    call_put text,
    expiry_date int,
    PRIMARY KEY (pointdate, pointtime, rowid)
);

Transformed to columnar, it would be:

CREATE TABLE columnar.tbl_columnar(
    pointdate date,
    pointtime timestamp,
    rowid int,
    columnname text,
    value float,
    PRIMARY KEY (pointdate, pointtime, columnname, rowid)
);

Thanks for help!!!

velvia commented 6 years ago

So the details are important: it all depends on the type of query you run, and the schema matters too.

In your case, a columnar layout would be more like `PRIMARY KEY (pointdate, columnname, pointtime)`. The column name has to be the most significant clustering key. The idea is that most analytical queries do not read most of the columns, so this layout reduces I/O significantly.

You also do not need the rowid. All the rows under a single point in time, or a range of times, would be encoded as a single chunk, so the number of rows in C* is also much smaller. Feel free to try this out using my FiloDB project, which is much higher level and abstracts away all of this work; you just use Spark to ingest and query.
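As a sketch of what that might look like, assuming the Python cassandra-driver; the table name `tbl_chunked` and the columns `chunkstart` and `chunk` are hypothetical names for illustration:

```python
from cassandra.cluster import Cluster

# Hypothetical chunked layout: the column name is the most significant
# clustering key, and each row holds a blob of many serialized values.
DDL = """
CREATE TABLE IF NOT EXISTS columnar.tbl_chunked (
    pointdate  date,
    columnname text,
    chunkstart timestamp,  -- first pointtime covered by this chunk
    chunk      blob,       -- many serialized values of one column
    PRIMARY KEY (pointdate, columnname, chunkstart)
)
"""

session = Cluster(["127.0.0.1"]).connect()
session.execute(DDL)

# Reading one column for one date then touches only that column's slice:
rows = session.execute(
    "SELECT chunk FROM columnar.tbl_chunked "
    "WHERE pointdate = '2016-08-01' AND columnname = 'ask'"
)
```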

On Mar 22, 2018, ajige wrote:

So one row in tbl will be 8 rows in tbl_columnar, is that right? columnname will be "ask", "ask_vol", ... But if columns have different types, how do we store them? In tbl_columnar, value is defined as float (since most columns in tbl are float), but call_put is text. How do we store call_put in tbl_columnar?

I tested your example in the README; the query on the columnar format (like the tbl_columnar format) is not that fast, only several times faster than on tbl. For example: the query on tbl runs at 30,000 rows/second, while on tbl_columnar it runs at 150,000 rows/second.

Is tbl_columnar defined with COMPACT STORAGE?

Can you provide an example of a columnar schema?


ajige commented 6 years ago

What does this mean: "All the rows under a single point of time or a range of times would be encoded as a single chunk"? Suppose we have two rows with the same (pointdate, columnname, pointtime). If rowid is not added to the primary key, then when the second row is inserted it will replace the first row, right? So how can they be encoded as a single chunk?

Thanks for help.

velvia commented 6 years ago

Hi Ajige,

Let me try to explain.


What you do is store the data from the second, third, fourth, etc. rows (everything under the same point in time) in the same binary blob. Let's say your binary blob is an array of doubles and you have 5 rows. You store a single blob consisting of an array of 5 doubles, 8 bytes each, for a total of 40 bytes. Then you insert this as ONE insert under the (pointdate, columnname, pointtime) primary key. Thus no data is lost, and it is much more efficient than normal inserts because you are inserting only once.

You might ask: OK, now how do you read and translate it? Yes, this is much more work on the read and write side. It is basically what Parquet does, but applied to Cassandra instead of files. You can try out www.github.com/filodb/FiloDB, which applies this concept and does all the work for you.

-Evan
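A minimal sketch of that encoding in Python (the little-endian packing format is my assumption, not necessarily this repo's actual format):

```python
import struct

# Five rows of one column (e.g. ask) under the same point in time.
ask_values = [1.10, 1.12, 1.09, 1.11, 1.13]

# Pack all five doubles into one blob: 5 * 8 bytes = 40 bytes.
blob = struct.pack(f"<{len(ask_values)}d", *ask_values)
assert len(blob) == 40

# The blob is inserted ONCE under (pointdate, columnname, pointtime).
# On the read side, unpacking restores the original per-row values:
decoded = struct.unpack(f"<{len(blob) // 8}d", blob)
assert list(decoded) == ask_values
```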


ajige commented 6 years ago

Thanks. So is the test result for columnar achieved with FiloDB, or with just Cassandra? If it is just Cassandra, I cannot reproduce the high columnar query performance on my server. In the example I gave, my query is:

    select * from tbl_columnar where pointdate = '2016-08-01' and pointtime < '20160801:240000' and pointtime > '20160801:000000' and columnname = 'ask';

What is wrong? Can you give a schema and query example for columnar?

velvia commented 6 years ago

The test result is using the columnar compression; you can see in the code that I am writing chunks. This repo is basically the POC for an idea that eventually turned into FiloDB. FiloDB makes all of this much easier, so you might just want to test that: you can use Spark to easily ingest and query.

Can you run the code from this repo exactly as is and reproduce the same results? You can substitute your own data and column names. One critical factor is how many rows you put into a single chunk: put, for example, 1000 rows into a single chunk and you will get good performance.
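An illustrative sketch (not this repo's actual code) of grouping 1000 values per chunk before writing:

```python
import itertools
import struct

CHUNK_SIZE = 1000  # rows per chunk; larger chunks amortize per-row overhead

def encode_chunks(values, size=CHUNK_SIZE):
    """Yield one packed blob per `size` float values of a single column,
    ordered by time. Each blob becomes a single Cassandra insert."""
    it = iter(values)
    while batch := list(itertools.islice(it, size)):
        yield struct.pack(f"<{len(batch)}d", *batch)

# Example: 2500 values -> chunks of 1000, 1000, and 500 doubles.
blobs = list(encode_chunks(float(i) for i in range(2500)))
assert [len(b) // 8 for b in blobs] == [1000, 1000, 500]
```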
