verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0
248 stars 66 forks source link

Joezhong fix 312 #314

Closed Beastjoe closed 5 years ago

Beastjoe commented 5 years ago

Related issue #312

Add a field "primarykeyColumnName" when creating scramble tables. If scramble tables have primary keys, it will use this field to add primary key.

pyongjoo commented 5 years ago

Thanks. Can you also test the latency with the below query?

The expected behavior is this. Let's Q1 be an internal query used by VerdictDB (i.e., with extra predicates on verdictdbblock). Then, issuing "explain Q1" should shows that "eq_ref" is used for joining tables.

SELECT SUM(l_extendedprice * (1 - l_discount))
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey;
Beastjoe commented 5 years ago

I run the explain and there are three rows of results. I am not sure whether it is expected.

| id | select_type | table | partitions | type | possible_keys | key | key_len | ref |  rows | filtered | Extra  |
|  1 | SIMPLE      | vt2   | p2  | ALL    | PRIMARY   | NULL    | NULL    | NULL   |   95 |    10.00 | Using where; | Using temporary                       |
|  1 | SIMPLE      | vt1   | NULL       | eq_ref | PRIMARY       | PRIMARY | 4       | verdictstatement_test_sca0umuo.vt2.o_custkey |    1 |    10.00 | Using where  |
|  1 | SIMPLE      | vt3   | p0,p1,p2   | ALL    | NULL          | NULL    | NULL    | NULL     |  315 |     1.11 | Using where; Using join buffer (Block Nested Loop) |

Currently, lineitem_scramble doesn't have primary key and orders_scramble have (o_orderkey, verdictdbblock) as its primary key.

pyongjoo commented 5 years ago

The expected behavior is (1) that orders_scramble has "o_orderkey" as a primary key, and (2) that customer has "c_custkey" as a primary key.

Beastjoe commented 5 years ago

Yes, customer also has "c_custkey" as its primary key. For scramble tables, I think in MySQL, a primary key must include all columns in the table's partitioning function.

pyongjoo commented 5 years ago

For primary key retrieval, I think creating another node at this line is the cleaner approach. Of course, if the SqlSyntax does not support getPrimaryKey() (or does not return null), the node should be ignored. If the scramblingNode does not subscribe to the node, it is effectively ignored.

pyongjoo commented 5 years ago

Can you double-check the relationship between the partitoning key and the primary key? I just found the following query on mysql's website:

CREATE TABLE t5 (
    col1 INT NOT NULL,
    col2 DATE NOT NULL,
    col3 INT NOT NULL,
    col4 INT NOT NULL,
    PRIMARY KEY(col1, col2)
)
PARTITION BY HASH(col3)
PARTITIONS 4;
pyongjoo commented 5 years ago

My bad. The above query I pasted was an example of invalid statements.

pyongjoo commented 5 years ago

Then, can you just compare the latency of running the original join query and Verdict's internal query (on some combination of aggblock)?

Beastjoe commented 5 years ago

Sorry for the lone delay. TPCH1g take too long to generate scramble tables so I change the size to 100 Mb. For internal query of VerdictDB, it will take about 0.14s. The original one will take 1.17s to finish.

codecov-io commented 5 years ago

Codecov Report

Merging #314 into master will increase coverage by 0.11%. The diff coverage is 69.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #314      +/-   ##
==========================================
+ Coverage   70.46%   70.57%   +0.11%     
==========================================
  Files         167      168       +1     
  Lines       10994    11103     +109     
  Branches     1794     1823      +29     
==========================================
+ Hits         7746     7835      +89     
- Misses       2795     2802       +7     
- Partials      453      466      +13
Impacted Files Coverage Δ
...ictdb/core/scrambling/UniformScramblingMethod.java 93.75% <ø> (ø) :arrow_up:
.../core/scrambling/FastConvergeScramblingMethod.java 91.08% <ø> (+0.3%) :arrow_up:
...g/verdictdb/coordinator/ScramblingCoordinator.java 75.44% <ø> (ø) :arrow_up:
...ain/java/org/verdictdb/sqlsyntax/ImpalaSyntax.java 72.98% <ø> (ø) :arrow_up:
...rdictdb/core/querying/CreateTableAsSelectNode.java 89.66% <ø> (ø) :arrow_up:
...erdictdb/core/scrambling/HashScramblingMethod.java 77.56% <ø> (ø) :arrow_up:
.../java/org/verdictdb/connection/DbmsConnection.java 100% <ø> (ø) :arrow_up:
.../java/org/verdictdb/connection/StaticMetaData.java 60.42% <0%> (-1.28%) :arrow_down:
...g/verdictdb/connection/CachedMetaDataProvider.java 56.67% <0%> (-1.95%) :arrow_down:
...c/main/java/org/verdictdb/sqlsyntax/SqlSyntax.java 75% <100%> (+2.28%) :arrow_up:
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 17bbe13...9225341. Read the comment docs.

Beastjoe commented 5 years ago

I have just modified the codes based on the comments. Please have a check.

pyongjoo commented 5 years ago

@Beastjoe I think the added primary key option is showing effect (which is great), but the overall speed was not fast enough. So the latency numbers (for the join query above on TPCH1g) were

  1. original query (bypass): 1.0 mins 23.519 seconds
  2. verdict's internal query (picked one from the debug message): 15.767 second
  3. with verdict's interface: 2.0 mins 25.093 seconds

Maybe the use of Concurrent connection makes the query processing even slower for mysql. Can you change the logic so that VerdictContext uses JdbcConnection instead of ConcurrentJdbcConnection for mysql?

Beastjoe commented 5 years ago

Sure. I will take a look.

Beastjoe commented 5 years ago

@pyongjoo I have made the change. For testing, I am using 100Mb tpch. The original query takes 1.145s to finish. With verdict's interface, it may takes 5s. (I assume it means let VerdictStatement runs the sql without using stream)

pyongjoo commented 5 years ago

Can you test with TPCH1G? On my machine, creating a scramble for lineitem took about 2 mins.

Also, please analyze what's the most time-consuming part so that we can figure out how to address the issue.

Beastjoe commented 5 years ago

@pyongjoo I have tested on TPCH1g. On my machine, original query takes 8.377s to complete, while ours might take 10.043s. However, the result of verdictdb varies a lot(sometimes it only take 4s). I think it is due to the implement of in-memory database.

I find most of time is consumed in executing the selectAggExeuctionNode. Somehow, the performance is related to the value of maxNumberOfRunningNode in ExecutableRunner L198. The current value is 10. If I change it to 5 or smaller, the time can be reduced to 2.693s. You may try on your machine.

pyongjoo commented 5 years ago

Thanks for the update. Are those results with JdbcConnection in place of ConcurrentJdbcConnection?

Beastjoe commented 5 years ago

Yes. I have replaced ConcurrentJdbcConnection in VerdictContext for MySQL. You can have a check.

pyongjoo commented 5 years ago

I see. My expectation was that the queries issued to the same JdbcConnection are serialized by the backend dbms itself so does not impact the performance how many queries we issue. But, maybe it’s not the case.

Can you see if we can run individual agg nodes one by one for mysql? I think we should somehow alter NodeRunner, but haven’t look into details.

Beastjoe commented 5 years ago

I have updated the code. Now, it takes VerdictDB 1.2s to complete the query.

pyongjoo commented 5 years ago

It looks good. I just left some minor comments, but I will go ahead and perform some test with this code.

pyongjoo commented 5 years ago

I just pushed some to this branch to change the JdbcConnection.abort() logic slightly.

pyongjoo commented 5 years ago

I have merged this pull request into https://github.com/mozafari/verdictdb/pull/315