Closed Beastjoe closed 5 years ago
Thanks. Can you also test the latency with the below query?
The expected behavior is this. Let's Q1 be an internal query used by VerdictDB (i.e., with extra predicates on verdictdbblock). Then, issuing "explain Q1" should shows that "eq_ref" is used for joining tables.
SELECT SUM(l_extendedprice * (1 - l_discount))
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey;
I run the explain and there are three rows of results. I am not sure whether it is expected.
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | vt2 | p2 | ALL | PRIMARY | NULL | NULL | NULL | 95 | 10.00 | Using where; | Using temporary |
| 1 | SIMPLE | vt1 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | verdictstatement_test_sca0umuo.vt2.o_custkey | 1 | 10.00 | Using where |
| 1 | SIMPLE | vt3 | p0,p1,p2 | ALL | NULL | NULL | NULL | NULL | 315 | 1.11 | Using where; Using join buffer (Block Nested Loop) |
Currently, lineitem_scramble
doesn't have primary key and orders_scramble
have (o_orderkey, verdictdbblock) as its primary key.
The expected behavior is (1) that orders_scramble has "o_orderkey" as a primary key, and (2) that customer has "c_custkey" as a primary key.
Yes, customer also has "c_custkey" as its primary key. For scramble tables, I think in MySQL, a primary key must include all columns in the table's partitioning function.
For primary key retrieval, I think creating another node at this line is the cleaner approach. Of course, if the SqlSyntax does not support getPrimaryKey() (or does not return null), the node should be ignored. If the scramblingNode
does not subscribe to the node, it is effectively ignored.
Can you double-check the relationship between the partitoning key and the primary key? I just found the following query on mysql's website:
CREATE TABLE t5 (
col1 INT NOT NULL,
col2 DATE NOT NULL,
col3 INT NOT NULL,
col4 INT NOT NULL,
PRIMARY KEY(col1, col2)
)
PARTITION BY HASH(col3)
PARTITIONS 4;
My bad. The above query I pasted was an example of invalid statements.
Then, can you just compare the latency of running the original join query and Verdict's internal query (on some combination of aggblock)?
Sorry for the lone delay. TPCH1g take too long to generate scramble tables so I change the size to 100 Mb. For internal query of VerdictDB, it will take about 0.14s. The original one will take 1.17s to finish.
Merging #314 into master will increase coverage by
0.11%
. The diff coverage is69.85%
.
@@ Coverage Diff @@
## master #314 +/- ##
==========================================
+ Coverage 70.46% 70.57% +0.11%
==========================================
Files 167 168 +1
Lines 10994 11103 +109
Branches 1794 1823 +29
==========================================
+ Hits 7746 7835 +89
- Misses 2795 2802 +7
- Partials 453 466 +13
Impacted Files | Coverage Δ | |
---|---|---|
...ictdb/core/scrambling/UniformScramblingMethod.java | 93.75% <ø> (ø) |
:arrow_up: |
.../core/scrambling/FastConvergeScramblingMethod.java | 91.08% <ø> (+0.3%) |
:arrow_up: |
...g/verdictdb/coordinator/ScramblingCoordinator.java | 75.44% <ø> (ø) |
:arrow_up: |
...ain/java/org/verdictdb/sqlsyntax/ImpalaSyntax.java | 72.98% <ø> (ø) |
:arrow_up: |
...rdictdb/core/querying/CreateTableAsSelectNode.java | 89.66% <ø> (ø) |
:arrow_up: |
...erdictdb/core/scrambling/HashScramblingMethod.java | 77.56% <ø> (ø) |
:arrow_up: |
.../java/org/verdictdb/connection/DbmsConnection.java | 100% <ø> (ø) |
:arrow_up: |
.../java/org/verdictdb/connection/StaticMetaData.java | 60.42% <0%> (-1.28%) |
:arrow_down: |
...g/verdictdb/connection/CachedMetaDataProvider.java | 56.67% <0%> (-1.95%) |
:arrow_down: |
...c/main/java/org/verdictdb/sqlsyntax/SqlSyntax.java | 75% <100%> (+2.28%) |
:arrow_up: |
... and 22 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 17bbe13...9225341. Read the comment docs.
I have just modified the codes based on the comments. Please have a check.
@Beastjoe I think the added primary key option is showing effect (which is great), but the overall speed was not fast enough. So the latency numbers (for the join query above on TPCH1g) were
Maybe the use of Concurrent connection makes the query processing even slower for mysql. Can you change the logic so that VerdictContext uses JdbcConnection instead of ConcurrentJdbcConnection for mysql?
Sure. I will take a look.
@pyongjoo I have made the change. For testing, I am using 100Mb tpch. The original query takes 1.145s to finish. With verdict's interface, it may takes 5s. (I assume it means let VerdictStatement runs the sql without using stream)
Can you test with TPCH1G? On my machine, creating a scramble for lineitem took about 2 mins.
Also, please analyze what's the most time-consuming part so that we can figure out how to address the issue.
@pyongjoo I have tested on TPCH1g. On my machine, original query takes 8.377s to complete, while ours might take 10.043s. However, the result of verdictdb varies a lot(sometimes it only take 4s). I think it is due to the implement of in-memory database.
I find most of time is consumed in executing the selectAggExeuctionNode. Somehow, the performance is related to the value of maxNumberOfRunningNode in ExecutableRunner L198. The current value is 10. If I change it to 5 or smaller, the time can be reduced to 2.693s. You may try on your machine.
Thanks for the update. Are those results with JdbcConnection in place of ConcurrentJdbcConnection?
Yes. I have replaced ConcurrentJdbcConnection in VerdictContext for MySQL. You can have a check.
I see. My expectation was that the queries issued to the same JdbcConnection are serialized by the backend dbms itself so does not impact the performance how many queries we issue. But, maybe it’s not the case.
Can you see if we can run individual agg nodes one by one for mysql? I think we should somehow alter NodeRunner, but haven’t look into details.
I have updated the code. Now, it takes VerdictDB 1.2s to complete the query.
It looks good. I just left some minor comments, but I will go ahead and perform some test with this code.
I just pushed some to this branch to change the JdbcConnection.abort() logic slightly.
I have merged this pull request into https://github.com/mozafari/verdictdb/pull/315
Related issue #312
Add a field "primarykeyColumnName" when creating scramble tables. If scramble tables have primary keys, it will use this field to add primary key.