secretflow / scql

SCQL (Secure Collaborative Query Language) is a system that allows multiple distrusting parties to run joint analysis without revealing their private data.
https://www.secretflow.org.cn/docs/scql/en/
Apache License 2.0
126 stars 46 forks source link

联表查询时执行计划问题 #382

Open daoshouchen opened 1 week ago

daoshouchen commented 1 week ago

Issue Type

Running

Have you searched for existing issues?

Yes

OS Platform and Distribution

linux

SCQL Version

SCQL.0.9.0b1

What happend and What you expected to happen.

select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude
from call_record r
         left join call_summary s on r.call_id = s.call_id
where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'
表call_record与表call_summary各200万数据,关联条件call_id,正常写sql应该都如下,r表call_id精确某条条件,对应关联查出s表数据。left join。但是会产生两个问题:

1、执行计划根据call_record 表r.call_id精确查询出数据,而call_summary表全表扫描后,在内存中进行求交,性能极差,理论上解析sql后执行计划可以将call_summary表也根据s.call_id精确查询出交集。
2、当加上 and s.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'条件后,执行计划依然是不对。修改成inner join后两库都精确查询并且快速匹配返回。

这块的执行计划是否有优化的空间?

Configuration used to run SCQL.

1、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude
from call_record r
         left join call_summary s on r.call_id = s.call_id
where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'

2、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude
from call_record r
         left join call_summary s on r.call_id = s.call_id
where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'
and s.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'

3、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude
from call_record r
         inner join call_summary s on r.call_id = s.call_id
where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'
and s.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'

SCQL log output.

scdb-log:2024-10-21 11:07:48.102111 INFO submit_and_get_handler.go:307 Execution Plan:
digraph G {
0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select r.call_id,r.lrrq from jiaotongju.call_record as r where r.call_id='ID-27251718675267_7782_2024-09-13_10:25:15',table_refs:[jiaotongju.call_record],],party:[jiaotong,]}"]
1 [label="runsql:{in:[],out:[Out:{t_2,t_3,},],attr:[sql:select s.attitude,s.call_id from alice.call_summary as s,table_refs:[alice.call_summary],],party:[alice,]}"]
2 [label="join:{in:[Left:{t_0,},Right:{t_3,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[jiaotong alice],join_type:0,psi_algorithm:0,],party:[jiaotong,alice,]}"]
3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,},],attr:[],party:[jiaotong,]}"]
4 [label="filter_by_index:{in:[Data:{t_2,t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_8,t_9,},],attr:[],party:[alice,]}"]
5 [label="copy:{in:[In:{t_6,},],out:[Out:{t_10,},],attr:[input_party_codes:jiaotong,output_party_codes:alice,],party:[jiaotong,alice,]}"]
6 [label="copy:{in:[In:{t_7,},],out:[Out:{t_12,},],attr:[input_party_codes:jiaotong,output_party_codes:alice,],party:[jiaotong,alice,]}"]
7 [label="publish:{in:[In:{t_10,t_12,t_8,},],out:[Out:{t_11,t_13,t_14,},],attr:[],party:[alice,]}"]
0 -> 2 [label = "t_0:{call_id:PRIVATE:STRING}"]
0 -> 3 [label = "t_0:{call_id:PRIVATE:STRING}"]
0 -> 3 [label = "t_1:{lrrq:PRIVATE:DATETIME}"]
1 -> 2 [label = "t_3:{call_id:PRIVATE:STRING}"]
1 -> 4 [label = "t_2:{attitude:PRIVATE:STRING}"]
1 -> 4 [label = "t_3:{call_id:PRIVATE:STRING}"]
2 -> 3 [label = "t_4:{call_id:PRIVATE:INT64}"]
2 -> 4 [label = "t_5:{call_id:PRIVATE:INT64}"]
3 -> 5 [label = "t_6:{call_id:PRIVATE:STRING}"]
3 -> 6 [label = "t_7:{lrrq:PRIVATE:DATETIME}"]
4 -> 7 [label = "t_8:{attitude:PRIVATE:STRING}"]
5 -> 7 [label = "t_10:{call_id:PRIVATE:STRING}"]
6 -> 7 [label = "t_12:{lrrq:PRIVATE:DATETIME}"]
jingshi-ant commented 1 week ago

1.方便的话,可以补充下3个query对应的Plan详情,方便我们比对diff。 2.200w数据量级并不大,即使全部扫描应该也是秒级别的?你们有性能预期吗?合适的话也可以补充下相关场景细节。

daoshouchen commented 1 week ago

问题1: 1、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude from call_record r left join call_summary s on r.call_id = s.call_id where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15' scdb-log:结果耗时 [fetch] 10 rows in set: (3m35.551977488s)

2024-10-21 11:07:48.102111 INFO submit_and_get_handler.go:307 Execution Plan: digraph G { 0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select r.call_id,r.lrrq from bob.call_record as r where r.call_id='ID-27251718675267_7782_2024-09-13_10:25:15',table_refs:[bob.call_record],],party:[bob,]}"] 1 [label="runsql:{in:[],out:[Out:{t_2,t_3,},],attr:[sql:select s.attitude,s.call_id from alice.call_summary as s,table_refs:[alice.call_summary],],party:[alice,]}"] 2 [label="join:{in:[Left:{t_0,},Right:{t_3,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[bob alice],join_type:0,psi_algorithm:0,],party:[bob,alice,]}"] 3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,},],attr:[],party:[bob,]}"] 4 [label="filter_by_index:{in:[Data:{t_2,t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_8,t_9,},],attr:[],party:[alice,]}"] 5 [label="copy:{in:[In:{t_6,},],out:[Out:{t_10,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 6 [label="copy:{in:[In:{t_7,},],out:[Out:{t_12,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 7 [label="publish:{in:[In:{t_10,t_12,t_8,},],out:[Out:{t_11,t_13,t_14,},],attr:[],party:[alice,]}"] 0 -> 2 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_1:{lrrq:PRIVATE:DATETIME}"] 1 -> 2 [label = "t_3:{call_id:PRIVATE:STRING}"] 1 -> 4 [label = "t_2:{attitude:PRIVATE:STRING}"] 1 -> 4 [label = "t_3:{call_id:PRIVATE:STRING}"] 2 -> 3 [label = "t_4:{call_id:PRIVATE:INT64}"] 2 -> 4 [label = "t_5:{call_id:PRIVATE:INT64}"] 3 -> 5 [label = "t_6:{call_id:PRIVATE:STRING}"] 3 -> 6 [label = "t_7:{lrrq:PRIVATE:DATETIME}"] 4 -> 7 [label = "t_8:{attitude:PRIVATE:STRING}"] 5 -> 7 [label = "t_10:{call_id:PRIVATE:STRING}"] 6 -> 7 [label = "t_12:{lrrq:PRIVATE:DATETIME}"]

2、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude from call_record r left join call_summary s on r.call_id = s.call_id where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15' and s.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15' scdb-log:结果耗时 [fetch] 10 rows in set: (3m18.077363226s)

2024-10-21 11:36:30.102111 INFO submit_and_get_handler.go:307 Execution Plan: digraph G { 0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select r.call_id,r.lrrq from bob.call_record as r where r.call_id='ID-27251718675267_7782_2024-09-13_10:25:15',table_refs:[bob.call_record],],party:[bob,]}"] 1 [label="runsql:{in:[],out:[Out:{t_2,t_3,},],attr:[sql:select s.attitude,s.call_id from alice.call_summary as s,table_refs:[alice.call_summary],],party:[alice,]}"] 2 [label="join:{in:[Left:{t_0,},Right:{t_3,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[bob alice],join_type:1,psi_algorithm:0,],party:[bob,alice,]}"] 3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,},],attr:[],party:[bob,]}"] 4 [label="filter_by_index:{in:[Data:{t_2,t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_8,t_9,},],attr:[],party:[alice,]}"] 5 [label="make_constant:{in:[],out:[Out:{t_10,},],attr:[scalar:ID-27251718675267_7782_2024-09-13_10:25:15,to_status:1,],party:[bob,alice,]}"] 6 [label="broadcast:{in:[In:{t_10,},ShapeRefTensor:{t_9,},],out:[Out:{t_11,},],attr:[],party:[alice,]}"] 7 [label="Equal:{in:[Left:{t_9,},Right:{t_11,},],out:[Out:{t_12,},],attr:[],party:[alice,]}"] 8 [label="copy:{in:[In:{t_12,},],out:[Out:{t_13,},],attr:[input_party_codes:alice,output_party_codes:bob,],party:[alice,bob,]}"] 9 [label="apply_filter:{in:[Filter:{t_13,},In:{t_6,t_7,},],out:[Out:{t_14,t_15,},],attr:[],party:[bob,]}"] 10 [label="apply_filter:{in:[Filter:{t_12,},In:{t_8,t_9,},],out:[Out:{t_16,t_17,},],attr:[],party:[alice,]}"] 11 [label="copy:{in:[In:{t_14,},],out:[Out:{t_18,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 12 [label="copy:{in:[In:{t_15,},],out:[Out:{t_20,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 13 [label="publish:{in:[In:{t_18,t_20,t_16,},],out:[Out:{t_19,t_21,t_22,},],attr:[],party:[alice,]}"] 0 -> 2 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_1:{lrrq:PRIVATE:DATETIME}"] 1 -> 2 [label = "t_3:{call_id:PRIVATE:STRING}"] 1 -> 4 [label = "t_2:{attitude:PRIVATE:STRING}"] 1 -> 4 [label = "t_3:{call_id:PRIVATE:STRING}"] 10 -> 13 [label = "t_16:{attitude:PRIVATE:STRING}"] 11 -> 13 [label = "t_18:{call_id:PRIVATE:STRING}"] 12 -> 13 [label = "t_20:{lrrq:PRIVATE:DATETIME}"] 2 -> 3 [label = "t_4:{call_id:PRIVATE:INT64}"] 2 -> 4 [label = "t_5:{call_id:PRIVATE:INT64}"] 3 -> 9 [label = "t_6:{call_id:PRIVATE:STRING}"] 3 -> 9 [label = "t_7:{lrrq:PRIVATE:DATETIME}"] 4 -> 10 [label = "t_8:{attitude:PRIVATE:STRING}"] 4 -> 10 [label = "t_9:{call_id:PRIVATE:STRING}"] 4 -> 6 [label = "t_9:{call_id:PRIVATE:STRING}"] 4 -> 7 [label = "t_9:{call_id:PRIVATE:STRING}"] 5 -> 6 [label = "t_10:{constant_data:PUBLIC:STRING}"] 6 -> 7 [label = "t_11:{constant_data:PRIVATE:STRING}"] 7 -> 10 [label = "t_12:{Equal_out:PRIVATE:BOOL}"] 7 -> 8 [label = "t_12:{Equal_out:PRIVATE:BOOL}"] 8 -> 9 [label = "t_13:{Equal_out:PRIVATE:BOOL}"] 9 -> 11 [label = "t_14:{call_id:PRIVATE:STRING}"] 9 -> 12 [label = "t_15:{lrrq:PRIVATE:DATETIME}"] }

3、select r.call_id as call_id, r.lrrq as lrrq, s.attitude as attitude from call_record r inner join call_summary s on r.call_id = s.call_id where r.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15' and s.call_id = 'ID-27251718675267_7782_2024-09-13_10:25:15'

scdb-log:结果耗时 [fetch] 10 rows in set: (243.214075ms)

2024-10-21 11:40:20.102111 INFO submit_and_get_handler.go:307 Execution Plan: digraph G { 0 [label="runsql:{in:[],out:[Out:{t_0,t_1,},],attr:[sql:select r.call_id,r.lrrq from bob.call_record as r where r.call_id='ID-27251718675267_7782_2024-09-13_10:25:15',table_refs:[bob.call_record],],party:[bob,]}"] 1 [label="runsql:{in:[],out:[Out:{t_2,t_3,},],attr:[sql:select s.attitude,s.call_id from alice.call_summary as s where s.call_id='ID-27251718675267_7782_2024-09-13_10:25:15',table_refs:[alice.call_summary],],party:[alice,]}"] 2 [label="join:{in:[Left:{t_0,},Right:{t_3,},],out:[LeftJoinIndex:{t_4,},RightJoinIndex:{t_5,},],attr:[input_party_codes:[bob alice],join_type:0,psi_algorithm:0,],party:[bob,alice,]}"] 3 [label="filter_by_index:{in:[Data:{t_0,t_1,},RowsIndexFilter:{t_4,},],out:[Out:{t_6,t_7,},],attr:[],party:[bob,]}"] 4 [label="filter_by_index:{in:[Data:{t_2,t_3,},RowsIndexFilter:{t_5,},],out:[Out:{t_8,t_9,},],attr:[],party:[alice,]}"] 5 [label="copy:{in:[In:{t_6,},],out:[Out:{t_10,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 6 [label="copy:{in:[In:{t_7,},],out:[Out:{t_12,},],attr:[input_party_codes:bob,output_party_codes:alice,],party:[bob,alice,]}"] 7 [label="publish:{in:[In:{t_10,t_12,t_8,},],out:[Out:{t_11,t_13,t_14,},],attr:[],party:[alice,]}"] 0 -> 2 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_0:{call_id:PRIVATE:STRING}"] 0 -> 3 [label = "t_1:{lrrq:PRIVATE:DATETIME}"] 1 -> 2 [label = "t_3:{call_id:PRIVATE:STRING}"] 1 -> 4 [label = "t_2:{attitude:PRIVATE:STRING}"] 1 -> 4 [label = "t_3:{call_id:PRIVATE:STRING}"] 2 -> 3 [label = "t_4:{call_id:PRIVATE:INT64}"] 2 -> 4 [label = "t_5:{call_id:PRIVATE:INT64}"] 3 -> 5 [label = "t_6:{call_id:PRIVATE:STRING}"] 3 -> 6 [label = "t_7:{lrrq:PRIVATE:DATETIME}"] 4 -> 7 [label = "t_8:{attitude:PRIVATE:STRING}"] 5 -> 7 [label = "t_10:{call_id:PRIVATE:STRING}"] 6 -> 7 [label = "t_12:{lrrq:PRIVATE:DATETIME}"] } 问题2 200W数据从数据库拉到内存消耗时间就很长,目前测试mysql库性能不会很强。

daoshouchen commented 1 week ago

另外一个问题:经常出现 too many connections,这应该是在上文计算耗时较长的情况下出现。这是哪个链接太多?

jingshi-ant commented 1 week ago

看了下下推的代码逻辑,left join对于右表没有做on condition的下推,inner join有,所以会有区别;

too many connections有上下文吗?可能是mysql,也可能是brpc

daoshouchen commented 1 week ago

1、 sql 1、2 的执行计划是可以优化吗? 2、too many connections 看着是引擎的,可以调整参数吗?我在配置里没有找到相关的配置项;上下文 RunExecutionPlan run jobs(f90acfda-8f8a-11ef-90dd-0242ac140002) failed, catch std::exception=[engine/datasource/odbc_adaptor.cc:56] catch unexpected Poco::Data::DataException: Connection attempt failed: Too many connections".

jingshi-ant commented 1 week ago
  1. 短期内您可以使用subquery的方式手动过滤;pushdown的优化我们需要再研究下(简单检索了下,left join中right join的下推很多数据库都没做,需要check下方案)
  2. 这个是链接数据库的报错,应该是mysql server side限制了连接。
jingshi-ant commented 1 week ago

https://stackoverflow.com/questions/55893354/predicate-pushdown-vs-on-clause