matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: connection lost when running SSB 100G Q4.3 no filter query #7505

Open aressu1985 opened 1 year ago

aressu1985 commented 1 year ago

Is there an existing issue for the same bug?

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): b8d83a2d1b04d2533242cf80a0a0fe671ed23fbb
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

The SSB 100G Q4.3 no-filter query failed with a lost connection from MO. The suspicious errors are:

2023/01/07 23:09:04.589954 +0800 INFO frontend/util.go:499 query trace status {"connection_id": 1097, "statement": "select task_id, task_metadata_id, task_metadata_executor, task_metadata_context, task_metadata_option, task_parent_id, task_status, task_runner, task_epoch, last_heartbeat, result_code, error_msg, create_at, end_at from mo_task.sys_async_task where task_status = 0 order by task_id", "status": "success", "span": {"trace_id": "399be8dd-8e9d-11ed-902e-b07b25f8b524", "kind": "statement"}, "session_info": "connectionId 1097"}

2023/01/07 23:09:05.131382 +0800 ERROR fileservice/local_etl_fs.go:73 error: file sys/logs/2023/01/07/rawlog/1673104144_7c4dccb4-4d3c-41f8-b482-5251dc7a41bf_ALL.csv already exists
github.com/matrixorigin/matrixone/pkg/fileservice.(*LocalETLFS).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/local_etl_fs.go:73
github.com/matrixorigin/matrixone/pkg/fileservice.(*FileServices).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/file_services.go:116
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:134
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).WriteString
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:158
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func1
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/trace/buffer_pipe.go:111
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func2
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/trace/buffer_pipe.go:124
github.com/matrixorigin/matrixone/pkg/util/export.(*bufferExportReq).handle
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/batch_processor.go:124
github.com/matrixorigin/matrixone/pkg/util/export.(*MOCollector).doExport
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/batch_processor.go:359

2023/01/07 23:09:05.131477 +0800 ERROR fileservice/local_etl_fs.go:73 error: file sys/logs/2023/01/07/rawlog/1673104144_7c4dccb4-4d3c-41f8-b482-5251dc7a41bf_ALL.csv already exists
github.com/matrixorigin/matrixone/pkg/fileservice.(*LocalETLFS).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/local_etl_fs.go:73
github.com/matrixorigin/matrixone/pkg/fileservice.(*FileServices).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/fileservice/file_services.go:116
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).Write
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:134
github.com/matrixorigin/matrixone/pkg/util/export.(*FSWriter).WriteString
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/util/export/fs_writer.go:158
github.com/matrixorigin/matrixone/pkg/util/trace.batchCSVHandler.NewItemBatchHandler.func1

2023/01/07 23:10:08.331652 +0800 ERROR hakeeper-client-backend morpc/backend.go:472 read from backend failed {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3", "error": "read tcp4 127.0.0.1:38894->127.0.0.1:32001: use of closed network connection"}

2023/01/07 23:10:08.331855 +0800 ERROR hakeeper-client-backend morpc/backend.go:477 read loop stopped {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3"}

2023/01/07 23:10:08.331914 +0800 ERROR hakeeper-client-backend v2@v2.0.3-0.20221212132037-abf2d4c05484/session.go:496 close connection failed {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3", "session-id": 0, "error": "close tcp4 127.0.0.1:38894->127.0.0.1:32001: use of closed network connection"}

2023/01/07 23:10:08.331945 +0800 INFO hakeeper-client-backend morpc/backend.go:364 write loop stopped {"remote": "127.0.0.1:32001", "backend-id": "529afce7-b795-435c-ba76-8268aa8756a3"}

2023/01/07 23:10:08.396192 +0800 ERROR rpc-client[hakeeper-client([connectToHAKeeper])] morpc/client.go:334 gc inactive backends task stopped

2023/01/07 23:10:08.396249 +0800 ERROR morpc/message.go:41 error: invalid input: timeout has invalid deadline
github.com/matrixorigin/matrixone/pkg/common/morpc.RPCMessage.GetTimeoutFromContext
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/morpc/message.go:41
github.com/matrixorigin/matrixone/pkg/common/morpc.(*server).startWriteLoop.func1
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/morpc/server.go:286
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1
    /data1/action-runner/_work/mo-nightly-regression/mo-nightly-regression/head/pkg/common/stopper/stopper.go:259

2023/01/07 23:10:08.437851 +0800 ERROR hakeeper-client-backend morpc/backend.go:472 read from backend failed {"remote": "127.0.0.1:32001", "backend-id": "7d1ad83c-48e7-4675-92e5-9d3a37fc8838", "error": "read tcp4 127.0.0.1:51014->127.0.0.1:32001: use of closed network connection"}

2023/01/07 23:10:08.437999 +0800 ERROR hakeeper-client-backend morpc/backend.go:477 read loop stopped {"remote": "127.0.0.1:32001", "backend-id": "7d1ad83c-48e7-4675-92e5-9d3a37fc8838"}

mo-ssb-lostcon.tar.gz

Expected Behavior

No response

Steps to Reproduce

Run the SSB 100G Q4.3 no-filter test:
select year(d_datekey) as year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit, c_region, s_nation, p_category
from lineorder
join date on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
group by year(d_datekey), s_city, p_brand, c_region, s_nation, p_category;

Additional information

No response

badboynt1 commented 1 year ago

Tried this query again and it ran successfully, but the memory cost is too high. Probably the same issue as https://github.com/matrixorigin/matrixone/issues/7292

badboynt1 commented 1 year ago

Cannot reproduce now; will keep working on it.

fengttt commented 1 year ago

select year(d_datekey) as year, s_city, p_brand, sum(lo_revenue) - sum(lo_supplycost) as profit, 
c_region, s_nation, p_category 
from lineorder join dates on lo_orderdate = d_datekey join customer on lo_custkey = c_custkey join supplier on lo_suppkey = s_suppkey join part on lo_partkey = p_partkey 
group by year(d_datekey), s_city, p_brand, c_region, s_nation, p_category order by year, s_city, p_brand;

This is Q4.3, and the plan is in the attached "plan" image.

fengttt commented 1 year ago

According to @badboynt1, for 1G of data the query produces 5 million rows. The main memory consumer is the hash agg, which should use roughly 500MB of memory. For 100G that scales to 50G; the following discussion is all based on this 50G assumption.
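
A back-of-envelope check of those numbers (the ~100 bytes per group is my own assumption for keys plus aggregate states, not a figure from this thread):

```go
package main

import "fmt"

func main() {
	const groups = 5_000_000  // ~5 million result rows at SF=1, per @badboynt1
	const bytesPerGroup = 100 // assumed: group keys + agg states + hash-table overhead
	fmt.Printf("SF=1:   ~%d MB\n", groups*bytesPerGroup/(1<<20))     // ~476 MB, i.e. roughly 500MB
	fmt.Printf("SF=100: ~%d GB\n", groups*bytesPerGroup*100/(1<<30)) // ~46 GB, i.e. roughly 50G
}
```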

There are many problems with the current execution engine.

  1. For this query, the execution should be pipelined and the agg should be executed in parallel. So suppose we have 2 pipelines running the hash agg: each should consume 25GB, so the total is still 50G.
  2. If we are doing a 2-stage agg, the first stage should not use more than 1G (I would cap it at 64MB or so) by streaming at the bottom (see the sketch after this comment).
  3. We should not make numerous small allocations.
  4. We do not have memory accounting for the hash agg.
  5. If we run out of memory budget, we should spill.

I am pretty sure all of these apply to hash join as well, so we need to systematically take out all these problems.
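
A minimal Go sketch of the bounded first stage from point 2 (my own illustration; `partial` and `stage1Budget` are hypothetical names, not MatrixOne APIs):

```go
package agg

// stage1Budget caps the first-stage hash table, per point 2 above
// (64MB is the figure suggested in the comment).
const stage1Budget = 64 << 20

// partial is a hypothetical first-stage aggregator: it folds rows into
// a small hash table and streams partial results downstream whenever
// the budget is exceeded, so stage 1 stays bounded no matter how many
// groups the input produces.
type partial struct {
	groups map[string]int64 // group key -> partial SUM state
	bytes  int              // rough memory accounting (point 4)
}

func newPartial() *partial {
	return &partial{groups: make(map[string]int64)}
}

// consume adds one (key, value) pair; emit sends a batch of partials
// to the final, parallel aggregation stage (point 1).
func (p *partial) consume(key string, v int64, emit func(map[string]int64)) {
	if _, ok := p.groups[key]; !ok {
		p.bytes += len(key) + 8 // assumed per-entry cost: key bytes + 8-byte state
	}
	p.groups[key] += v
	if p.bytes >= stage1Budget {
		emit(p.groups) // flush partials downstream and start a fresh table
		p.groups = make(map[string]int64)
		p.bytes = 0
	}
}
```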

badboynt1 commented 1 year ago

[cn.frontend] GuestMmuLimitation = 1000000000

This means mpool can allocate at most 1GB of memory in one session; beyond that, the SQL reports an error.
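
For illustration, a minimal sketch of what such a per-session cap implies (the `sessionPool` type is hypothetical; this is not the real mpool API):

```go
package mmu

import "fmt"

// sessionPool mimics a per-session allocator with a hard cap like
// GuestMmuLimitation: every allocation is counted against the budget,
// and an over-budget request fails the SQL instead of growing unbounded.
type sessionPool struct {
	limit int64 // e.g. 1_000_000_000 from [cn.frontend] GuestMmuLimitation
	used  int64
}

func (p *sessionPool) alloc(n int64) ([]byte, error) {
	if p.used+n > p.limit {
		return nil, fmt.Errorf("mpool: session limit %d exceeded", p.limit)
	}
	p.used += n
	return make([]byte, n), nil
}

func (p *sessionPool) free(b []byte) {
	p.used -= int64(len(b))
}
```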

badboynt1 commented 1 year ago

Testing the GuestMmuLimitation config.

fengttt commented 1 year ago

This cannot be fixed/tested by config tweaks. Must fix the planner and executor.

nnsgmsone commented 1 year ago

I'll fix this in 0.8.

nnsgmsone commented 1 year ago

Let's convert this to feature #388, @aressu1985.

nnsgmsone commented 1 year ago

I will deal with the memory issue first

nnsgmsone commented 1 year ago

Same as https://github.com/matrixorigin/matrixone/issues/7891

nnsgmsone commented 1 year ago

Will handle this issue after the plan refactor.

fengttt commented 1 year ago

7891 is done. So this one should be done as well?

nnsgmsone commented 1 year ago

This needs spill to run well; on a 64GB machine the memory usage is still too high. The root cause is that, for all of the data, we have to keep an in-memory copy of the hashtable plus key vectors, so the resource usage is quite redundant.
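
A hedged sketch of the partition-and-spill idea this comment points at (my illustration only; `nPart`, `budget`, and the tab-separated scratch format are assumptions, not MatrixOne code):

```go
package spill

import (
	"fmt"
	"hash/fnv"
	"os"
)

const (
	nPart  = 16       // number of hash partitions (assumed)
	budget = 64 << 20 // per-partition in-memory budget (assumed)
)

// part is one hash partition: it aggregates in memory until it exceeds
// its budget, then switches to writing raw rows to a scratch file that
// is re-aggregated after the first pass, so peak memory is bounded by
// the budget rather than by the total number of groups.
type part struct {
	groups map[string]int64
	bytes  int
	file   *os.File // non-nil once this partition has spilled
}

func add(parts *[nPart]part, key string, v int64) error {
	h := fnv.New32a()
	h.Write([]byte(key))
	p := &parts[h.Sum32()%nPart]
	if p.file != nil { // already spilled: append the raw row to disk
		_, err := fmt.Fprintf(p.file, "%s\t%d\n", key, v)
		return err
	}
	if p.groups == nil {
		p.groups = make(map[string]int64)
	}
	if _, ok := p.groups[key]; !ok {
		p.bytes += len(key) + 8 // assumed per-entry cost
	}
	p.groups[key] += v
	if p.bytes > budget {
		// Over budget: route this partition's future rows to a scratch
		// file (a real engine would also flush the in-memory partials).
		f, err := os.CreateTemp("", "agg-spill-*")
		if err != nil {
			return err
		}
		p.file = f
	}
	return nil
}
```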

ouyuanning commented 10 months ago

@aunjgr Once spill is ready, this should be fine. If other operator optimizations are still needed at that point, we can move it back.

aunjgr commented 9 months ago

not working on it today

aunjgr commented 8 months ago

Working on the China Mobile IoT issue.
