matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: [2.0-dev big data regression] create index report 'rpc timeout'. #20162

Open Ariznawlll opened 2 days ago

Ariznawlll commented 2 days ago

Is there an existing issue for the same bug?

Branch Name

2.0-dev

Commit ID

b76c1a8db04cfefde6513c48c3948b72457c4c82

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

Job URL: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11893932727/job/33165054148


log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22ryV%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-nightly-b76c1a8-20241118%5C%22%7D%20%7C%3D%20%60rpc%20timeout%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731975302000%22,%22to%22:%221731975579000%22%7D%7D%7D&schemaVersion=1&orgId=1
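The `from` and `to` parameters in the log URL above are Unix epoch milliseconds. To line them up with the wall-clock times quoted in the logs, a minimal sketch follows; the helper name `msToLocal` is hypothetical, and the UTC+8 offset is an assumption about the timezone the log timestamps are shown in.

```go
package main

import (
	"fmt"
	"time"
)

// msToLocal is a hypothetical helper: it converts Unix epoch milliseconds
// to a wall-clock string at a fixed UTC offset (in hours).
func msToLocal(ms int64, offsetHours int) string {
	zone := time.FixedZone("", offsetHours*3600)
	return time.UnixMilli(ms).In(zone).Format("2006-01-02 15:04:05")
}

func main() {
	// Values copied from the "range" field of the Grafana URL.
	// The +8 offset is an assumption, not something stated in the issue.
	fmt.Println(msToLocal(1731975302000, 8)) // from: 2024-11-19 08:15:02
	fmt.Println(msToLocal(1731975579000, 8)) // to:   2024-11-19 08:19:39
}
```

Under that assumption, the queried window runs from 08:15:02 to 08:19:39.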

If you need profile information, please contact me.

Expected Behavior

No response

Steps to Reproduce

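Data volume: 1 billion rows, about 330 GB

SQL statement that failed: create index `col3` on big_data_test.table_basic_for_alter_1B(col3);

Last known-good commit: 0079c99cc6e2a244047c41a42d239058ea8d4513 (not sure whether the failure reproduces every time)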

Additional information

No response

sukki37 commented 2 days ago

  1. From the MO logs, the statement create index `col3` on big_data_test.table_basic_for_alter_1B(col3) failed at 08:15:14.928, but the client did not report the error until 3 minutes later. Log: https://grafana.ci.matrixorigin.cn/goto/DONZbM7NR?orgId=1

  2. Between 08:15:14 and 08:15:45, when the client was sending no new requests, the CN's CPU was almost fully utilized. We need to determine what workload the CN was processing during that window. Dashboards: https://grafana.ci.matrixorigin.cn/goto/kS0b-GnHR?orgId=1 https://grafana.ci.matrixorigin.cn/d/cluster-detail-namespaced/cluster-detail-namespaced?orgId=1&var-namespace=mo-big-data-nightly-b76c1a8-20241118&var-account=All&var-interval=%24__auto_interval_interval&var-cluster=.*&var-loki=loki&from=1731975300000&to=1731975579000&viewPanel=4

  3. Based on the error logs, it is highly likely that a network error or packet loss occurred, so the data was not sent successfully. This may be related to the CPU being fully utilized.

iamlinjunhong commented 14 hours ago

The busy CPU caused the rpc timeout. We could retry, but it might keep retrying indefinitely.