matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: [big-data-test] insert into select performance degradation. #17143

Open Ariznawlll opened 1 week ago

Ariznawlll commented 1 week ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

9f1c2df

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

image

Last week's normal job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9581145106 commit: 2440b884e23da215733c181b458cefd1ccb32126

<img width="1371" alt="image" src="https://github.com/matrixorigin/matrixone/assets/108530700/97fe45fa-9944-458d-a520-b8e9625802bc">

log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22N7w%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240619%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718755200000%22,%22to%22:%221718927999000%22%7D%7D%7D&schemaVersion=1&orgId=1

profile : https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22N7w%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240619%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22process_cpu:cpu:nanoseconds:cpu:nanoseconds%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221718816160000%22,%22to%22:%221718819174000%22%7D%7D%7D&schemaVersion=1&orgId=1

Job with the performance regression: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9640573308 commit: 9f1c2dfa908bc08f58cbf59dcb14b4f3bdd16a1c

image

log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22N7w%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240624%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221719229434000%22,%22to%22:%221719238361000%22%7D%7D%7D&schemaVersion=1&orgId=1

profile: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22N7w%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240624%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22process_cpu:cpu:nanoseconds:cpu:nanoseconds%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221719229434000%22,%22to%22:%221719238361000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

trigger big-data-test on tke

Additional information

No response

jensenojs commented 1 week ago

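The regressed job was running insert into table_with_com_pk_index_for_insert_1B, starting at 2024 11:43:54 and ending at 2024 14:12:41.

Its CPU profile screenshot is shown below:

image

For comparison, the CPU profile screenshot from the earlier jobs is shown below:

image

GC time has spiked.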

jensenojs commented 1 week ago

9f1c2dfa9 fix delete page error and some typos (#17087)  <- report
8eb9e709b [opt] : remove output operator after fuzzy filter (#17077)
79e9e4394 Fix typos in README (#16402)
57a9e8776 fileservice: use noop retryer in aws sdk (#17095)
c88c0978c fileservice: more slow logs (#17100)
e520a422a [bug] stats: fix the leak of goroutine (#17071)
72b187e61 fix lock table move failed (#17092)
067da2e8f Fix bugs[dup/ww/data-lost] when unsubscribe table (#17005)
72d4b89e8 [bug] fix ut TestPauseResumeDaemonTask (#17028)
eaf78572f [Cherry-pick] Refactor UnresolvedName (#17040)
9d68af958 Fix memory leak   (#17069)
5a73df781 mo-service: data dir compatibility fixes (#17062)
8d1d0fb68 preallocate transfer page (#17056)
d39bec0ca [opt] : optmize duplicate check memory usage for sql like insert into t1 selct from t2 (#17020)
e36343faa fix: external free (#17074)
eb1f0f578 [enhancement] logtail: change the default send timeout, add logs (#17059)
d6a322621 [refator]move all state type variables to arg.ctr in Operator (#16951)
a014b808c adding logic for prefix_in in case vector items are not unique. (#17041)
c6fa50c4f adjust timeout for orphan transactions (#17044)
46e017801 fix stats bug for fuzzy filter (#17051)
b8ab81b9a add system busy monitor (#17049)
8dbcb5093 debug load local (#17046)
15a2c3aef fix some little bug of stats (#17035)
a19e6d696 fileservice: add event logger and log slow S3.Read (#17021)
1ac74cfb9 change all filter related "or" expr from binary function to multi function (#16992)
3dbabeec9 [enhancement] proxy: clean some logs (#17031)
1e35ff0a9 retry allocate if orphan txn (#17030)
6448ba46a objectio: copy to avoid memory leak (#17029)
9334ffea3 reduce memory allocation by NewWithAnalyze (#17019)
a7f6835f2 logservice: fix memory leak (#17023)
a40cae885 [bug] filter out the CNs with different commit ID. (#17003)
f72a8464a fix snapshot read bug (#17016)
5cba9c51f malloc: various changes and optimizations  (#16990)
8c4f92bfc dashboard: fix fileservice board (#17008)
f223c892c Fix error refers for table meta (#16999)
2440b884e bug fix: load s3 panic (#17009) <- last good

Ariznawlll commented 1 week ago

[0626] 1.2-dev: 2545ce4c726d5c9e64f7a6b8307b5da0e6da950d job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9660414495/job/26646571031

image

profile: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22cg0%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240625%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22memory:inuse_space:bytes:space:bytes%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221719342869000%22,%22to%22:%221719349741000%22%7D%7D%7D&schemaVersion=1&orgId=1

jensenojs commented 1 week ago

[0626] 1.2-dev: 2545ce4 job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9660414495/job/26646571031

image

profile: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22cg0%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240625%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22memory:inuse_space:bytes:space:bytes%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221719342869000%22,%22to%22:%221719349741000%22%7D%7D%7D&schemaVersion=1&orgId=1

From the pprof, this looks similar to the performance regression on main.

image

jensenojs commented 1 week ago

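The proximate cause is the recent change to the GOLIMIT parameter. The root cause is still that memory pressure during insert into select is too high, which leads to excessive GC.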

Still working on it.

jensenojs commented 6 days ago

Waiting for the PR to be merged.

jensenojs commented 5 days ago

I asked Wei Lu (魏璐) to run the test on main with the fuzzy filter memory optimization merged. GC pressure is lower, and the run time is back under one hour.

https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9698664379/job/26775565464

profile link : https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22cg0%22:%7B%22datasource%22:%22pyroscope%22,%22queries%22:%5B%7B%22groupBy%22:%5B%5D,%22labelSelector%22:%22%7Bnamespace%3D%5C%22mo-big-data-20240627%5C%22%7D%22,%22queryType%22:%22both%22,%22refId%22:%22A%22,%22profileTypeId%22:%22memory:inuse_space:bytes:space:bytes%22,%22datasource%22:%7B%22type%22:%22grafana-pyroscope-datasource%22,%22uid%22:%22pyroscope%22%7D%7D%5D,%22range%22:%7B%22from%22:%221719519314000%22,%22to%22:%221719522812000%22%7D%7D%7D&schemaVersion=1&orgId=1

cpu :

image

inuse-memory :

image

jensenojs commented 5 days ago

image

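However, I cannot be sure that my optimization is what fixed this. GC cost is related to the number of objects, while the fuzzy filter accounts for a large share of memory consumption, which may not have much direct connection to GC. Besides my optimization, the PRs that may have helped fix this issue are: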

This may need @ouyuanning (远宁) to evaluate.