matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: OOM problem, plan2 #2795

Closed sukki37 closed 2 years ago

sukki37 commented 2 years ago

Is there an existing issue for the same bug?

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): 329724b
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

During bvt, mo-server is killed by the OS because of OOM. The last SQL statement executed is select space(4294967295);

[  486.395995] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  486.395998] [    191]     0   191    17477      977   106496       72          -250 systemd-journal
[  486.396002] [    226]     0   226     4917      846    61440      146         -1000 systemd-udevd
[  486.396004] [    274]     0   274      624      132    40960       11             0 bpfilter_umh
[  486.396006] [    303]     0   303     1032      678    45056       26             0 hv_kvp_daemon
[  486.396008] [    402]     0   402    72116     6568   110592        0         -1000 multipathd
[  486.396011] [    455]     0   455     2076      594    53248      759             0 haveged
[  486.396013] [    537]   100   537     8729      842    77824       62             0 systemd-network
[  486.396014] [    539]   101   539     5970     1388    86016      342             0 systemd-resolve
[  486.396016] [    653]     0   653    66406      746   106496       68             0 accounts-daemon
[  486.396018] [    665]     0   665     2139      543    57344       13             0 cron
[  486.396020] [    667]   112   667     1207      527    53248       15             0 chronyd
[  486.396021] [    668]   112   668     1174       41    53248        8             0 chronyd
[  486.396023] [    669]   103   669     1900      629    49152       29          -900 dbus-daemon
[  486.396025] [    675]     0   675    22506      705    61440       10             0 irqbalance
[  486.396027] [    677]     0   677     7480     2292    94208      572             0 networkd-dispat
[  486.396029] [    679]     0   679    73718     3740   266240      704             0 php-fpm7.4
[  486.396031] [    681]     0   681    73747     3219   270336     1074             0 php-fpm8.0
[  486.396033] [    683]     0   683    73962     3597   258048      799             0 php-fpm8.1
[  486.396035] [    684]     0   684    65254      951    94208       56             0 polkitd
[  486.396036] [    685]   104   685    58173      663    86016      780             0 rsyslogd
[  486.396038] [    686]     0   686   937381     4816   507904     6472             0 provisioner
[  486.396041] [    690]     0   690   239061     4120   266240     1618          -900 snapd
[  486.396042] [    693]     0   693     4115      772    69632       30             0 systemd-logind
[  486.396044] [    701]     0   701   108966     1582   139264      278             0 udisksd
[  486.396046] [    703]     0   703     7370     1543    94208     2223             0 python3
[  486.396048] [    704]     0   704      953      514    45056        4             0 atd
[  486.396050] [    706]     0   706   373735     4032   286720     2762          -999 containerd
[  486.396052] [    735]     0   735     3048      787    69632       52         -1000 sshd
[  486.396054] [    806]     0   806    86809     1041   110592       53             0 ModemManager
[  486.396056] [    808]     0   808     1840      432    49152        8             0 agetty
[  486.396057] [    810]    33   810    73358     3135   204800     3374             0 mono
[  486.396059] [    820]     0   820     1459      416    49152        0             0 agetty
[  486.396061] [    867]     0   867   110691     5291   143360      317             0 python3
[  486.396063] [    879]    33   879    74075     2498   237568      787             0 php-fpm8.1
[  486.396065] [    880]    33   880    74075     2463   237568      822             0 php-fpm8.1
[  486.396067] [    885]    33   885    73818     2444   258048      723             0 php-fpm7.4
[  486.396068] [    886]    33   886    73818     2469   258048      698             0 php-fpm7.4
[  486.396071] [    917]    33   917    73843     2171   245760     1069             0 php-fpm8.0
[  486.396072] [    918]    33   918    73843     2172   245760     1068             0 php-fpm8.0
[  486.396074] [    939]     0   939   341023     5455   356352     3259          -500 dockerd
[  486.396076] [   1494]     0  1494   321102     1891   167936      640             0 provjobd
[  486.396078] [   1502]  1001  1502   918214     2850   503808    10911             0 Runner.Listener
[  486.396080] [   1518]  1001  1518   925196      875   557056    16361             0 Runner.Worker
[  486.396081] [   6512]  1001  6512  3033958  1572644 20443136   924640           500 mo-server
[  486.396083] [   6622]  1001  6622     2176      449    61440       73           500 bash
[  486.396085] [   6653]  1001  6653     2210      474    57344       88           500 bash
[  486.396087] [   6657]  1001  6657   811943      462   811008    69652           500 java
[  486.396089] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/runner-provisioner.service,task=mo-server,pid=6512,uid=1001
[  486.396124] Out of memory: Killed process 6512 (mo-server) total-vm:12135832kB, anon-rss:6290576kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:19964kB oom_score_adj:500

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

sukki37 commented 2 years ago

Additional information: I ran bvt on our hosted server and recorded the memory usage of mo-server once per second (while true; do ps -e -o 'pid,comm,args,pcpu,rsz,vsz,stime,user,uid' | grep 740300 | grep -v 'grep' | sort -nrk5; sleep 1; done). The resident memory (rsz column, in KiB) peaks at about 22G during the bvt run, which seems unreasonable.

 740300 mo-server       ./mo-server system_vars_con  5.4 65572 2242216 00:34 ubuntu   1000
 740300 mo-server       ./mo-server system_vars_con  5.4 65572 2242216 00:34 ubuntu   1000
 740300 mo-server       ./mo-server system_vars_con  5.6 251192 2660544 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  6.0 274012 2660800 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  6.3 554732 3269668 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  6.6 462244 3269668 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  7.0 396000 3269668 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  7.4 691064 3269668 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  7.7 489880 3269924 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  8.0 546816 3270180 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  8.4 656612 3270180 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  8.7 654332 3270180 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  9.0 737572 3270180 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  9.3 759428 3405740 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  9.6 968140 3406060 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con  9.9 951752 3406060 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con 10.2 995488 3406060 00:34 ubuntu  1000
 740300 mo-server       ./mo-server system_vars_con 10.6 1150892 3541364 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 10.8 1115632 3541364 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 11.0 1436228 3811972 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 11.1 1436484 3811972 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 11.2 1581544 4082964 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 11.5 1800360 4082964 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 11.8 1861472 4285920 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 12.1 1893928 4285920 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 12.3 1899880 4285920 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 12.5 1974584 4285920 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 12.7 2370228 8683300 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 12.9 5723192 8683300 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 13.2 6348292 8683300 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 13.4 6348292 8683300 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 13.8 7632324 13013028 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 14.3 9602500 13013028 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 14.7 11566756 17342756 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 14.9 13547172 17342756 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 15.1 15007712 17342756 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 15.1 15007712 17342756 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 15.3 15646784 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 15.6 16281668 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 15.8 16760828 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 16.0 17291220 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 16.2 17653616 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 16.4 18106240 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 16.6 18515304 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 16.9 18948000 21672484 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 17.1 19512712 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 17.3 20072856 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 17.5 20655956 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 17.7 21251456 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 17.9 21845484 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 18.1 22440804 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 18.3 23005500 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 18.5 23665100 26069864 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 18.8 21593552 27152296 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 19.0 18884048 27152296 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 19.3 4774012 27152296 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 19.6 5105220 27152296 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 19.7 5577816 27152296 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 19.9 2817880 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 20.2 2849000 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 20.5 3035292 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 20.7 3071540 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 20.9 3228800 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.2 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.2 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.1 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.1 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.1 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.0 3330176 27152328 00:34 ubuntu 1000
 740300 mo-server       ./mo-server system_vars_con 21.0 3330176 27152328 00:34 ubuntu 1000

nnsgmsone commented 2 years ago

aoe pprof result:

(pprof) top
Showing nodes accounting for 98512.61MB, 97.47% of 101065.36MB total
Dropped 1238 nodes (cum <= 505.33MB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
95659.92MB 94.65% 94.65% 95662.42MB 94.65%  github.com/matrixorigin/matrixone/pkg/vm/engine/aoe/engine.(*worker).alloc
 1512.68MB  1.50% 96.15%  2003.56MB  1.98%  github.com/matrixorigin/matrixone/pkg/vm/engine/aoe/storage/common.(*Mempool).Alloc
  807.85MB   0.8% 96.95%   807.85MB   0.8%  github.com/matrixorigin/matrixone/pkg/vm/engine/aoe/storage/container/vector.NewStrVector (inline)
  528.66MB  0.52% 97.47%   983.54MB  0.97%  reflect.Select
       1MB 0.00099% 97.47%  1217.25MB  1.20%  github.com/matrixorigin/matrixcube/raftstore.(*stateMachine).execWriteRequest
    0.50MB 0.00049% 97.47%  1715.27MB  1.70%  github.com/matrixorigin/matrixcube/raftstore.(*replicaCreator).maybeInitReplica.func1
    0.50MB 0.00049% 97.47%   784.52MB  0.78%  github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
    0.50MB 0.00049% 97.47%  1291.33MB  1.28%  github.com/matrixorigin/matrixcube/raftstore.(*stateMachine).applyCommittedEntries
    0.50MB 0.00049% 97.47%   785.52MB  0.78%  github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
    0.50MB 0.00049% 97.47%   780.30MB  0.77%  github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).Loop

nnsgmsone commented 2 years ago

tae pprof result:

(pprof) top
Showing nodes accounting for 59378.87MB, 99.24% of 59834.97MB total
Dropped 419 nodes (cum <= 299.17MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
54828.03MB 91.63% 91.63% 54832.03MB 91.64%  github.com/matrixorigin/matrixone/pkg/vm/engine/tae/moengine.newReader
    4096MB  6.85% 98.48%     4096MB  6.85%  github.com/matrixorigin/matrixone/pkg/sql/plan2/function/builtin/unary.SpaceInt64
  448.34MB  0.75% 99.23%   448.34MB  0.75%  github.com/matrixorigin/matrixone/pkg/vm/engine/tae/common.init.0.func2
    5.50MB 0.0092% 99.24% 54776.65MB 91.55%  github.com/matrixorigin/matrixone/pkg/sql/compile2.(*Scope).ParallelRun
    0.50MB 0.00084% 99.24%   305.22MB  0.51%  github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
    0.50MB 0.00084% 99.24%   446.89MB  0.75%  github.com/matrixorigin/matrixone/pkg/vm/engine/tae/container/vector.NewStdVector
         0     0% 99.24%   304.22MB  0.51%  github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
         0     0% 99.24%   309.57MB  0.52%  github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).Loop
         0     0% 99.24%  4096.50MB  6.85%  github.com/matrixorigin/matrixone/pkg/sql/colexec2.EvalExpr
         0     0% 99.24%  4096.50MB  6.85%  github.com/matrixorigin/matrixone/pkg/sql/colexec2/projection.Call
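
The 4096MB charged to SpaceInt64 is consistent with select space(4294967295) asking for 4294967295 bytes, one byte short of 4 GiB, in a single evaluation. One possible guard is to reject oversized lengths before allocating. The sketch below only illustrates that idea: spaceChecked, maxSpaceLen and errSpaceTooLong are hypothetical names, the 1 MiB cap is an arbitrary value chosen for the example, and none of this is the actual SpaceInt64 code or the fix that was merged.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxSpaceLen is a hypothetical cap picked for this example;
// it is not a limit taken from MatrixOne.
const maxSpaceLen = 1 << 20

var errSpaceTooLong = errors.New("space: requested length exceeds limit")

// spaceChecked is a hypothetical stand-in for a space(n) builtin.
// It validates n before allocating, so an input such as 4294967295
// fails fast instead of requesting about 4 GiB.
func spaceChecked(n int64) (string, error) {
	if n <= 0 {
		// Assumption: non-positive lengths yield an empty string.
		return "", nil
	}
	if n > maxSpaceLen {
		return "", errSpaceTooLong
	}
	return strings.Repeat(" ", int(n)), nil
}

func main() {
	if _, err := spaceChecked(4294967295); err != nil {
		fmt.Println("rejected:", err)
	}
	s, _ := spaceChecked(3)
	fmt.Printf("%q\n", s)
}
```

Whether to return an error, NULL, or a truncated string for oversized input is a design choice this sketch does not settle.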

LeftHandCold commented 2 years ago

The pprof results above come from the allocs profile (see https://github.com/matrixorigin/matrixone/blob/main/cmd/db-server/debug.go#L50). allocs samples all memory allocations made during program execution. It is not a sampling of the memory held by currently live objects, so a large cumulative figure does not by itself indicate a leak. Reader has no memory leaks. The root cause of this problem is the "space" function.

broccoliSpicy commented 2 years ago

fixed