splunk / eventgen

Splunk Event Generator: Eventgen
Apache License 2.0

[BUG] Some processes went down in Eventgen standalone because of OOM #298

Closed. Yangxulight closed this issue 5 years ago

Yangxulight commented 5 years ago

Describe the bug After running eventgenX (6.5) to generate ES data for 3 days, some processes went down and the generation rate slowed down.

To Reproduce

Expected behavior Eventgen process should keep running.

Actual behavior Some Eventgen processes are down.

Screenshots Process status:

bash-5.0# ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 {entrypoint.sh} /bin/bash /sbin/entrypoint.sh standalone
    8 root      0:00 /usr/sbin/sshd
   10 root      0:15 tail -F -n0 /etc/hosts
  213 root      1d05 python bin/conductor2-agent --port 3333
  236 root      7:04 /usr/bin/python -c from conductor2 import agent; agent.monitor_proc() --pid 213 --int
  241 root      1:43 /usr/bin/python -c from conductor2 import worker; worker.run_worker() --type user --p
25812 root      3d00 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27282 root      2h54 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27392 root      1d02 [splunk_eventgen]
27393 root      1d17 [splunk_eventgen]
27397 root      1d03 [splunk_eventgen]
27398 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27399 root      1d00 [splunk_eventgen]
27409 root     23h12 [splunk_eventgen]
27413 root     21h50 [splunk_eventgen]
27414 root      1d14 [splunk_eventgen]
27415 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27416 root     20h41 [splunk_eventgen]
27417 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27431 root      1d11 [splunk_eventgen]
27435 root      1d06 [splunk_eventgen]
27437 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27439 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27440 root      1d08 [splunk_eventgen]
27450 root      1d22 [splunk_eventgen]
27456 root      2d03 [splunk_eventgen]
27458 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27459 root      2d09 [splunk_eventgen]
120336root      0:00 bash
121447root      0:00 ps -ef
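To put a number on "some processes are down", the listing above can be tallied with a short script. This is only a sketch: it assumes the BusyBox-style `ps` format shown here, where a bare `[splunk_eventgen]` entry means the process has no command line (typical of an exited or defunct process) and a `{splunk_eventgen} ...` entry still has one. The function name and the sample rows are illustrative and not part of Eventgen.

```python
import re
from collections import Counter


def classify_ps_lines(ps_output, name="splunk_eventgen"):
    """Count ps rows for `name`, split by whether a command line is shown."""
    # A row ending in "[name]" has an empty command line.
    bare = re.compile(r"\[%s\]\s*$" % re.escape(name))
    counts = Counter()
    for line in ps_output.splitlines():
        if name not in line:
            continue
        if bare.search(line):
            counts["no_cmdline"] += 1
        else:
            counts["with_cmdline"] += 1
    return counts


# Four rows copied from the listing above.
SAMPLE = """\
25812 root      3d00 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
27392 root      1d02 [splunk_eventgen]
27393 root      1d17 [splunk_eventgen]
27398 root      2d13 {splunk_eventgen} /usr/bin/python2 /usr/bin/splunk_eventgen service --role standalone
"""

if __name__ == "__main__":
    print(dict(classify_ps_lines(SAMPLE)))
```

Run against the full listing, this gives 14 `splunk_eventgen` entries without a command line and 8 with one.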

Dmesg:

[372382.889952] 0 pages hwpoisoned
[372382.889953] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[372382.889961] [  516]     0   516    30397    19082      63       3        0             0 systemd-journal
[372382.889965] [  567]     0   567    25742      176      19       3        0             0 lvmetad
[372382.889967] [  596]     0   596    11001      803      23       3        0         -1000 systemd-udevd
[372382.889971] [  644]   100   644    25081      451      19       3        0             0 systemd-timesyn
[372382.889974] [ 1180]     0  1180     4030      533      12       3        0             0 dhclient
[372382.889978] [ 1351]     0  1351     6932      376      17       3        0             0 cron
[372382.889981] [ 1354]     0  1354     5024      318      14       3        0             0 systemd-logind
[372382.889984] [ 1356]   107  1356    10757      406      25       3        0          -900 dbus-daemon
[372382.889987] [ 1362]     0  1362    68718      516      37       3        0             0 accounts-daemon
[372382.889990] [ 1364]     0  1364     1099      295       8       3        0             0 acpid
[372382.889993] [ 1368]     0  1368     6511      315      18       3        0             0 atd
[372382.889996] [ 1402]   104  1402   146034    13030      91       3        0             0 rsyslogd
[372382.889999] [ 1406]     0  1406   132578      440      26       4        0             0 lxcfs
[372382.890002] [ 1415]     0  1415   289640    27923     142       4        0             0 collectd
[372382.890005] [ 1424]     0  1424   608562     7183     130       7        0             0 containerd
[372382.890008] [ 1430]     0  1430     1305       29       8       3        0             0 iscsid
[372382.890011] [ 1431]     0  1431     1430      881       8       3        0           -17 iscsid
[372382.890014] [ 1435]     0  1435    16378      424      37       3        0         -1000 sshd
[372382.890017] [ 1466]     0  1466     3343       51      11       3        0             0 mdadm
[372382.890020] [ 1511]     0  1511     3664      341      12       3        0             0 agetty
[372382.890023] [ 1512]     0  1512     3618      394      12       3        0             0 agetty
[372382.890026] [ 1548]     0  1548     4970      333      15       3        0             0 irqbalance
[372382.890029] [ 1555]     0  1555    69278      297      39       4        0             0 polkitd
[372382.890033] [ 1581] 65534  1581     5013      540      15       3        0             0 docker-shim-col
[372382.890036] [ 1963]   114  1963   332686     7674      92       5        0             0 named
[372382.890039] [ 2771]     0  2771   397972     4358      82       6        0             0 amazon-ssm-agen
[372382.890043] [ 3055]     0  3055   859685    47357     880       8        0          -999 dockerd
[372382.890047] [ 3181] 65534  3181   258734     2188      59       6        0             0 collectd-docker
[372382.890050] [10177]     0 10177     2089      303      11       5        0          -500 docker-proxy
[372382.890052] [10201]     0 10201     2089      301      11       5        0          -500 docker-proxy
[372382.890056] [10213]     0 10213     2417      303      11       5        0          -500 docker-proxy
[372382.890059] [10219]     0 10219     2947      533      10       5        0          -999 containerd-shim
[372382.890062] [10237]     0 10237    16594     3945      36       5        0          -600 ucp-agent
[372382.890065] [10347]     0 10347     2331      168       9       5        0          -999 containerd-shim
[372382.890068] [10351]     0 10351     2331      250       9       5        0          -999 containerd-shim
[372382.890071] [10381]     0 10381   783098     4942     164       8        0          -999 kube-proxy
[372382.890076] [10382]     0 10382   995469    14840     205       8        0          -999 kubelet
[372382.890080] [10754]     0 10754     2683      188      11       5        0          -999 containerd-shim
[372382.890083] [10777]     0 10777      256        1       4       2        0          -998 pause
[372382.890086] [29222]     0 29222     2947      272      10       5        0          -999 containerd-shim
[372382.890089] [29239]     0 29239      553       51       6       3        0             0 entrypoint.sh
[372382.890093] [29261]     0 29261     1081      132       5       3        0             0 sshd
[372382.890096] [29263]     0 29263      388       12       4       3        0             0 tail
[372382.890099] [35331]     0 35331    20898    13460      45       5        0             0 python
[372382.890101] [37431]     0 37431    16839     9744      37       3        0             0 python
[372382.890104] [37436]     0 37436    19564    11220      43       3        0             0 python
[372382.890107] [71469]     0 71469   302873    17125     595       4        0             0 splunk_eventgen
[372382.890111] [73203]     0 73203   277344    15646     550       4        0             0 splunk_eventgen
[372382.890114] [73319]     0 73319  4678942  4385898    9261      21        0             0 splunk_eventgen
[372382.890118] [73336]     0 73336  4669751  4368921    9258      21        0             0 splunk_eventgen
[372382.890121] [73338]     0 73338  4696118  4399993    9297      21        0             0 splunk_eventgen
[372382.890124] [73358]     0 73358  4707980  4418621    9276      21        0             0 splunk_eventgen
[372382.890127] [73360]     0 73360  4679864  4391365    9279      21        0             0 splunk_eventgen
[372382.890130] [73379]     0 73379  4698061  4406017    9313      21        0             0 splunk_eventgen
[372382.890133] [73380]     0 73380  4722276  4432904    9304      21        0             0 splunk_eventgen
[372382.890137] [55660]     0 55660   402801     2759      87       8        0          -900 snapd
[372382.890141] [105442]     0 105442    25791      152       7       4        0             0 start-amazon-cl
[372382.890144] [105466]     0 105466   479682     3405      95       5        0             0 amazon-cloudwat
[372382.890147] [95852]     0 95852     2947      740      10       5        0          -999 containerd-shim
[372382.890151] [95876]     0 95876      197        9       5       3        0           999 runsvdir
[372382.890154] [95909]     0 95909     2683      240      10       5        0          -999 containerd-shim
[372382.890157] [95952]     0 95952    14396     2814      32       5        0             0 ucp-agent
[372382.890161] [96192]     0 96192     2683      226      11       5        0          -999 containerd-shim
[372382.890164] [96202]     0 96202     2331      288       9       5        0          -999 containerd-shim
[372382.890167] [96245]     0 96245      551       55       6       3        0             0 entrypoint.sh
[372382.890171] [96281]     0 96281      192        9       6       3        0           999 runsv
[372382.890173] [96282]     0 96282      192        8       6       3        0           999 runsv
[372382.890177] [96283]     0 96283      192        9       5       3        0           999 runsv
[372382.890180] [39296]     0 39296      388        1       3       3        0             0 sleep
[372382.890183] [86152]     0 86152    33514     1037      26       5        0          -999 runc
[372382.890186] [86176]     0 86176    32834      935      26       6        0          -999 runc
[372382.890189] [86218] 65534 86218     5013      360      14       3        0             0 docker-shim-col
[372382.890192] [86219] 65534 86219     9021      330      21       3        0             0 ps
[372382.890196] [86220] 65534 86220     3236      339      12       3        0             0 grep
[372382.890199] [86221] 65534 86221     2796       17       9       3        0             0 grep
[372382.890202] [86222] 65534 86222       90        1       5       3        0             0 wc
[372382.890205] [86276]     0 86276     2737       34       8       3        0             0 systemd-cgroups
[372382.890209] [86278]     0 86278      379        9       6       3        0           999 run
[372382.890212] Out of memory: Kill process 96281 (runsv) score 998 or sacrifice child
[372382.896343] Killed process 96281 (runsv) total-vm:768kB, anon-rss:36kB, file-rss:0kB
[372382.938300] Out of memory: Kill process 86278 (run) score 998 or sacrifice child
[372382.946301] Killed process 86278 (run) total-vm:1516kB, anon-rss:36kB, file-rss:0kB
[372383.154907] Out of memory: Kill process 95876 (runsvdir) score 998 or sacrifice child
[372383.162453] Killed process 96282 (runsv) total-vm:768kB, anon-rss:32kB, file-rss:0kB
[372383.183695] Out of memory: Kill process 95876 (runsvdir) score 998 or sacrifice child
[372383.190269] Killed process 96283 (runsv) total-vm:768kB, anon-rss:36kB, file-rss:0kB
[372383.270034] Out of memory: Kill process 95876 (runsvdir) score 998 or sacrifice child
[372383.277486] Killed process 86281 (run) total-vm:1516kB, anon-rss:32kB, file-rss:0kB
[372383.299728] Out of memory: Kill process 95876 (runsvdir) score 998 or sacrifice child
[372383.305859] Killed process 95876 (runsvdir) total-vm:788kB, anon-rss:36kB, file-rss:0kB
[372383.399302] Out of memory: Kill process 73380 (splunk_eventgen) score 141 or sacrifice child
[372383.405931] Killed process 73380 (splunk_eventgen) total-vm:18889104kB, anon-rss:17731616kB, file-rss:0kB
[372390.106566] TCP: request_sock_TCP: Possible SYN flooding on port 179. Sending cookies.  Check SNMP counters

Sample files and eventgen.conf file http://10.66.136.173/web/egx/download/bundle/es_benchmark.tgz


li-wu commented 5 years ago

Tested with a version that includes some changes. After running for three days, 14 processes are down and only 6 are still running, each using about 18GB of memory. The throughput per indexer (the test env has three indexers) drops from 2MB/s to 1.2MB/s for perdayvolume=600GB.
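As a sanity check on these figures, the expected per-indexer rate can be derived from the configured daily volume. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 1000 MB) and an even split across the three indexers; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_indexer_rate_mb_s(per_day_volume_gb, indexers):
    """Expected MB/s per indexer for a given daily volume, split evenly."""
    return per_day_volume_gb * 1000.0 / SECONDS_PER_DAY / indexers


def implied_daily_volume_gb(rate_mb_s, indexers):
    """Daily volume in GB implied by an observed per-indexer rate."""
    return rate_mb_s * indexers * SECONDS_PER_DAY / 1000.0


if __name__ == "__main__":
    # Configured: perdayvolume=600GB across three indexers.
    print("expected MB/s per indexer: %.2f" % per_indexer_rate_mb_s(600, 3))
    # Observed after three days: 1.2MB/s per indexer.
    print("implied GB/day at 1.2MB/s: %.0f" % implied_daily_volume_gb(1.2, 3))
```

Under these assumptions, 600GB per day works out to roughly 2.3MB/s per indexer, in line with the initial 2MB/s, while 1.2MB/s corresponds to only about 311GB per day.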