status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

Investigate Statusd containers restarted #37

Closed: jakubgs closed this issue 3 years ago

jakubgs commented 4 years ago

We had a spike in activity today around 3 AM UTC, and multiple containers restarted.

jakubgs commented 4 years ago

It doesn't look like an image update triggered by Watchtower, because the container has existed for more than 2 weeks:

admin@mail-02.do-ams3.eth.prod:~ % docker ps
CONTAINER ID        NAMES               IMAGE                              CREATED             STATUS
40553f753762        statusd-mail        statusteam/status-go:deploy-prod   2 weeks ago         Up 8 hours
840421b6fd64        statusd-mail-db     postgres:9.6-alpine                2 weeks ago         Up 11 days

As we can see, the PostgreSQL container was not affected: it has been up for 11 days, while statusd-mail has only been up for 8 hours.
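
For the record, one way to confirm this from Docker itself is to check the container's restart count, last start time, and whether Docker recorded the last exit as an OOM kill (a minimal check; field availability can vary by Docker version):

admin@mail-02.do-ams3.eth.prod:~ % docker inspect -f '{{.RestartCount}} {{.State.OOMKilled}} {{.State.StartedAt}}' statusd-mail

A restart policy restarts the same container in place after the process dies, which matches the CREATED/STATUS mismatch above.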

jakubgs commented 4 years ago

The logs on mail-02.do-ams3.eth.prod look like this:

INFO [10-06|03:01:46.977] received new envelope                    peer=0x40a8cfd21a3a2a45ff6b18c6db3ba5dfe4acd8d4be6c17e84554d648365e7547 hash=0x151c7715a3d94d40de111223a8e0009e50b7d2d9fb0eae60857dc05d5f10fd41
INFO [10-06|03:01:47.025] settings options                         peerID=0x1aad8b5be20513ad5766b3afb1a0e7c40a9b90a9a0f61f21118b05df21fc5e21 Options="{PoWRequirement:0xc012f42810 BloomFilter:[255 255 255 255 255 255 255 255 255 
INFO [10-06|03:01:47.027] handling peer                            id=0x34f766b58ca7f8ea197bfd2ffb377c474acb4b254cff4ad693a5b403e82001df ip=178.128.141.249
WARN [10-06|03:01:47.098] failed to deliver envelopes              peer=891438aea1fca794 error="write tcp 172.17.0.3:43186->35.239.5.162:30305: use of closed network connection"
INFO [10-06|03:01:47.128] settings options                         peerID=0x34f766b58ca7f8ea197bfd2ffb377c474acb4b254cff4ad693a5b403e82001df Options="{PoWRequirement:0xc014608ab8 BloomFilter:[255 255 255 255 255 255 255 255 255 
INFO [10-06|03:01:47.129] message loop                             peer=891438aea1fca794 err="p2p: read or write on closed message pipe"
INFO [10-06|03:01:47.269] received new envelopes                   count=1 peer=0xfd5eed8a28b7ca5d77fb305b52bc1f0ffebbc34b6386334aad5d4c5fd03d839d ip=35.223.240.17
INFO [10-06|03:01:47.271] received new envelope                    peer=0xfd5eed8a28b7ca5d77fb305b52bc1f0ffebbc34b6386334aad5d4c5fd03d839d hash=0x151c7715a3d94d40de111223a8e0009e50b7d2d9fb0eae60857dc05d5f10fd41
WARN [10-06|03:01:47.391] failed to deliver envelopes              peer=98ab721a4762645a error="shutting down"
WARN [10-06|03:01:48.701] oversize message received                peer="[36 186 242 213 107 222 203 98 106 38 94 142 216 14 22 25 92 1 196 149 235 230 195 95 204 20 195 229 60 23 142 7]" size=7233488
INFO [10-06|03:01:48.874] message loop                             peer=24baf2d56bdecb62 err="p2p: read or write on closed message pipe"
WARN [10-06|03:01:48.964] failed to deliver envelopes              peer=24baf2d56bdecb62 error="shutting down"
WARN [10-06|03:01:49.064] oversize message received                peer="[26 173 139 91 226 5 19 173 87 102 179 175 177 160 231 196 10 155 144 169 160 246 31 33 17 139 5 223 33 252 94 33]" size=6677152
WARN [10-06|03:01:49.084] failed to deliver envelopes              peer=1aad8b5be20513ad error="shutting down"
INFO [10-06|03:01:49.087] message loop                             peer=1aad8b5be20513ad err="p2p: read or write on closed message pipe"
INFO [10-06|03:01:49.594] message loop                             peer=34f766b58ca7f8ea err="p2p: read or write on closed message pipe"
INFO [10-06|03:01:49.595] failed to read a message                 peer="[52 247 102 181 140 167 248 234 25 123 253 47 251 55 124 71 74 203 75 37 76 255 74 214 147 165 180 3 232 32 1 223]" error="p2p: read or write on closed mes
WARN [10-06|03:01:49.810] failed to deliver envelopes              peer=34f766b58ca7f8ea error="write tcp 172.17.0.3:45150->178.128.141.249:30305: write: broken pipe"
INFO [10-06|03:01:49.846] handling peer                            id=0xcbff66ca3e3e6f9e21185aa2a7b8a7a46e98cad7f0a6222465bd96be05b82c28 ip=47.75.85.212
INFO [10-06|03:01:51.806] received new envelopes                   count=1 peer=0x0e969556ca4bf2a0266aed93582ea55f01c09f2431fe84d22d2b7acdf2acb592 ip=178.128.141.87
INFO [10-06|03:01:51.816] received new envelope                    peer=0x0e969556ca4bf2a0266aed93582ea55f01c09f2431fe84d22d2b7acdf2acb592 hash=0xe919c0d0af26a95999136e3e47df8295ca7b93cd624e16b37532cc358deddd64
INFO [10-06|03:01:51.842] received new envelopes                   count=1 peer=0x40a8cfd21a3a2a45ff6b18c6db3ba5dfe4acd8d4be6c17e84554d648365e7547 ip=134.209.136.79
INFO [10-06|03:01:51.844] received new envelope                    peer=0x40a8cfd21a3a2a45ff6b18c6db3ba5dfe4acd8d4be6c17e84554d648365e7547 hash=0xe919c0d0af26a95999136e3e47df8295ca7b93cd624e16b37532cc358deddd64
INFO [10-06|03:01:55.330] Status backend initialized               backend=geth version=0.61.2 commit=845980535
INFO [10-06|03:01:55.365] LES protocol is disabled                 package=status-go/node
INFO [10-06|03:01:55.365] Status service api is disabled           package=status-go/node
INFO [10-06|03:01:55.366] Starting peer-to-peer node               instance=Statusd/v0.61.2/linux-amd64/go1.13.8
INFO [10-06|03:01:55.370] Connecting to postgres database 
INFO [10-06|03:01:58.199] Connected to postgres database 
INFO [10-06|03:01:58.199] Starting cleaning envelopes              period=1h0m0s retention=720h0m0s
INFO [10-06|03:01:58.276] Connecting to postgres database 
INFO [10-06|03:02:04.919] Connected to postgres database 
INFO [10-06|03:02:04.920] Starting cleaning envelopes              period=1h0m0s retention=720h0m0s
INFO [10-06|03:02:05.023] started whisper v.6.0 
INFO [10-06|03:02:05.031] New local node record                    seq=76 id=7d646a1b74af5f29 ip=127.0.0.1 udp=0 tcp=30504
INFO [10-06|03:02:05.033] Started P2P networking                   self="enode://7aa648d6e855950b2e3d3bf220c496e0cae4adfddef3e1e6062e6b177aec93bc6cdcf1282cb40d1656932ebfdd565729da440368d7c4da7dbd4d004b1ac02bf8@127.0.0.1:30504?discport=0"
INFO [10-06|03:02:05.028] IPC endpoint opened                      url=/data/geth.ipc
WARN [10-06|03:02:05.039] Sanitizing invalid HTTP read timeout     provided=0s updated=30s
WARN [10-06|03:02:05.040] Sanitizing invalid HTTP write timeout    provided=0s updated=30s
WARN [10-06|03:02:05.040] Sanitizing invalid HTTP idle timeout     provided=0s updated=2m0s
INFO [10-06|03:02:05.043] HTTP endpoint opened                     url=http://0.0.0.0:8545 cors= vhosts=localhost
INFO [10-06|03:02:05.043] setup a Whisper-Waku bridge successfully 
INFO [10-06|03:02:05.044] Handlers registered                      package=status-go/api.GethStatusBackend
INFO [10-06|03:02:05.044] Using AdvertiseAddr for rendezvous       package=status-go/node.StatusNode addr=178.128.142.26
INFO [10-06|03:02:05.047] UDP listener up                          net=enode://7aa648d6e855950b2e3d3bf220c496e0cae4adfddef3e1e6062e6b177aec93bc6cdcf1282cb40d1656932ebfdd565729da440368d7c4da7dbd4d004b1ac02bf8@[::]:30504
INFO [10-06|03:02:05.092] Difference with ntp servers              offset=-9.702438ms

It appears to have died at 03:01:51.844 and started again about four seconds later, at 03:01:55.330.
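
A dead process with no shutdown logging, restarted seconds later by the restart policy, usually points at a kill from outside the process. A quick check for a kernel OOM kill would be something like this (log path varies by distribution):

admin@mail-02.do-ams3.eth.prod:~ % grep -i 'killed process' /var/log/kern.log
admin@mail-02.do-ams3.eth.prod:~ % journalctl -k --since "03:00" | grep -iE 'oom|killed process'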

jakubgs commented 4 years ago

This did not happen on all hosts:

 > ansible mail -m shell -o -a 'grep "Status backend initialized" /var/log/docker/statusd-mail/docker.log > /dev/null && echo PRESENT || echo ABSENT'

mail-02.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) PRESENT
mail-01.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) PRESENT
mail-03.do-ams3.eth.prod | CHANGED | rc=0 | (stdout) ABSENT
mail-01.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) ABSENT
mail-03.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) PRESENT
mail-02.gc-us-central1-a.eth.prod | CHANGED | rc=0 | (stdout) PRESENT
mail-01.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) PRESENT
mail-03.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) ABSENT
mail-02.ac-cn-hongkong-c.eth.prod | CHANGED | rc=0 | (stdout) ABSENT
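
A similar check can be made against Docker directly: a container whose STATUS uptime is much shorter than its CREATED age has been restarted in place. For example:

 > ansible mail -m shell -o -a 'docker ps --filter name=statusd-mail'
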
jakubgs commented 4 years ago

Here are some graphs from mail-01.do-ams3.eth.prod:

[Graphs: host load, network traffic, and memory usage]

Container CPU and Memory usage clearly show a spike for the statusd-mail container:

[Graphs: container CPU and memory usage]

https://grafana.status.im/d/QCTZ8-Vmk/single-host-dashboard?orgId=1&refresh=10s&var-host=mail-01.do-ams3.eth.prod&from=1601939365572&to=1601982565572
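
Outside Grafana, per-container usage can also be snapshotted directly on the host (assuming the same container names as on mail-02):

admin@mail-01.do-ams3.eth.prod:~ % docker stats --no-stream statusd-mail statusd-mail-db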

jakubgs commented 4 years ago

And on mail-02.do-ams3.eth.prod it appears the process was killed by the kernel OOM killer:

Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685433] statusd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685437] CPU: 0 PID: 169572 Comm: statusd Not tainted 5.4.0-47-generic #51-Ubuntu
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685438] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685439] Call Trace:
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685448]  dump_stack+0x6d/0x9a
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685465]  dump_header+0x4f/0x1eb
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685467]  oom_kill_process.cold+0xb/0x10
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685468]  out_of_memory.part.0+0x1df/0x3d0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685470]  out_of_memory+0x6d/0xd0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685473]  mem_cgroup_out_of_memory+0xbd/0xe0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685475]  try_charge+0x77c/0x810
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685480]  ? __alloc_pages_nodemask+0x173/0x320
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685482]  mem_cgroup_try_charge+0x71/0x190
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685484]  mem_cgroup_try_charge_delay+0x22/0x50
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685487]  do_anonymous_page+0x13c/0x650
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685489]  __handle_mm_fault+0x760/0x7a0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685491]  handle_mm_fault+0xca/0x200
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685499]  do_user_addr_fault+0x1f9/0x450
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685501]  __do_page_fault+0x58/0x90
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685505]  ? exit_to_usermode_loop+0x8f/0x160
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685507]  do_page_fault+0x2c/0xe0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685511]  do_async_page_fault+0x39/0x70
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685516]  async_page_fault+0x34/0x40
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685518] RIP: 0033:0x481f03
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685520] Code: 00 48 81 eb 00 01 00 00 48 81 c7 00 01 00 00 48 81 fb 00 01 00 00 73 82 e9 07 ff ff ff c5 fd ef c0 48 81 fb 00 00 00 02 73 46 <c5> fe 7f 07 c5 fe 7f 47 20 c5 fe 7f 47 40 c5 fe 7f 47 60 48 81 eb
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685521] RSP: 002b:00007ff13eb32970 EFLAGS: 00010202
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685523] RAX: 0000000000000000 RBX: 0000000000003000 RCX: 000000c019ecc000
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685523] RDX: 00000000026bc0c0 RSI: 0000000000000040 RDI: 000000c019f51000
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685524] RBP: 00007ff13eb329b8 R08: 000000c019ecc000 R09: 0000000000000044
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685525] R10: 00007ff13db61ea8 R11: 0000000000000fa9 R12: 0000000000000043
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685525] R13: 0000000000000000 R14: 0000000001b0c920 R15: 0000000000000000
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685540] memory: usage 1423360kB, limit 1423360kB, failcnt 30678
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685541] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685541] kmem: usage 4960kB, limit 9007199254740988kB, failcnt 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.685542] Memory cgroup stats for /docker/40553f753762b847898d45868683a2b70939f3c1b2c8bd60b677195489bbca45:
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] anon 1452310528
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] file 163840
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] kernel_stack 184320
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] slab 1536000
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] sock 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] shmem 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] file_mapped 135168
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] file_dirty 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] file_writeback 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] anon_thp 41943040
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] inactive_anon 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] active_anon 1452216320
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] inactive_file 98304
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] active_file 45056
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] unevictable 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] slab_reclaimable 974848
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] slab_unreclaimable 561152
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgfault 1205523
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgmajfault 10098
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] workingset_refault 199386
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] workingset_activate 107943
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] workingset_nodereclaim 0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgrefill 1355261
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgscan 3065911
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgsteal 1043413
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgactivate 1388079
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686109] pgdeactivate 1256287
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686110] Tasks state (memory values in pages):
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686111] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686115] [ 169540] 100000 169540   427003   354338  3305472        0             0 statusd
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686117] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=40553f753762b847898d45868683a2b70939f3c1b2c8bd60b677195489bbca45,mems_allowed=0,oom_memcg=/docker/40553f753762b847898d45868683a2b70939f3c1b2c8bd60b677195489bbca45,task_memcg=/docker/40553f753762b847898d45868683a2b70939f3c1b2c8bd60b677195489bbca45,task=statusd,pid=169540,uid=100000
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.686148] Memory cgroup out of memory: Killed process 169540 (statusd) total-vm:1708012kB, anon-rss:1417352kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:3228kB oom_score_adj:0
Oct  6 03:01:53 mail-02.do-ams3.eth.prod kernel: [1000887.807490] oom_reaper: reaped process 169540 (statusd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
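
The usage and limit lines are equal (1423360 kB, roughly 1.36 GiB), and constraint=CONSTRAINT_MEMCG shows the kill came from the container's own memory cgroup, not host-wide memory pressure. Assuming the limit was set via Docker's memory option, it should show up in the container config; the value below is just the kernel-log limit converted to bytes (1423360 kB * 1024 = 1457520640):

admin@mail-02.do-ams3.eth.prod:~ % docker inspect -f '{{.HostConfig.Memory}}' statusd-mail
1457520640
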
jakubgs commented 4 years ago

Looks like we had a major spike in envelopes at that time:

[Graphs: statusd errors, unique envelope counts, and inbound traffic for the prod fleet]

https://grafana.status.im/d/wV77E-4mz/statusd-metrics?orgId=1&from=1601939917298&to=1601983117299&var-host=mail.*&var-fleet=prod&var-dc=ac-cn-hongkong-c&var-dc=do-ams3&var-dc=gc-us-central1-a

jakubgs commented 4 years ago

This is essentially a doubling of our normal unique envelope rate.

jakubgs commented 4 years ago

It's likely that this was an attempted spam attack. @cammellos suggested it might have been caused by notifications, but that seems unlikely.

jakubgs commented 3 years ago

Closing, since this is quite old and we haven't seen a recurrence.