status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

Traffic spike causes memory issues and timeouts #46

Closed jakubgs closed 2 years ago

jakubgs commented 2 years ago

Some of our hosts are seeing traffic spikes that lead to spikes in memory and CPU usage, followed by unavailability:

image

The host then appears unresponsive, timing out when connections are attempted:

Error occurred: Failed: ERROR[01-18\|13:42:45.436] Error detected while waiting for static peer to be added package=status-go/cmd/node-canary error="wait for peer: timeout"
jakubgs commented 2 years ago

Clear spike in SWAP usage and load:

image

jakubgs commented 2 years ago

Nodes that have been affected by this:

So far only relay nodes, no history nodes.

jakubgs commented 2 years ago

I see that on node-07.ac-cn-hongkong-c.eth.prod the node process was killed due to memory usage:

[1116909.102556] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,mems_allowed=0,oom_memcg=/docker/86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,task_memcg=/docker/86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,task=statusd,pid=1948,uid=100000
[1116909.102618] Memory cgroup out of memory: Killed process 1948 (statusd) total-vm:8276308kB, anon-rss:4126008kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:15636kB oom_score_adj:0
[1116909.729559] oom_reaper: reaped process 1948 (statusd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Which caused the container to restart and recover.
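
For reference, kills like this are easy to spot in the kernel log on any of the hosts (generic commands, nothing fleet-specific):

# look for recent OOM kills in the kernel ring buffer / journal
sudo dmesg -T | grep -iE 'oom-kill|out of memory|oom_reaper'
sudo journalctl -k --since '2 days ago' | grep -i 'killed process'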

jakubgs commented 2 years ago

Other traffic spikes can be seen:

image

But they don't all turn into memory usage spikes:

image

jakubgs commented 2 years ago

The number of peers connected to the node appears to drop during the spike, then recover:

image

jakubgs commented 2 years ago

We can see that the main traffic spike also causes an iowait spike:

image

And we can see blocked processes:

image
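
On a live host the two can be watched together with vmstat (the b column is processes blocked on I/O, wa is iowait); this is just a generic check, not one of our panels:

# sample once a second for 10 seconds; watch the 'b' and 'wa' columns
vmstat 1 10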

jakubgs commented 2 years ago

And a nice spike in page faults and swap usage by the container:

image

jakubgs commented 2 years ago

What's interesting is that the container is limited to 4.4GB of memory, but never reaches that:

image

It appears to start SWAPing around 2.4GB used.
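
To double-check what the container is actually allowed to use, the limits can be read off the container directly (a generic sketch; <container> stands in for the actual statusd container name):

# memory limit and memory+swap limit in bytes (0 or -1 means unlimited)
docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.MemorySwap}}' <container>
# live usage vs. the limit
docker stats --no-stream <container>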

jakubgs commented 2 years ago

Looks like we had the same thing happen on mail-02.gc-us-central1-a.eth.prod:

image

[1208334.768786] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/docker/3c4c07aabcb120555b917167936109662b1e3d6b63703e0be65282918b709047,task_memcg=/docker/3c4c07aabcb120555b917167936109662b1e3d6b63703e0be65282918b709047,task=statusd,pid=1997,uid=100000
[1208334.768813] Memory cgroup out of memory: Killed process 1997 (statusd) total-vm:4728752kB, anon-rss:2545876kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:9120kB oom_score_adj:0
[1208335.145990] oom_reaper: reaped process 1997 (statusd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So it's not just relay nodes.

jakubgs commented 2 years ago

Looks like we got hit again, 5 hosts at the same time:

All experienced the same traffic spike, which started at the same time:

image image image image

jakubgs commented 2 years ago

Wait, it's not just a select few hosts; other hosts saw a spike in traffic too, like node-03.ac-cn-hongkong-c.eth.prod:

image

But it didn't kill the node:

image

jakubgs commented 2 years ago

Yup, pretty much all hosts had the network traffic spike:

image

But only some fail, probably because they were not restarted recently and already had high memory usage.

jakubgs commented 2 years ago

We can see a bit of an increase in envelopes, but not a huge one:

image

Nothing compared to the network traffic.

jakubgs commented 2 years ago

But a much clearer spike in p2p traffic:

image

So it's not more envelopes, but possibly bigger envelopes, or just high-volume p2p traffic.

jakubgs commented 2 years ago

And a very big spike in receiving expired envelopes:

image

Not sure what that means though.

jakubgs commented 2 years ago

I graphed the waku_envelopes_size_bytes_bucket histogram and I'm not seeing a spike in bigger envelopes:

image

Which means we're probably not being spammed by big envelopes.
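
For reference, a query of this kind over the histogram looks like the following (PromQL; the 0.99 quantile, 5m window, and instance grouping are arbitrary choices, not necessarily the exact panel query):

histogram_quantile(0.99,
  sum(rate(waku_envelopes_size_bytes_bucket[5m])) by (le, instance)
)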

jakubgs commented 2 years ago

If we go by p2p inbound traffic we can see there have been 4 spikes in the last 2 days, at irregular intervals:

image https://grafana.infra.status.im/d/wV77E-4mz/statusd-metrics?orgId=1&from=1642514942076&to=1642693350760&viewPanel=16

jakubgs commented 2 years ago

But today was the first really big spike in envelope cache failures:

image

And matching spike in received envelopes:

image

But not new envelopes, as we can see in the graph in https://github.com/status-im/infra-eth-cluster/issues/46#issuecomment-1017618136.

jakubgs commented 2 years ago

@Samyoul do you have any thoughts about this? Can you think of any reason this might not be caused by external events?

jakubgs commented 2 years ago

Today we had simultaneous failure of 4 production nodes:

node-05.do-ams3.eth.prod
mail-01.do-ams3.eth.prod
node-04.ac-cn-hongkong-c.eth.prod
node-08.gc-us-central1-a.eth.prod

All of which were heavily SWAPing when I checked them after the Consul healthcheck failures.
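
For context, the per-process swap usage on a host can be checked with a quick one-liner (generic, nothing status-go specific):

# top swap consumers by process (VmSwap, in kB)
grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -rn | head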

jakubgs commented 2 years ago

We see gradual CPU load growth, and a sudden spike in memory and SWAP usage until there's none left:

image

jakubgs commented 2 years ago

We also see a noticeable rise in allocated sockets right before the memory usage spike appears:

image
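
The same can be checked directly on an affected host (generic commands, not our exporter; the TCP "alloc" counter is likely what the allocated-sockets panel graphs):

# socket summary by protocol/state
ss -s
# kernel socket allocation counters
cat /proc/net/sockstat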

jakubgs commented 2 years ago

The network traffic is quite irregular, but there's no clear spike at the time:

image

jakubgs commented 2 years ago

If we look at the long-term pattern across 7 days we see a gradual increase in memory usage and then a big spike:

image

Though the counts of TCP socket states do not correlate with this growth:

image
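
For an ad-hoc check, the per-state counts can be pulled on a host with ss (a generic one-liner, equivalent in spirit to the panel):

# count TCP sockets by state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn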

jakubgs commented 2 years ago

@Samyoul @cammellos any ideas? If we want to tackle this by increasing host size it will cost quite a lot, and I'm not sure the issue actually is just a scaling issue, since our number of peers has not changed much for the past 2 months.

image

jakubgs commented 2 years ago

Although, indeed, if we look closely at the mail-03.do-ams3.eth.prod peer metrics, there was a spike in the number of peers:

image

jakubgs commented 2 years ago

At the time we had a clear spike in envelope cache failures:

image

jakubgs commented 2 years ago

But if we look closer there is a spike in mobile peers:

image

Which appears to be periodic:

image

And 12:30 appears to be the time it occurs. What is this? Some kind of testing?

jakubgs commented 2 years ago

What's weird is that the spikes come from the 0.67.0 version:

image

At least that's how they are advertised.
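
For reference, a rough way to tally connected peers per advertised client version from the admin_peers output (a sketch using the rpc.sh wrapper; the StatusIM/vX.Y.Z/... name format is an assumption):

# count connected peers per advertised client name/version
/docker/statusd-mail/rpc.sh admin_peers \
  | jq -r '.result[].name' \
  | awk -F/ '{print $1 "/" $2}' | sort | uniq -c | sort -rn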

jakubgs commented 2 years ago

Which is quite old, from 23rd of December 2020: https://github.com/status-im/status-go/releases/tag/v0.67.0

jakubgs commented 2 years ago

And considering the gradually increasing scale of the spike this might be malicious probing by some script kiddie.

jakubgs commented 2 years ago

One idea suggested by @cammellos was that it might be an issue with the canary, but that's a different architecture and version:

admin@node-01.do-ams3.canary.hq:/docker/cabot/repos/status-go % git status
HEAD detached at v0.63.12
nothing to commit, working tree clean
jakubgs commented 2 years ago

I collected all currently connected peers using the 0.67.0 version with:

ansible mail,whisper -a '/docker/statusd-mail/rpc.sh admin_peers | jq -c ".result[] | { enode, name }" | grep "0.67.0"'

And this is what I got, 38 unique enodes: https://gist.github.com/jakubgs/a898c9d527a59c848b569ae407fe96b5 But if I ignore peer IDs and ports I get only 26 IP addresses:

106.57.73.140
106.61.114.59
106.61.135.1
107.148.202.41
112.96.106.134
113.102.233.45
117.136.72.201
117.136.72.31
154.84.1.121
154.84.1.16
161.35.191.40
182.97.219.6
183.224.215.205
183.225.31.40
192.74.239.67
206.119.125.39
220.197.235.193
223.104.174.151
223.104.174.17
39.144.145.107
39.144.145.244
42.243.22.49
43.153.3.60
45.11.104.46
45.11.104.78
51.89.122.51
jakubgs commented 2 years ago

And these IPs appear to be mostly coming from China:

 > awk -F'[@:]' '/enode/{print $4}' list | sort -u | xargs -I{} bash -c 'whois {} | grep netname' | sort -u
netname:        CHINANET-GD
netname:        CHINANET-JX
netname:        CHINANET-YN
netname:        CMNET
netname:        HK-MISAKA-HKG12
netname:        OVH-DEDICATED-FO
netname:        Time_Quantum_Technology
netname:        UNICOM
netname:        UNICOM-GD
 > awk -F'[@:]' '/enode/{print $4}' list | sort -u | xargs -I{} bash -c 'whois {} | grep descr' | sort -u       
descr:          Beijing 100032
descr:          Beijing 100140 ,P.R.China
descr:          China Mobile communications corporation
descr:          China Mobile Communications Corporation
descr:          CHINANET Guangdong province network
descr:          CHINANET JIANGXI PROVINCE NETWORK
descr:          CHINANET YunNan PROVINCE NETWORK
descr:          China Telecom
descr:          China Unicom
descr:          China Unicom CHINA169 Guangdong Province Network
descr:          China Unicom Guangdong province network
descr:          China Unicom Network
descr:          Data Communication Division
descr:          Failover IPs
descr:          Internet Service Provider in China
descr:          Misaka Network, Inc.
descr:          Mobile Communications Network Operator in China
descr:          No.21 Financial Street,Xicheng District,
descr:          No.31,jingrong street
descr:          Time Quantum Technology LTD
jakubgs commented 2 years ago

I will attempt to do similar analytics today starting from 12:00 UTC, to get some more info on who this is.

jakubgs commented 2 years ago

Interesting, something happened overnight to our hosts in Hong Kong:

image

Looks like the metrics disappeared for the Hong Kong hosts.

jakubgs commented 2 years ago

Weird, I can see the metrics just fine in Prometheus:

image

jakubgs commented 2 years ago

It's worth noting that only history (mail) nodes get the spikes. Relay (whisper) nodes do not.

jakubgs commented 2 years ago

Currently our history nodes have around 2-4 GB of RAM. To double the resources it would cost ~370 EUR per month:

(3*20)+(3*49)+(3*52) = 363

Although the Google Cloud history nodes already have 4 GB and they also see a lot of SWAPping:

image

One thing I could do is try bumping up the container memory limits (and adding them where they're missing):

admin@mail-01.gc-us-central1-a.eth.prod:/docker/statusd-mail % cat docker-compose.yml | grep -A2 resources
      resources:
        # some limits due to statusd hogging resources
        limits:
          memory: '2576m'
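
If we go that route, the bump is a small change in the compose file; a sketch assuming the same deploy.resources syntax as the snippet above (the 4g limit and the reservation are illustrative values, nothing is decided yet):

    deploy:
      resources:
        limits:
          memory: '4096m'      # raised hard cap (illustrative)
        reservations:
          memory: '2048m'      # soft guarantee (illustrative)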
jakubgs commented 2 years ago

I bumped the Digital Ocean hosts to s-2vcpu-4gb: https://github.com/status-im/infra-eth-cluster/commit/5997c5ff

 > a mail -a 'free -m'
mail-01.do-ams3.eth.prod | CHANGED | rc=0 >>
              total        used        free      shared  buff/cache   available
Mem:           3931         871        1383         139        1676        2713
Swap:          2047           0        2047
mail-02.do-ams3.eth.prod | CHANGED | rc=0 >>
              total        used        free      shared  buff/cache   available
Mem:           3931         604         135          92        3191        3063
Swap:          2047           0        2047
mail-03.do-ams3.eth.prod | CHANGED | rc=0 >>
              total        used        free      shared  buff/cache   available
Mem:           3931         798         103         139        3029        2757
Swap:          2047         172        1875
jakubgs commented 2 years ago

We had another spike, like clockwork. About the same size as yesterday:

image

jakubgs commented 2 years ago

I collected the peer enodes and names using the admin_peers RPC call:

 > wc -l 1* 
    97 1307.log
   124 1310.log
   150 1314.log
   186 1317.log
   335 1320.log
   469 1324.log
   463 1327.log
   455 1330.log
   407 1334.log
   262 1339.log
   217 1343.log
   169 1347.log
   147 1351.log
   117 1355.log
   105 1400.log
  3703 total
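
For reproducibility, a rough sketch of one way such periodic snapshots could be taken (the ansible command mirrors the one from an earlier comment; the 3-minute interval and HHMM naming just match the log files above):

# dump connected peers from all history and relay nodes every ~3 minutes
while true; do
  ansible mail,whisper -a \
    '/docker/statusd-mail/rpc.sh admin_peers | jq -c ".result[] | { enode, name }"' \
    > "$(date -u +%H%M).log"
  sleep 180
done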
jakubgs commented 2 years ago

The IPs appear to be mostly from China:

 > awk -F'[@:]' '/enode/{print $4}' 1324.log | sort -u | xargs -n1 whois | awk -F: '/descr/{l[$2]++}END{for(n in l){print l[n] " - " n}}' | sort -h | tail -n16
5 -           No.21-3, Sec. 1, Xinyi Rd., Taipei 10048, Taiwan, R.O.C.
5 -           No.21, Sec.1, Xinyi Rd., Taipei City
5 -           Taipei Taiwan
6 -           YISP
6 -           Your Internet Service Provider
7 -           China Mobile communications corporation
10 -           Chunghwa Telecom Co.,Ltd.
11 -           China Unicom
14 -           Internet Service Provider in China
14 -           Mobile Communications Network Operator in China
15 -           China Mobile Communications Corporation
15 -           Time Quantum Technology LTD
22 -           CHINANET YunNan PROVINCE NETWORK
30 -           Beijing 100032
30 -           No.31,jingrong street
31 -           China Telecom

And there are two IPs that generate the majority of peers:

 > awk -F'[@:]' '/enode/{l[$4]++}END{for (a in l) { print l[a] " - " a}}' 1324.log | sort -h | tail -n5
7 - 92.118.236.91
8 - 206.119.125.40
10 - 42.243.22.112
24 - 45.11.104.78
30 - 45.11.104.46
jakubgs commented 2 years ago

One simple thing we could try today is to add the noisiest IPs to the sshguard4 ipset:

admin@mail-01.do-ams3.eth.prod:~ % t -L INPUT
Chain INPUT (policy DROP)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere             match-set sshguard4 src
ACCEPT     all  --  anywhere             anywhere             /* loopback interface */
ACCEPT     all  --  anywhere             anywhere             /* docker0 interface */
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere             /* ICMP */
CONSUL     all  --  anywhere             anywhere             /* CONSUL */
SERVICES   all  --  anywhere             anywhere             /* SERVICES */
VPN        all  --  anywhere             anywhere             /* VPN */

And just drop the packets on the firewall.
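
A sketch of what that would look like, using the two noisiest IPs from the earlier list (whether a timeout can be set depends on how the sshguard4 set was created):

# drop traffic from the noisiest sources via the existing sshguard4 ipset
sudo ipset add sshguard4 45.11.104.46
sudo ipset add sshguard4 45.11.104.78
# confirm they were added
sudo ipset list sshguard4 | tail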

jakubgs commented 2 years ago

In theory we could also have SSHGuard analyze status-go logs to find abusers and ban them automatically.

The issue with that is that I don't know which log messages would be a good signal for it, since all these peers do is make valid connections. @Samyoul @cammellos any ideas?

jakubgs commented 2 years ago

Actually, I don't think sshguard can easily be extended to parse custom logs: https://bitbucket.org/sshguard/sshguard/src/653c2ddc/src/parser/attack_scanner.l#lines-180

jakubgs commented 2 years ago

Today there was no spike:

image

Interesting.

jakubgs commented 2 years ago

Looks like the spikes have calmed down, but the 30-70 nodes using 0.67.0 still hang around:

image

Might be calm before the storm.

jakubgs commented 2 years ago

Since this has calmed down I'm closing this, but might reopen if we see this behavior again.