Clear spike in SWAP usage and load:
Nodes that have been affected by this:
node-05.gc-us-central1-a.eth.prod
node-07.gc-us-central1-a.eth.prod
node-03.ac-cn-hongkong-c.eth.prod
node-07.ac-cn-hongkong-c.eth.prod
So far only relay nodes, no history nodes.
I see that on node-07.ac-cn-hongkong-c.eth.prod the node process was killed due to memory usage:
[1116909.102556] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,mems_allowed=0,oom_memcg=/docker/86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,task_memcg=/docker/86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3,task=statusd,pid=1948,uid=100000
[1116909.102618] Memory cgroup out of memory: Killed process 1948 (statusd) total-vm:8276308kB, anon-rss:4126008kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:15636kB oom_score_adj:0
[1116909.729559] oom_reaper: reaped process 1948 (statusd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Which caused the container to restart and recover.
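For reference, a quick way to spot these kills on other hosts is to grep the kernel log for OOM events; a minimal sketch, assuming root access on the host:

```sh
# recent OOM kills with human-readable timestamps
dmesg -T | grep -Ei 'oom-kill|killed process'

# or via the journal, kernel messages from the last day only
journalctl -k --since '1 day ago' | grep -Ei 'oom-kill|killed process'
```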
Other traffic spikes can be seen:
But they don't all turn into memory usage spikes:
Numbers of peers connected to the node appears to drop during the spike, then recover:
We can see that the main traffic spike also causes an iowait
spike:
And we can see blocked processes:
And a nice spike in page faults and swap usage by the container:
What's interesting is that the container is limited to 4.4GB of memory, but never reaches that:
It appears to start SWAPping at around 2.4GB used.
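If the process is being pushed to swap well below its hard limit, the cgroup's swap accounting and swappiness are worth checking; a minimal sketch, assuming cgroup v1 with swap accounting enabled, and reusing the container ID from the OOM log above:

```sh
CID=86d42ece80c2d6499fb0be9614053addb22503ce29b162403b6d5d7a85cfd8a3
cd /sys/fs/cgroup/memory/docker/$CID
cat memory.limit_in_bytes        # hard RAM limit for the container
cat memory.memsw.limit_in_bytes  # combined RAM+swap limit (only with swapaccount=1)
cat memory.swappiness            # how eagerly the kernel swaps this cgroup
```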
Looks like we had the same thing happen on mail-02.gc-us-central1-a.eth.prod:
[1208334.768786] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/docker/3c4c07aabcb120555b917167936109662b1e3d6b63703e0be65282918b709047,task_memcg=/docker/3c4c07aabcb120555b917167936109662b1e3d6b63703e0be65282918b709047,task=statusd,pid=1997,uid=100000
[1208334.768813] Memory cgroup out of memory: Killed process 1997 (statusd) total-vm:4728752kB, anon-rss:2545876kB, file-rss:0kB, shmem-rss:0kB, UID:100000 pgtables:9120kB oom_score_adj:0
[1208335.145990] oom_reaper: reaped process 1997 (statusd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
So it's not just relay nodes.
Looks like we got hit again, 5 hosts at the same time:
mail-01.do-ams3.eth.prod
mail-02.do-ams3.eth.prod
node-06.ac-cn-hongkong-c.eth.prod
node-07.ac-cn-hongkong-c.eth.prod
node-08.ac-cn-hongkong-c.eth.prod
All of them experienced the same traffic spike, starting at the same time:
Wait, it's not just select hosts; other hosts saw a spike in traffic too, like node-03.ac-cn-hongkong-c.eth.prod:
But it didn't kill the node:
Yup, pretty much all hosts had the network traffic spike:
But only some fail, probably because they were not restarted recently and already had high memory usage.
We can see a bit of an increase in envelopes, but not a huge one:
Nothing compared to the network traffic.
But a much clearer spike in p2p traffic:
So it's not the number of envelopes; either the envelopes are big, or it's just high-volume p2p traffic.
And a very big spike in receiving expired envelopes:
Not sure what that means though.
I graphed the waku_envelopes_size_bytes_bucket histogram and I'm not seeing a spike in bigger envelopes:
Which means we're probably not being spammed by big envelopes.
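Roughly how one could pull the same envelope size percentile straight out of Prometheus, for anyone who wants to double-check; a sketch assuming Prometheus is reachable on localhost:9090 (URL, window and quantile are just examples):

```sh
# ~95th percentile of envelope size over 5m windows, last 24h
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(waku_envelopes_size_bytes_bucket[5m])) by (le))' \
  --data-urlencode "start=$(date -u -d '24 hours ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=300' | jq '.data.result[0].values[-10:]'
```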
If we go by p2p inbound traffic we can see there have been 4 spikes in the last 2 days, not regular:
But today was the first really big spike in envelope cache failures:
And matching spike in received envelopes:
But not in new envelopes, as we can see in the graph in https://github.com/status-im/infra-eth-cluster/issues/46#issuecomment-1017618136.
@Samyoul do you have any thoughts about this? Can you think of any reason this might not be caused by external events?
Today we had simultaneous failure of 4 production nodes:
node-05.do-ams3.eth.prod
mail-01.do-ams3.eth.prod
node-04.ac-cn-hongkong-c.eth.prod
node-08.gc-us-central1-a.eth.prod
All of which were SWAPping heavily when I checked them after the Consul healthcheck failures.
We see gradual CPU load growth, and a sudden spike in memory and SWAP usage until there's none left:
We also see a noticeable rise in allocated sockets right before the memory usage spike appears:
The network traffic is quite irregular, but there's no clear spike at the time:
If we look at the long-term pattern across 7 days, we see a gradual increase in memory usage and then a big spike:
Though the counts of TCP socket states do not correlate with this growth:
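To cross-check the socket numbers directly on an affected host, outside of the exporter graphs, something like this should do; the devp2p port below is an assumption:

```sh
# kernel-wide socket accounting (alloc, inuse, orphan, tw)
cat /proc/net/sockstat

# summary of TCP socket states
ss -s

# established connections on the assumed devp2p listener port
ss -Htn state established '( sport = :30303 )' | wc -l
```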
@Samyoul @cammellos any ideas? If we want to tackle this by increasing host size it will cost quite a lot, and I'm not sure the issue actually is just a scaling issue, since our peer counts have not changed much over the past 2 months.
Although, indeed, if we look closer at the mail-03.do-ams3.eth.prod peer metrics, there was a spike in the number of peers:
At the time we had a clear spike in envelope cache failures:
But if we look closer there is a spike in mobile peers:
Which appears to be periodic:
And 12:30 appears to be the time it occurs. What is this? Some kind of testing?
What's weird is that the spikes come from the 0.67.0 version:
At least that's how they are advertised.
Which is quite old, from 23rd of December 2020: https://github.com/status-im/status-go/releases/tag/v0.67.0
And considering the gradually increasing scale of the spike this might be malicious probing by some script kiddie.
One idea suggested by @cammellos was that it might be an issue with the canary, but that's a different architecture and version:
admin@node-01.do-ams3.canary.hq:/docker/cabot/repos/status-go % git status
HEAD detached at v0.63.12
nothing to commit, working tree clean
I collected all currently connected peers using the 0.67.0 version with:
ansible mail,whisper -a '/docker/statusd-mail/rpc.sh admin_peers | jq -c ".result[] | { enode, name }" | grep "0.67.0"'
And this is what I got, 38 unique enodes: https://gist.github.com/jakubgs/a898c9d527a59c848b569ae407fe96b5 But if I ignore peer IDs and ports I get only 26 IP addresses:
106.57.73.140
106.61.114.59
106.61.135.1
107.148.202.41
112.96.106.134
113.102.233.45
117.136.72.201
117.136.72.31
154.84.1.121
154.84.1.16
161.35.191.40
182.97.219.6
183.224.215.205
183.225.31.40
192.74.239.67
206.119.125.39
220.197.235.193
223.104.174.151
223.104.174.17
39.144.145.107
39.144.145.244
42.243.22.49
43.153.3.60
45.11.104.46
45.11.104.78
51.89.122.51
And these IPs appear to be mostly coming from China:
> awk -F'[@:]' '/enode/{print $4}' list | sort -u | xargs -I{} bash -c 'whois {} | grep netname' | sort -u
netname: CHINANET-GD
netname: CHINANET-JX
netname: CHINANET-YN
netname: CMNET
netname: HK-MISAKA-HKG12
netname: OVH-DEDICATED-FO
netname: Time_Quantum_Technology
netname: UNICOM
netname: UNICOM-GD
> awk -F'[@:]' '/enode/{print $4}' list | sort -u | xargs -I{} bash -c 'whois {} | grep descr' | sort -u
descr: Beijing 100032
descr: Beijing 100140 ,P.R.China
descr: China Mobile communications corporation
descr: China Mobile Communications Corporation
descr: CHINANET Guangdong province network
descr: CHINANET JIANGXI PROVINCE NETWORK
descr: CHINANET YunNan PROVINCE NETWORK
descr: China Telecom
descr: China Unicom
descr: China Unicom CHINA169 Guangdong Province Network
descr: China Unicom Guangdong province network
descr: China Unicom Network
descr: Data Communication Division
descr: Failover IPs
descr: Internet Service Provider in China
descr: Misaka Network, Inc.
descr: Mobile Communications Network Operator in China
descr: No.21 Financial Street,Xicheng District,
descr: No.31,jingrong street
descr: Time Quantum Technology LTD
I will attempt to do similar analytics today starting from 12:00 UTC, to get some more info on who this is.
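The collection itself can be a dumb loop around the same admin_peers call as before; a rough sketch of what I have in mind, reusing the rpc.sh wrapper (interval and file naming are arbitrary):

```sh
# snapshot connected peer enodes every ~3 minutes, named by UTC time (e.g. 1307.log)
while true; do
  /docker/statusd-mail/rpc.sh admin_peers | jq -r '.result[].enode' > "$(date -u +%H%M).log"
  sleep 180
done
```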
Interesting, something happened overnight to our hosts in Hong Kong:
Looks like the metrics disappeared for the Hong Kong hosts.
Weird, I can see the metrics just fine in Prometheus:
It's worth noting that only history (mail) nodes get the spikes. Relay (whisper) nodes do not.
Currently our history nodes have around 2-4 GB of RAM. To double the resources it would cost ~370 EUR per month:
(3*20) + (3*49) + (3*52) = 60 + 147 + 156 = 363 EUR
Although the Google Cloud history nodes already have 4 GB and they also see a lot of SWAPping:
One thing I could try is bumping up the container memory limits:
admin@mail-01.gc-us-central1-a.eth.prod:/docker/statusd-mail % cat docker-compose.yml | grep -A2 resources
  resources:
    # some limits due to statusd hogging resources
    limits:
      memory: '2576m'
I bumped the Digital Ocean hosts to s-2vcpu-4gb: https://github.com/status-im/infra-eth-cluster/commit/5997c5ff
> a mail -a 'free -m'
mail-01.do-ams3.eth.prod | CHANGED | rc=0 >>
total used free shared buff/cache available
Mem: 3931 871 1383 139 1676 2713
Swap: 2047 0 2047
mail-02.do-ams3.eth.prod | CHANGED | rc=0 >>
total used free shared buff/cache available
Mem: 3931 604 135 92 3191 3063
Swap: 2047 0 2047
mail-03.do-ams3.eth.prod | CHANGED | rc=0 >>
total used free shared buff/cache available
Mem: 3931 798 103 139 3029 2757
Swap: 2047 172 1875
We had another spike, like clockwork. About the same size as yesterday:
I collected the peer enodes and names using the admin_peers RPC call:
> wc -l 1*
97 1307.log
124 1310.log
150 1314.log
186 1317.log
335 1320.log
469 1324.log
463 1327.log
455 1330.log
407 1334.log
262 1339.log
217 1343.log
169 1347.log
147 1351.log
117 1355.log
105 1400.log
3703 total
The IPs appear to be mostly from China:
> awk -F'[@:]' '/enode/{print $4}' 1324.log | sort -u | xargs -n1 whois | awk -F: '/descr/{l[$2]++}END{for(n in l){print l[n] " - " n}}' | sort -h | tail -n16
5 - No.21-3, Sec. 1, Xinyi Rd., Taipei 10048, Taiwan, R.O.C.
5 - No.21, Sec.1, Xinyi Rd., Taipei City
5 - Taipei Taiwan
6 - YISP
6 - Your Internet Service Provider
7 - China Mobile communications corporation
10 - Chunghwa Telecom Co.,Ltd.
11 - China Unicom
14 - Internet Service Provider in China
14 - Mobile Communications Network Operator in China
15 - China Mobile Communications Corporation
15 - Time Quantum Technology LTD
22 - CHINANET YunNan PROVINCE NETWORK
30 - Beijing 100032
30 - No.31,jingrong street
31 - China Telecom
And there are two that generate the majority of peers:
> awk -F'[@:]' '/enode/{l[$4]++}END{for (a in l) { print l[a] " - " a}}' 1324.log | sort -h | tail -n5
7 - 92.118.236.91
8 - 206.119.125.40
10 - 42.243.22.112
24 - 45.11.104.78
30 - 45.11.104.46
One simple thing we could try today is adding the noisiest IPs to the sshguard4 ipset:
admin@mail-01.do-ams3.eth.prod:~ % t -L INPUT
Chain INPUT (policy DROP)
target prot opt source destination
DROP all -- anywhere anywhere match-set sshguard4 src
ACCEPT all -- anywhere anywhere /* loopback interface */
ACCEPT all -- anywhere anywhere /* docker0 interface */
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT icmp -- anywhere anywhere /* ICMP */
CONSUL all -- anywhere anywhere /* CONSUL */
SERVICES all -- anywhere anywhere /* SERVICES */
VPN all -- anywhere anywhere /* VPN */
And just drop the packets on the firewall.
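Since the sshguard4 ipset is already wired into the INPUT chain above, banning the noisiest peers is just a matter of adding their addresses to it; something along these lines, using the two top IPs from the earlier count as examples:

```sh
# add the most active sources to the existing sshguard4 ipset
for ip in 45.11.104.46 45.11.104.78; do
  ipset add sshguard4 "$ip"
done

# verify
ipset list sshguard4 | tail
```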
In theory we could also have SSHGuard analyze status-go logs to find abusers and ban them automatically.
The issue with that is that I don't know which log messages would be good for that, since all these peers do is make valid connections. @Samyoul @cammellos any ideas?
Actually, I don't think sshguard can easily be extended to parse custom logs:
https://bitbucket.org/sshguard/sshguard/src/653c2ddc/src/parser/attack_scanner.l#lines-180
Today there was no spike:
Interesting.
Looks like the spikes have calmed down, but the 30-70 nodes using 0.67.0 still hang around:
Might be calm before the storm.
Since this has calmed down I'm closing this, but might reopen if we see this behavior again.
Some of our hosts are seeing traffic spikes that lead to spikes in memory and CPU usage, followed by unavailability:
The host then appears unresponsive and times out when connections are attempted: