sensu / sensu-go


load testing and profiling key take-away notes #1276

Closed · echlebek closed this issue 5 years ago

echlebek commented 6 years ago

Today, a load test of sensu-backend and sensu-agent was performed on GCP. Two nodes were involved in the test:

sensu-backend: n1-standard-8 (8 vCPUs, 30 GB memory), 100 GB SSD

sensu-agent: n1-highcpu-32 (32 vCPUs, 28.8 GB memory)

1000 agents were registered. Afterwards, a check was created to run `ping -c1 localhost` at a frequency of once per second. (This is the highest configurable check rate.)
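For reference, this is roughly the per-agent work that check implies; a minimal sketch (not the agent's actual scheduler code) that spawns one `ping -c1 localhost` subprocess per second:

```go
// Sketch only: approximates what each registered agent does for the
// 1-second ping check, i.e. fork/exec one ping subprocess per tick.
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		start := time.Now()
		err := exec.Command("ping", "-c1", "localhost").Run()
		log.Printf("check ran in %v, err=%v", time.Since(start), err)
	}
}
```

With 1000 agent processes on a single node, that is on the order of 1000 subprocess executions per second, which is why agent-side CPU matters in this test.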

An additional 2000 agents were registered. No errors were seen in the logs, but I suspect we were losing log data in the flurry of log activity from the application. (Every check request to every agent was logged at the debug level, and we don't yet have a way to set the log level.)

By all other indications everything was stable at 3000 agents.

An additional 2000 agents were registered.

At 4600 agents, deregistrations began to occur. The exact cause of these particular deregistrations has not been determined. In past tests, deregistration was highly correlated with etcd read timeouts, but it has also occurred due to deadlock conditions and socket timeouts.

While some etcd instability was observed at high write volume in the small initial load tests, it became clear later on that the standard 10 GB disk simply couldn't deliver enough IOPS to feasibly run etcd under load. Once an SSD was used, no etcd timeouts occurred until a high load was placed on the service.

Note that GCP provisions its SSD IOPS per gigabyte. We used a 100 GB SSD, which is rated for 3,000 sustained IOPS (read and write) and 48 MB/s of throughput.
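As a back-of-the-envelope check, those figures follow from GCP's per-GB rates for SSD persistent disks; the 30 IOPS/GB and 0.48 MB/s per GB used below are an assumption, but they reproduce the numbers quoted above:

```go
// Rough capacity math for a provisioned SSD persistent disk. The per-GB
// rates are assumptions taken from GCP's published pd-ssd figures.
package main

import "fmt"

func main() {
	const (
		diskGB          = 100  // provisioned disk size
		iopsPerGB       = 30   // sustained IOPS per GB, read and write
		throughputPerGB = 0.48 // MB/s per GB
	)
	fmt.Printf("sustained IOPS: %d\n", diskGB*iopsPerGB)              // 3000
	fmt.Printf("throughput:     %.0f MB/s\n", diskGB*throughputPerGB) // 48
}
```

Sizing the disk for IOPS rather than capacity is therefore the relevant knob when etcd is the bottleneck.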

At lower load levels we saw a lot of CPU time being spent on marshaling and unmarshaling. This is something that is easy for us to optimize.

(pprof) top10 -cum
Showing nodes accounting for 3.18s, 5.23% of 60.79s total
Dropped 954 nodes (cum <= 0.30s)
Showing top 10 nodes out of 349
      flat  flat%   sum%        cum   cum%
     0.01s 0.016% 0.016%     17.07s 28.08%  github.com/sensu/sensu-go/backend/eventd.(*Eventd).startHandlers.func1
     0.01s 0.016% 0.033%     17.04s 28.03%  github.com/sensu/sensu-go/backend/eventd.(*Eventd).handleMessage
     0.10s  0.16%   0.2%     15.59s 25.65%  runtime.systemstack
     0.01s 0.016%  0.21%     14.30s 23.52%  encoding/json.Unmarshal
     0.01s 0.016%  0.23%     13.95s 22.95%  encoding/json.(*decodeState).unmarshal
     0.29s  0.48%  0.71%     13.92s 22.90%  encoding/json.(*decodeState).object
     0.03s 0.049%  0.76%     13.92s 22.90%  encoding/json.(*decodeState).value
     0.03s 0.049%  0.81%     12.93s 21.27%  github.com/sensu/sensu-go/types.(*Event).UnmarshalJSON
     0.05s 0.082%  0.89%     12.62s 20.76%  github.com/sensu/sensu-go/types/dynamic.Unmarshal
     2.64s  4.34%  5.23%     10.75s 17.68%  runtime.mallocgc
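As a rough way to quantify that marshaling cost before optimizing it, a benchmark along these lines would give per-event time and allocation numbers; the Event struct here is a simplified stand-in for illustration, not the actual sensu-go types.Event:

```go
// Sketch of a benchmark measuring JSON unmarshaling cost per event.
// The Event type below is a simplified stand-in, not the real types.Event.
package events

import (
	"encoding/json"
	"testing"
)

type Event struct {
	Entity    string            `json:"entity"`
	Check     string            `json:"check"`
	Status    int               `json:"status"`
	Output    string            `json:"output"`
	Timestamp int64             `json:"timestamp"`
	Labels    map[string]string `json:"labels"`
}

var payload = []byte(`{"entity":"agent-0001","check":"ping","status":0,` +
	`"output":"1 packets transmitted, 1 received","timestamp":1514764800,` +
	`"labels":{"region":"us-central1"}}`)

func BenchmarkUnmarshalEvent(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var e Event
		if err := json.Unmarshal(payload, &e); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench=UnmarshalEvent -benchmem` gives a baseline to compare against any alternative encoding or a hand-rolled unmarshaler.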

The test generated an unfortunately large amount of log output, and that behaviour needs to be dealt with before we run another load test; it was interfering with both system operation and visibility.
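One way to keep that from recurring in the next run would be a configurable log level. A minimal sketch, assuming a logrus-style logger (the `--log-level` flag name and wiring here are hypothetical, not an existing sensu-backend option):

```go
// Hypothetical --log-level flag for a logrus-based process. Raising the
// default level to warn silences the per-check debug lines during load tests.
package main

import (
	"flag"
	"log"

	"github.com/sirupsen/logrus"
)

func main() {
	levelFlag := flag.String("log-level", "warn", "one of: panic, fatal, error, warn, info, debug")
	flag.Parse()

	level, err := logrus.ParseLevel(*levelFlag)
	if err != nil {
		log.Fatalf("invalid log level %q: %v", *levelFlag, err)
	}
	logrus.SetLevel(level)

	logrus.Debug("suppressed at the default warn level")
	logrus.Warn("running with reduced log verbosity")
}
```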

nikkictl commented 6 years ago

Wondering if this is a typo... but I would expect the sensu-backend to be the high-cpu model and the sensu-agent to be the more standard machine. Unless I'm misunderstanding something!

echlebek commented 6 years ago

@nikkiattea, in a production scenario, you would typically have more cores running agents than running the backend. For example, a single four-core box might monitor 100 four-core boxes. I selected a high-cpu model because we didn't need a lot of RAM, and we needed to be able to execute thousands of subprocesses per second on the agent node for the ping checks.

nikkictl commented 6 years ago

Oh right right, we have all of the agent processes running on the same machine. Hence more compute. Totally forgot about that!

annaplotkin commented 5 years ago

Closing.