saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] Substantial performance degradation on TCP transport after transport refactor #61656

Open lukasraska opened 2 years ago

lukasraska commented 2 years ago

Description The big transport refactor (https://github.com/saltstack/salt/pull/61450), which was meant to decouple general logic from the transports and ease the implementation of new transport methods, brought an unwanted performance degradation to the TCP transport.

In particular, the degradation shows up when a new job is published to specific minions (topic_lst) instead of being broadcast. The previous logic for publishing a job was pretty straightforward (https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1517, a minimal sketch follows the list):

  1. Check if topic_lst was passed (this contains the list of all minions the job is meant for)
  2. Loop through all topics (all minions)
  3. Look up the connected clients for that minion (there might be multiple connections for a single minion)
  4. Write the job to each client's socket
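
A minimal sketch of that pre-refactor shape, with hypothetical names (present as a dict mapping a minion id to its connected clients, client.write() standing in for the socket write), not the exact Salt code:

```python
# Pre-refactor shape: a targeted publish is one dict lookup per target.
def publish_old(payload, topic_lst, present, clients):
    if topic_lst:
        for topic in topic_lst:                    # one iteration per targeted minion
            for client in present.get(topic, ()):  # O(1) lookup of that minion's clients
                client.write(payload)
    else:
        for client in clients:                     # broadcast: write to every client
            client.write(payload)
```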

After the refactor the logic is mostly inverted (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L910, again sketched after the list):

  1. Check if topic_lst was passed
  2. Loop through all topics
  3. Loop through all clients
  4. Check if the client matches the topic and, if so, write the job to the socket
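
The post-refactor shape, again as a rough sketch with the same hypothetical names (the minion id on a client is assumed here to be client.id_):

```python
# Post-refactor shape: for every topic the whole client set is scanned, so a
# targeted publish costs len(topic_lst) * len(clients) iterations worst case.
def publish_new(payload, topic_lst, clients):
    if topic_lst:
        for topic in topic_lst:         # per targeted minion ...
            for client in clients:      # ... scan every connected client
                if client.id_ == topic:
                    client.write(payload)
    else:
        for client in clients:          # broadcast
            client.write(payload)
```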

In practice this means that if there are 10k connected minions and a new job should be published to just a single minion, the old logic is immediate (one iteration through the topic_lst loop and one dict lookup), while after the refactor the worst case is 10k iterations of the clients loop. If we publish a job to multiple minions, this only gets worse. For a job targeting 5k minions in a 10k minion setup, the previous cost was roughly proportional to the number of minions in the job, while now it is the number of minions in the job times the number of connected clients (roughly 5k x 10k iterations).

The issue stems from how the clients are looked up and how the authentication mechanism works in Salt:

  1. When a minion connects, it is added into the self.clients set (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L905, https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1512)
  2. The stream is read and the data is processed (https://github.com/saltstack/salt/blob/master/salt/transport/tcp.py#L885, https://github.com/saltstack/salt/blob/v3004/salt/transport/tcp.py#L1482)
  3. The minion is authenticated and added into the present dict; while in the previous version this dict lived directly in the transport internals, it is now decoupled and maintained through the server channel's presence_callback (https://github.com/saltstack/salt/blob/master/salt/channel/server.py#L701)
  4. When a new job is published, the previous logic looked the target up in the presence dictionary (a plain dict lookup); that dictionary is no longer available to the TCP transport classes

This also introduces an additional DoS-type issue: one can write a simple TCP client that merely connects to the master and does nothing else (just holding the connection open), and that client will still be iterated over in the publish loop, since the loop no longer relies on the presence dictionary.
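
As an illustration of that hanging-connection case (the host name is a placeholder, and 4505 is assumed to be the master's publish_port for the TCP transport):

```python
# A bare TCP client that connects to the master's publish port and does
# nothing else. After the refactor such a connection still ends up in
# self.clients and is scanned on every publish.
import socket
import time

sock = socket.create_connection(("salt-master.example.com", 4505))  # placeholder host
time.sleep(3600)  # just hold the connection open
sock.close()
```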

Setup (Please provide relevant configs and/or SLS files; be sure to remove sensitive info. There is no general set-up of Salt.) Any new Salt master using the TCP transport.

Please be as specific as possible and give set-up details.

Steps to Reproduce the behavior See above

Expected behavior Publishing a single job on a master with 10k minions doesn't take dozens of seconds.

Screenshots n/a

Versions Report n/a, latest master branch

Additional context n/a

lukasraska commented 2 years ago

I'm willing to write a PR to address this, as discussed on Slack with @waynew, but I would definitely need some direction from the Salt engineering team (FYI @dwoz), as none of the options I can think of are ideal.

The easiest option would be to change topic_lst to a set, loop through the clients instead, and perform a set lookup on each client's id. While this cut the total time roughly in half in my tests, it's still far from ideal.
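
A sketch of what I mean, with the same hypothetical names as above:

```python
# Option 1 sketch: build a set from topic_lst and invert the loops, so the
# cost becomes a single pass over the connected clients with an O(1)
# membership test per client.
def publish_set_lookup(payload, topic_lst, clients):
    if topic_lst:
        topics = set(topic_lst)
        for client in clients:          # single pass over all clients
            if client.id_ in topics:    # set lookup instead of a nested loop
                client.write(payload)
    else:
        for client in clients:
            client.write(payload)
```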

The second option would be to pass the present dict from the server channel down to the transport's publish_payload and retain the previous logic. But that requires the dict to be passed for all transports.
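
Roughly like this (hypothetical names and signature, present being the server channel's presence mapping):

```python
# Option 2 sketch: the server channel passes its presence mapping into
# publish_payload so the transport can restore the old dict-lookup logic.
class TCPPublishServerSketch:
    def __init__(self, clients):
        self.clients = clients              # set of all connected clients

    def publish_payload(self, payload, topic_lst=None, present=None):
        if topic_lst and present is not None:
            for topic in topic_lst:
                for client in present.get(topic, ()):  # O(1) per target again
                    client.write(payload)
        else:
            for client in self.clients:     # broadcast fallback
                client.write(payload)
```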

The third option would be to wrap the presence callback and essentially maintain a copy of the present dict in the TCP transport. While this is independent of the other transports, it adds extra memory load to the master process.
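
Something along these lines (hypothetical names; the transport hooks whatever reports presence changes and keeps its own id-to-clients mapping):

```python
# Option 3 sketch: maintain a second id -> clients mapping inside the TCP
# transport, updated as clients authenticate/disconnect, and publish via
# dict lookups.
from collections import defaultdict

class PresenceTrackingTransportSketch:
    def __init__(self):
        self.clients = set()
        self.present = defaultdict(set)      # minion id -> connected clients

    def on_client_authenticated(self, client, minion_id):
        client.id_ = minion_id
        self.clients.add(client)
        self.present[minion_id].add(client)

    def on_client_disconnected(self, client):
        self.clients.discard(client)
        if getattr(client, "id_", None) is not None:
            self.present[client.id_].discard(client)

    def publish_payload(self, payload, topic_lst=None):
        if topic_lst:
            for topic in topic_lst:
                for client in self.present.get(topic, ()):
                    client.write(payload)
        else:
            for client in self.clients:
                client.write(payload)
```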

What do you think?

dwoz commented 2 years ago

@lukasraska Here is a lightly tested patch. It should get you pointed in the right direction.

0001-TCP-transport-optimization.patch.txt