sonic-net / SONiC

Landing page for Software for Open Networking in the Cloud (SONiC) - https://sonic-net.github.io/SONiC/

TACACS long wait when configuring non active server with high priority #1462

Open ycoheNvidia opened 1 year ago

ycoheNvidia commented 1 year ago

We have encountered an issue while configuring and authenticating against TACACS+ servers.

What is happening

The problem occurs when a non-active or nonexistent server (server A) is configured with a higher priority than an active server (server B), with AAA fallback and failthrough enabled. If we then authenticate as a remote user defined on server B, for example over ssh, establishing the connection takes a long time, typically 30-50 seconds per attempt.

Additional research

For user names that were created locally (such as admin, or others created with the useradd command), or that were previously authenticated and established through a RADIUS server, we did not encounter these delays. After examining the debug logs, we suspect that the source of the issue lies somewhere between Linux PAM and tacplus_pam: while the user connects, PAM calls TACACS server authentication multiple times as it checks user permissions, and each check waits for the full timeout. When a valid server is configured as first priority, we still see these multiple authentication requests logged by the tacplus_pam and libnss tacplus libraries, but since no individual request is significantly delayed, the session is established in a reasonable time (mostly under 2 seconds).
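To make the suspected mechanism concrete, here is a rough back-of-the-envelope sketch of how repeated lookups would add up if each one waits the full timeout on every unreachable server. The lookup count and timeout below are illustrative assumptions, not values measured from the logs, and the function name is hypothetical.

```python
def estimated_login_delay(lookups: int,
                          dead_servers: int,
                          timeout_s: float,
                          live_latency_s: float = 0.0) -> float:
    """Estimate total login delay in seconds.

    Assumes (per the hypothesis in this report) that every PAM/NSS lookup
    walks the server list from the top and waits the full timeout on each
    unreachable server before reaching the live one.
    """
    per_lookup = dead_servers * timeout_s + live_latency_s
    return lookups * per_lookup


# Hypothetical numbers: 8 lookups per login, 5 s timeout, one dead server.
print(estimated_login_delay(lookups=8, dead_servers=1, timeout_s=5.0))  # 40.0
```

Under these assumed numbers the estimate falls in the reported 30-50 second range. With no dead server ahead of the live one, the same lookups cost only the live server's latency.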

We would like to know if this is a known limitation of TACACS in SONiC, since the documentation of the pam_tacplus library used by SONiC specifically states that only one active server is used after the first authentication (from https://github.com/kravietz/pam_tacplus/blob/main/README.md): "Having more that one TACACS+ server defined for given management group has following effects on authentication:

if the first server on the list is unreachable or failing pam_tacplus will try to authenticate the user against the other servers until it succeeds

the first_hit option has been deprecated

when the authentication function gets a positive reply from a server, it saves its address for future use by account management function (see below)

The account management (authorization) function asks only one TACACS+ server and it ignores the whole server list passed from command line. It uses server saved by authentication function after successful authenticating user on that server. We assume that the server is authoritative for queries about that user."
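The behaviour described in the quoted README can be modelled with a small simulation. This is only a sketch of the documented behaviour, not the pam_tacplus implementation: the class, its method names, and the address 10.0.0.2 are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


class TacacsSession:
    """Toy model of the README text: try servers in order, remember the winner."""

    def __init__(self, servers: Sequence[str], try_auth: Callable[[str], bool]):
        self.servers = list(servers)          # already ordered, highest priority first
        self.try_auth = try_auth              # True on a positive reply, False otherwise
        self.active_server: Optional[str] = None
        self.contacted: List[str] = []        # every server we sent a request to

    def authenticate(self) -> bool:
        # Walk the list until one server gives a positive reply.
        for server in self.servers:
            self.contacted.append(server)
            if self.try_auth(server):
                self.active_server = server   # saved for account management
                return True
        return False

    def authorize(self) -> bool:
        # Account management asks only the saved server and ignores the list.
        if self.active_server is None:
            return False
        self.contacted.append(self.active_server)
        return True


# 1.1.1.1 is the dummy server from this report; 10.0.0.2 stands in for the live one.
session = TacacsSession(["1.1.1.1", "10.0.0.2"], lambda s: s == "10.0.0.2")
print(session.authenticate(), session.active_server)  # True 10.0.0.2
print(session.authorize(), session.contacted)
```

If the library behaved exactly like this model, the dead server would be contacted once per login. The repeated timeouts seen in the logs suggest the list is walked from the top more than once.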

Reproduction steps:

  1. Configure an active server with low priority: sudo config tacacs add ACTIVE_SERVER_IP -k ACTIVE_SERVER_KEY -p 1
  2. Configure AAA authentication: sudo config aaa authentication login tacacs+ local
  3. Configure AAA failthrough: sudo config aaa authentication failthrough enable
  4. Authenticate as a remote user (sshpass -p remoteuserpassword ssh remoteuser@switch). This should complete almost immediately.
  5. Configure a non-active/dummy server with higher priority: sudo config tacacs add 1.1.1.1 -p 2
  6. Authenticate as the remote user again (sshpass -p remoteuserpassword ssh remoteuser@switch). This should now take significantly longer.
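
To compare the two login attempts in the reproduction steps with actual numbers, a small timing helper along these lines could be used. It is a minimal sketch: the helper names are hypothetical, and appending the remote command `true` (so that ssh exits right after login) is an addition not present in the original steps.

```python
import subprocess
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], timeout_s: float = 120.0) -> float:
    """Run a command and return the wall-clock seconds it took."""
    start = time.monotonic()
    subprocess.run(cmd,
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL,
                   timeout=timeout_s,
                   check=False)  # a failed login should still be timed
    return time.monotonic() - start


def ssh_login_command(user: str, host: str, password: str) -> List[str]:
    """Build the sshpass/ssh command from the reproduction steps.

    The trailing "true" is an assumption added here so the session
    closes immediately after authentication succeeds.
    """
    return ["sshpass", "-p", password, "ssh", f"{user}@{host}", "true"]
```

Calling `time_command(ssh_login_command("remoteuser", "switch", "remoteuserpassword"))` before and after adding the dummy server gives the two login times to compare.
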
liuh-80 commented 6 months ago

Verified with KVM: one unreachable high-priority TACACS server causes a 9-second delay, which is much shorter than 40-50 seconds, so this appears to be a platform-related issue.

diff --git a/tests/tacacs/test_authorization.py b/tests/tacacs/test_authorization.py
index 1e58776cf..46864bb57 100644
--- a/tests/tacacs/test_authorization.py
+++ b/tests/tacacs/test_authorization.py
@@ -30,7 +30,10 @@ def ssh_connect_remote_retry(remote_ip, remote_username, remote_password, duthos
     retry_count = 3
     while retry_count > 0:
         try:

@@ -256,7 +259,7 @@ def test_authorization_tacacs_only_some_server_down(
     Setup multiple tacacs server for this UT.
     Tacacs server 127.0.0.1 not accessible.
     """