kravietz / pam_tacplus

TACACS+ protocol client library and PAM module in C. This PAM module supports authentication, authorization (account management), and accounting (session management) performed using the TACACS+ protocol designed by Cisco.
GNU Lesser General Public License v3.0

getpeername - Transport endpoint is not connected #154

Closed sburnwal closed 4 years ago

sburnwal commented 4 years ago

I have been using libtac for TACACS+ authentication. While it works fine most of the time, occasionally I get this error from connect.c. It is happening in the getpeername() call in connect.c. Do you know what the issue could be?

tac_connect_single: connection failed with server: Transport endpoint is not connected

sburnwal commented 4 years ago

I think I found the issue. I printed the fd value used, and I see that whenever the fd value returned by the socket() call is > 1024, I get this error. Then I read the BUGS section of this Linux man page: https://man7.org/linux/man-pages/man2/socket.2.html

POSIX allows an implementation to define an upper limit, advertised
       via the constant FD_SETSIZE, on the range of file descriptors that
       can be specified in a file descriptor set.  The Linux kernel imposes
       no fixed limit, but the glibc implementation makes fd_set a fixed-
       size type, with FD_SETSIZE defined as 1024, and the FD_*() macros
       operating according to that limit.  To monitor file descriptors
       greater than 1023, use poll(2) or epoll(7) instead.

Now I want to know: what is the possible solution to this issue? Shall we replace select() with poll()?

gollub commented 4 years ago

@deastoe , we haven't run into that yet in tacplusd, because we maintain a lower number of fds, right?

sburnwal commented 4 years ago

Just to confirm with logs: I see fd > 1024 occasionally.

2020-06-12 14:28:30 tac_connect_single: connected to server
2020-06-12 14:28:30 tac_connect_single: exit status=0 (fd=64)
2020-06-12 14:28:30 tac_connect_single: connected to server
2020-06-12 14:28:30 tac_connect_single: exit status=0 (fd=64)
2020-06-12 14:31:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:31:00 tac_connect_single: exit status=-9 (fd=1026)
2020-06-12 14:31:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:31:00 tac_connect_single: exit status=-9 (fd=1026)
2020-06-12 14:33:30 tac_connect_single: connected to server
2020-06-12 14:33:30 tac_connect_single: exit status=0 (fd=1027)
2020-06-12 14:33:30 tac_connect_single: connected to server
2020-06-12 14:33:30 tac_connect_single: exit status=0 (fd=1027)
2020-06-12 14:36:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:36:00 tac_connect_single: exit status=-9 (fd=1028)
2020-06-12 14:36:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:36:00 tac_connect_single: exit status=-9 (fd=1028)
2020-06-12 14:38:30 tac_connect_single: connected to server
2020-06-12 14:38:30 tac_connect_single: exit status=0 (fd=946)
2020-06-12 14:38:30 tac_connect_single: connected to server
2020-06-12 14:38:30 tac_connect_single: exit status=0 (fd=946)
2020-06-12 14:41:01 tac_connect_single: connected to server
2020-06-12 14:41:01 tac_connect_single: exit status=0 (fd=946)

But I do want to know: since when have lower-numbered FDs been in place? Has that been the case since this library was first posted on GitHub, or only since some later change? The libtac deployed on my side is about 2 years old.

gollub commented 4 years ago

I'm not aware of any libtac change in the last 2 years that changed the socket creation behavior.

Are you implementing a TACACS+ client or TACACS+ server?

The case I was referring to with the "low number of FDs" was a TACACS+ client implementation using libtac, but not pam_tacplus. That implementation might never run into this situation, since we have never needed more than a handful of sockets/FDs during the process lifecycle.

I wonder why you have that many FDs. I guess the pam_tacplus use cases and other TACACS+ client implementations using libtac have not hit that problem, since they just don't maintain that many connections simultaneously.

At least I'm not yet aware of any TACACS+ server implementation using libtac. But obviously any contribution to make libtac usable in that deployment role would be more than welcome.

sburnwal commented 4 years ago

Yes, I am using libtac in one of my web server applications, which connects to a TACACS+ server to handle logins to the web server. At times my server is busy opening FDs for web clients, and that results in the web server using an FD > 1024 for connecting to the TACACS+ server.

gollub commented 4 years ago

Originally libtac was only used by pam_tacplus, I guess, so there was never a need to deal with such high FD numbers. But with the changes we introduced to make libtac a shared library, there are probably now more and more users of the TACACS+ protocol implementation using it in long-living processes.

Using poll() rather than select() is probably the right way to go to support high FD numbers.

sburnwal commented 4 years ago

Can you also confirm that although connect.c sets the fd non-blocking and uses select(), the read and write calls for the actual authentication data, as in authen_s.c, are blocking? I do not see select() being used there.

sburnwal commented 4 years ago

As far as I can see, there is no non-blocking I/O for the actual auth or accounting packets. Therefore I have switched to a blocking connection test, relying on tcp_syn_retries, instead of the non-blocking select() operation. That works fine for me.