Closed sburnwal closed 4 years ago
I think I found the issue. I printed the fd value in use, and whenever the fd returned by the socket(..) call is > 1024, I get this error. Then I read this Linux doc, https://man7.org/linux/man-pages/man2/socket.2.html, under the BUGS section:
POSIX allows an implementation to define an upper limit, advertised
via the constant FD_SETSIZE, on the range of file descriptors that
can be specified in a file descriptor set. The Linux kernel imposes
no fixed limit, but the glibc implementation makes fd_set a fixed-
size type, with FD_SETSIZE defined as 1024, and the FD_*() macros
operating according to that limit. To monitor file descriptors
greater than 1023, use poll(2) or epoll(7) instead.
Now I want to know: what is a possible solution to this issue? Shall we replace select() with poll()?
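To make the limit quoted above concrete, here is a minimal sketch of a guard that could run before any FD_SET() call. The helper name fd_fits_in_fd_set is hypothetical and not part of libtac; the only assumption is that the fd is about to be placed in an fd_set for select().

```c
#include <sys/select.h>

/*
 * Hypothetical helper, not part of libtac.
 * Returns 1 if fd can safely be stored in an fd_set, 0 otherwise.
 * Valid indices are 0 .. FD_SETSIZE - 1, so an fd equal to FD_SETSIZE
 * is already out of range.
 */
static int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A caller could fail early with a clear error when this returns 0, instead of passing an out-of-range fd to FD_SET() and select(), where the behavior is undefined.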
@deastoe, we haven't run into that yet in tacplusd because we maintain a lower number of fds, right?
Just to confirm, here are the logs showing that I occasionally see fd > 1024.
2020-06-12 14:28:30 tac_connect_single: connected to server
2020-06-12 14:28:30 tac_connect_single: exit status=0 (fd=64)
2020-06-12 14:28:30 tac_connect_single: connected to server
2020-06-12 14:28:30 tac_connect_single: exit status=0 (fd=64)
2020-06-12 14:31:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:31:00 tac_connect_single: exit status=-9 (fd=1026)
2020-06-12 14:31:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:31:00 tac_connect_single: exit status=-9 (fd=1026)
2020-06-12 14:33:30 tac_connect_single: connected to server
2020-06-12 14:33:30 tac_connect_single: exit status=0 (fd=1027)
2020-06-12 14:33:30 tac_connect_single: connected to server
2020-06-12 14:33:30 tac_connect_single: exit status=0 (fd=1027)
2020-06-12 14:36:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:36:00 tac_connect_single: exit status=-9 (fd=1028)
2020-06-12 14:36:00 tac_connect_single: connection failed with server: Transport endpoint is not connected
2020-06-12 14:36:00 tac_connect_single: exit status=-9 (fd=1028)
2020-06-12 14:38:30 tac_connect_single: connected to server
2020-06-12 14:38:30 tac_connect_single: exit status=0 (fd=946)
2020-06-12 14:38:30 tac_connect_single: connected to server
2020-06-12 14:38:30 tac_connect_single: exit status=0 (fd=946)
2020-06-12 14:41:01 tac_connect_single: connected to server
2020-06-12 14:41:01 tac_connect_single: exit status=0 (fd=946)
But I do want to know: since when have lower-numbered FDs been in place? Has it been that way since this library was first posted on GitHub, or did it change sometime later? On my side, the deployed libtac is about 2 years old.
I'm not aware of any libtac change which changes the socket creation behavior in the last 2 years.
Are you implementing a TACACS+ client or TACACS+ server?
The case I was referring to with "low number of FDs" is a TACACS+ client implementation using libtac, but not pam_tacplus. In that implementation we may simply not have run into that situation yet, since we never needed more than a handful of sockets/FDs during the process lifecycle.
I wonder why you have that many FDs. I guess the pam_tacplus use cases and other TACACS+ client implementations using libtac have not hit that problem because they just don't maintain that many connections simultaneously.
At least I'm not yet aware of any TACACS+ server implementation using libtac. But obviously any contribution to make libtac usable in that deployment role would be more than welcome.
Yes, I am using libtac in one of my web server applications, which connects to a TACACS server to handle logins to my web server. At times my server is busy opening FDs for web clients, and that results in the web server using an FD > 1024 to connect to the TACACS server.
Originally libtac was only used by pam_tacplus, I guess, so there was never a need to deal with that high a number of FDs. But with the changes we introduced to make libtac a shared library, there are probably now more and more users of the TACACS+ protocol implementation running it in long-lived processes.
Using poll() rather than select() is probably the right way to go to support high FD numbers.
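For illustration, a non-blocking connect that waits with poll() might look roughly like the sketch below. This is not the code in connect.c; the function name connect_with_poll and its parameters are assumptions made for the example. Since poll() takes an array of descriptors rather than a fixed-size bitmap, the FD_SETSIZE limit does not apply.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not libtac's API.
 * Starts a non-blocking connect on fd and waits up to timeout_ms for it
 * to finish. Returns 0 on success, -1 on failure with errno set.
 */
static int connect_with_poll(int fd, const struct sockaddr *addr,
                             socklen_t addrlen, int timeout_ms)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    if (connect(fd, addr, addrlen) == 0)
        return 0;                      /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;                     /* hard failure */

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc;
    do {
        /* Note: retrying after EINTR restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc == 0) {
        errno = ETIMEDOUT;             /* nothing happened in time */
        return -1;
    }
    if (rc < 0)
        return -1;

    /* The socket became writable: read the actual connect result. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

Checking SO_ERROR after poll() reports the connect outcome directly, so a failed connection surfaces as its real error (for example ECONNREFUSED) rather than as a later getpeername() failure.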
Can you also confirm that, although connect.c sets the fd to non-blocking and calls select() later, the read and write calls for the actual authentication data, as in authen_s.c, are blocking? I do not see select() being used there.
As far as I can see, there is no non-blocking call for the actual auth or accounting packets. Therefore I have switched to a blocking connection test, using tcp_syn_retries, instead of the non-blocking select() operation. That works fine for me.
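The comment above does not say how tcp_syn_retries was set. One possibility, shown here purely as an assumption, is the Linux-specific per-socket option TCP_SYNCNT, which bounds the number of SYN retransmissions for a blocking connect() without changing the system-wide net.ipv4.tcp_syn_retries sysctl. The helper name set_syn_retries is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, Linux-specific (TCP_SYNCNT).
 * Limits how many SYN retransmits a later blocking connect() on fd will
 * attempt before giving up. Returns 0 on success, -1 with errno set.
 */
static int set_syn_retries(int fd, int retries)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_SYNCNT,
                      &retries, sizeof(retries));
}
```

With this approach the socket stays blocking, so no fd_set is involved and the FD_SETSIZE limit never comes into play. The trade-off is that the timeout is expressed as a retry count rather than an exact duration.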
I have been using libtac for TACACS authentication. While it works fine most of the time, occasionally I get this error from connect.c. It happens in the getpeername(..) call in connect.c. Do you know what the issue could be?
tac_connect_single: connection failed with server: Transport endpoint is not connected