hercules-390 / hyperion

Hercules 390
Other
248 stars 67 forks source link

HHC90509D COMM: accept() failed: Too many open files #156

Closed lnlyssg closed 7 years ago

lnlyssg commented 7 years ago

Today I attempted to connect to my LPAR which had been running around 5 days and found the 3270 emulator was presenting a blank screen only. I logged on to the system's master console which appeared to be unresponsive. I then logged on to the Hercules console and found it swamped with "HHC90509D COMM: accept() failed: Too many open files" messages that were scrolling off the screen faster than I could see. I can't find any hits at all for what this message may mean and at this point could not do anything but quit Hercules as it's essentially unusable in this state. Unfortunately I didn't write to a log file this time but will do so when I restart it and add that info to this issue if it happens again.

I'm running on Ubuntu 16.04 and a build of Hercules from last week.

jphartmann commented 7 years ago

accept() is the system call to process a connection request on a socket, such as a 3270 device or a 1052C. It creates a new file descriptor that represents the connection.

If your LPAR is on the Big Bad Internet, you might have been under attack.

In either case it looks as if the the device manager forgets to close the socket after use. This leads to resource leakage, but as you are allowed a fair number of open file descriptors, this would not show up in normal testing.

It is rater weird that you say blank screen. Presumably you mean that no screen output is received.

On 09/26/2016 04:10 PM, Jim wrote:

Today I attempted to connect to my LPAR which had been running around 5 days and found the 3270 emulator was presenting a blank screen only. I logged on to the system's master console which appeared to be unresponsive. I then logged on to the Hercules console and found it swamped with "HHC90509D COMM: accept() failed: Too many open files" messages that were scrolling off the screen faster than I could see. I can't find any hits at all for what this message may mean and at this point could not do anything but quit Hercules as it's essentially unusable in this state. Unfortunately I didn't write to a log file this time but will do so when I restart it and add that info to this issue if it happens again.

I'm running on Ubuntu 16.04 and a build of Hercules from last week.

lnlyssg commented 7 years ago

Aha, "big bad internet" may indeed be the issue. I had port 23 forwarded to the internal port to try and get round my office's firewall that restricts "non-standard" ports for Telnet. I'll switch that off and hopefully it won't re-occur.

And yes, blank screen = no output. I was connected OK but nothing was showing in the 3270 emulator at all.

I'll consider this closed. Thanks for the pointer.

jphartmann commented 7 years ago

Please open an issue in the tracker. There is also a bug in Hercules.

On 09/26/2016 05:02 PM, Jim wrote:

Aha, "big bad internet" may indeed be the issue. I had port 23 forwarded to the internal port to try and get round my office's firewall that restricts "non-standard" ports for Telnet. I'll switch that off and hopefully it won't re-occur.

And yes, blank screen = no output. I was connected OK but nothing was showing in the 3270 emulator at all.

I'll consider this closed. Thanks for the pointer.

Fish-Git commented 7 years ago

In either case it looks as if the the device manager forgets to close the socket after use.

Prove it.

Please open an issue in the tracker. There is also a bug in Hercules.

Where?

jphartmann commented 7 years ago

The man is out of file descriptors. And the 3270 server cannot accept connections. Either it is being drowned or there is a bug retrying accept().

Fish-Git commented 7 years ago

The man is out of file descriptors. And the 3270 server cannot accept connections.

I understand that.

But you're claiming it's Hercules's fault. You're claiming Hercules is not closing its sockets. I'm asking you to prove that that is indeed what is actually happening. A simple review of the code reveals all sockets that the console code opens eventually get closed. The logic is not that hard to follow. It's very simple and straightforward.

You say "it looks as if..." (the the device manager forgets to close the socket after use), but I suspect you're basing such a claim solely on the reported behavior and not on a simple review of the code. A simple review of the code should tell you right away the code is fine. It could well be a bug in Ubuntu after all:

http://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit

Based on the above web page, if it's indeed Hercules that is leaking sockets then it should be fairly trivial to prove that via lsof, yes? I'm simply asking for that proof. I don't work on guesswork or pure speculation. I deal with facts. You haven't proven your claimed fact.

Either it is being drowned or there is a bug retrying accept().

If there's a bug retrying the accept then it's the accept API that's at fault, not Hercules.

As far as being drowned, yes, it's obvious that is occurring. The telnet socket close logic attempts to perform a "graceful close" of the socket as per accepted socket programming standards by reading from the socket (and discarding whatever was read) until EOF occurs (indicating the other side has closed the connection). Once the other side has closed the connection then telnet will close the socket on its end. That's called a graceful close.

It only loops, reading from the socket and waiting for the other side to close the connection (i.e. EOF) for one second maximum. If the other side does not close its end of the connection then telnet gives up and forcefully closes the socket. You can see this in the telnet_closesocket function.

So if the other side is never closing their socket, then there will be a one second delay before Hercules's socket will be eventually closed. During that one second a malicious actor could easily flood your port with more connection requests than there are sockets (or file descriptors) available for Hercules to use, thus leading to the "Too many open files" error.

But that's not Hercules's fault, and there's not much of anything that Hercules can do about it either (other than to drastically reduce the LIBTN_GRACEFUL_SOCKCLOSESECS value, but then even that might not be enough to resolve the problem either so I wouldn't recommend even bothering).

Bottom line: this does not IMHO appear to be a Hercules bug!

If you can manage to provide convincing proof that this is indeed a Hercules bug, then feel free to re-open this issue by all means.

But until convincing proof to the contrary is provided, this issue should remain closed.

ivan-w commented 7 years ago

Are you guys going to open and close this issue over and over again ?

On 9/28/2016 10:33 PM, Fish-Git wrote:

Closed #156 https://github.com/hercules-390/hyperion/issues/156.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hercules-390/hyperion/issues/156#event-806061797, or mute the thread https://github.com/notifications/unsubscribe-auth/ABjMW0yd08EF-eSDlfjDNuUJPpIAzG6rks5qus81gaJpZM4KGki7.

ivan-w commented 7 years ago

The issue is that any single process "telnet" server expecting a complete "telnet" protocol exchange of do/dont/will/wont is open to a DoS attack leading to a FD shortage (or a temporary TCP port shortage - which can become a system wide problem)

That's not a "bug" per-se, but it's something that needs to be dealt with (especially timeouts ?).

Also another issue is repetitive identical log messages which tend to flood the console. (10001 times the same message should be identified as : * Same message issued 10000 times * ... Or something in the idea) whoch becomes a usability issue under some circumstances.

I wouldn't "close" the issue because there are actually TWO issues there

--Ivan

On 9/28/2016 10:33 PM, Fish-Git wrote:

Closed #156 https://github.com/hercules-390/hyperion/issues/156.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hercules-390/hyperion/issues/156#event-806061797, or mute the thread https://github.com/notifications/unsubscribe-auth/ABjMW0yd08EF-eSDlfjDNuUJPpIAzG6rks5qus81gaJpZM4KGki7.

jphartmann commented 7 years ago

Bless your heart, Ivan,

ivan-w commented 7 years ago

1) We could probably limit the number of accept() up to the sum of 3215+3270 number of devices available. 2) "libtelnet" should have some sort of timeout safeguard between (Do/Will, Want/Wont request/responses) - and close anything that doesn't respond in time (it may require some option if someone wants to implement tn3270 over an RFC1149 implementation of IP).

It's a proposal.... I have no idea how to implement this.

As far as repeating log messages, which may flood the console, I think we have enough infrastructure to deal with it - because it has a lot of abstraction into it

--Ivan

lnlyssg commented 7 years ago

Checking back in....in the interests of science I re-enabled port forwarding and set Hercules to write to a log file. Unfortunately said log file filled up my SSD and is almost 50GB thanks to the repeated "too many open files" messages! Would an abbreviated version be of any use?

jphartmann commented 7 years ago

Presumably the interesting bit is at the beginning?

lnlyssg commented 7 years ago

I've jumped to the point where everything started to fall over. There are countless repeated messages with the 10:20:41 timestamp (well over 25,000 of them...)

10:17:33 HHC02915I client 13560 COMM: Connection received 10:17:43 HHC02915I client 13561 COMM: Connection received 10:18:04 HHC02908E client 13560 COMM: Connection closed during negotiations 10:18:04 HHC02915I client 13562 COMM: Connection received 10:18:10 HHC02915I client 13563 COMM: Connection received 10:18:14 HHC02915I client 13564 COMM: Connection received 10:18:35 HHC02908E client 13562 COMM: Connection closed during negotiations 10:18:35 HHC02915I client 13565 COMM: Connection received 10:18:41 HHC02908E client 13563 COMM: Connection closed during negotiations 10:18:45 HHC02915I client 13566 COMM: Connection received 10:19:06 HHC02908E client 13565 COMM: Connection closed during negotiations 10:19:06 HHC02915I client 13567 COMM: Connection received 10:19:16 HHC02915I client 13568 COMM: Connection received 10:19:32 HHC02915I client 13569 COMM: Connection received 10:19:37 HHC02908E client 13567 COMM: Connection closed during negotiations 10:19:37 HHC02915I client 13570 COMM: Connection received 10:19:47 HHC02915I client 13571 COMM: Connection received 10:20:04 HHC02908E client 13569 COMM: Connection closed during negotiations 10:20:08 HHC02915I client 13572 COMM: Connection received 10:20:08 HHC02908E client 13570 COMM: Connection closed during negotiations 10:20:08 HHC02915I client 13573 COMM: Connection received 10:20:18 HHC02915I client 13574 COMM: Connection received 10:20:38 HHC02908E client 13572 COMM: Connection closed during negotiations 10:20:39 HHC02915I client 13575 COMM: Connection received 10:20:39 HHC02908E client 13573 COMM: Connection closed during negotiations 10:20:39 HHC02915I client 13576 COMM: Connection received 10:20:41 HHC90509D COMM: accept() failed: Too many open files 10:20:41 HHC90509D COMM: accept() failed: Too many open files 10:20:41 HHC90509D COMM: accept() failed: Too many open files

jphartmann commented 7 years ago

It would be interesting to see the beginning too.

It does look as if connexions 13561, 13564, 13566, 13568, and so on are not closed. Thus, the leakage builds up by connexions that stall without completing negotiation.

Any idea what/who is flooding you?

On line 3374 of console.c the author tries to quench too many messages, but on line 3421, this seems to have been forgotten.

Still, suppressing the message, but still looping, is close to shooting the messenger. Looping blindly on accept() errors may not be the best strategy for resilient code.

Fish-Git commented 7 years ago

On line 3374 of console.c the author tries to quench too many messages, but on line 3421, this seems to have been forgotten.

The message flood issue can occur anywhere in Hercules code and it might be better to try and deal with it in one place within our message handling code as Ivan suggested, rather than at each individual location such as line 3374 of console.c. To address each location individually would only make Hercules code more complicated and harder to maintain.

It does look as if connexions 13561, 13564, 13566, 13568, and so on are not closed. Thus, the leakage builds up by connexions that stall without completing negotiation.

It looks that way, yes. But adding specialized timeout logic to libtelnet to try and mitigate attacks on the host system running Hercules does not, to me, seem proper (and probably much more work than it's worth given the low likelihood of it occurring under normal circumstances).

Instead, the system or network administrator of the host system running Hercules should be responsible for preventing/mitigating suck attacks on Hercules, not Hercules itself. Hercules is simply a casualty. It is the system/network admin who should be the one responsible for preventing such attacks IMO, rather than Hercules having specialized code to try and mask them or otherwise have to deal with them.

lnlyssg commented 7 years ago

Any idea what/who is flooding you?

I haven't had port 23 open on my router before now but I assume it's botnets/script kiddies looking for open Telnet servers.

Instead, the system or network administrator of the host system running Hercules should be responsible for preventing/mitigating suck attacks on Hercules, not Hercules itself. Hercules is simply a casualty. It is the system/network admin who should be the one responsible for preventing such attacks IMO, rather than Hercules having specialized code to try and mask them or otherwise have to deal with them.

I agree and this is clearly a fringe case so I'm not overly bothered about running out of connections, however the message flood is the biggest issue as it leaves Hercules dead in the water and I can't then cleanly shut down as even the master console is unresponsive.

Fish-Git commented 7 years ago

... however the message flood is the biggest issue as it leaves Hercules dead in the water and I can't then cleanly shut down as even the master console is unresponsive.

Understood. I'm looking into it.

Fish-Git commented 7 years ago

Upon further research it is my opinion that implementing a more proper "(previous message repeated %d times)" duplicate-message message-flood protection technique within logger.c's logger_thread function would be more trouble than it would be worth (IMHO).

I have therefore taken the lazy "quick fix" approach instead by implementing the exact same "throttle error message rate" technique already being used for pselect errors.