meadmaker closed this issue 6 years ago
I can turn down the logging on those messages to "debug" instead of "err" and "notice".
I think the incoming connections are unlikely to come from a trust router - though it might be possible if the trp_connect_interval were set to 0. They default to 10 seconds or more, so this would require a configuration to explicitly set them to something ridiculous. I will go ahead and add validation to ensure that the various intervals are set to reasonable values.
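To illustrate the kind of validation I have in mind, here's a minimal sketch in Python (the actual trust router is C, and apart from `trp_connect_interval` the option names and the 5-second floor below are illustrative assumptions, not the real config code):

```python
# Hypothetical sketch of interval validation. Only trp_connect_interval
# is named in the discussion above; the other keys and the minimum are
# placeholders for "the various intervals".
MIN_INTERVAL = 5  # seconds; anything lower is treated as misconfigured

def validate_intervals(cfg):
    """Return warnings for any interval set to an unreasonable value."""
    warnings = []
    for key in ("trp_connect_interval", "trp_update_interval"):
        value = cfg.get(key)
        if value is not None and value < MIN_INTERVAL:
            warnings.append(
                "%s = %d is below the minimum of %ds" % (key, value, MIN_INTERVAL)
            )
    return warnings

print(validate_intervals({"trp_connect_interval": 0}))
print(validate_intervals({"trp_connect_interval": 10}))
```

With `trp_connect_interval = 0` this reports a warning; with the 10-second default it reports none, so a node can't be configured to hammer its peers with connection attempts.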
Sound reasonable?
I've implemented the above changes in #92.
I don't know if there's a great way to test this. The particular error above (errno = 24, i.e. EMFILE) happens when the process has hit its cap on open file descriptors. To reproduce it, we can probably lower that cap to a small number with ulimit -n (though that may interfere with trust router operation; it can probably be tweaked to work). The goal is that a failed accept() should not emit a log message unless logging is set to "debug".
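As a standalone illustration of hitting errno 24, assuming a POSIX system, the same effect as ulimit -n can be produced in-process by lowering RLIMIT_NOFILE and opening descriptors until the limit is hit:

```python
import errno
import os
import resource

# Lower the soft limit on open file descriptors, like `ulimit -n 32`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (32, hard))

fds = []
errno_seen = None
try:
    while True:
        fds.append(os.open(os.devnull, os.O_RDONLY))
except OSError as e:
    errno_seen = e.errno  # EMFILE: "Too many open files"
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(errno_seen == errno.EMFILE)
```

On Linux, EMFILE is numeric value 24, matching the "err = 24" in the log above; any syscall that allocates a descriptor, including accept(), fails the same way once the cap is reached.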
This at least tests that we don't add log entries for failed connections except in debug mode.
Log messages at debug level:
==> default.log.201805312043 <==
tr_sock_accept: Incoming connection on fd 19 from 23.30.188.246
==> /var/log/messages <==
May 31 20:58:32 ip-172-31-47-134 trust_router[1431]: tr_sock_accept: Incoming connection on fd 19 from 23.30.188.246
==> default.log.201805312043 <==
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
tr_trps_process_mq: incoming connection failed to auth.
==> /var/log/messages <==
May 31 20:58:33 ip-172-31-47-134 trust_router[1431]: trps_authorize_connection: failed to authorize connection
May 31 20:58:33 ip-172-31-47-134 trust_router[1431]: tr_trps_process_mq: incoming connection failed to auth.
Log messages at alert level:
==> default.log.201805312059 <==
tr_sock_accept: Incoming connection on fd 21 from 23.30.188.246
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
tr_trps_process_mq: incoming connection failed to auth.
Looking at the numbers for that: this one aborted connection generates 347 bytes of log data. If we get DoS'd by 100 connections a second, we're generating a GB of log data every 8.6 hours. Do we think that's acceptable?
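For reference, the arithmetic behind that estimate (the 8.6-hour figure comes out when "GB" is read as GiB, i.e. 2^30 bytes):

```python
def hours_per_gib(bytes_per_conn, conns_per_sec):
    """Hours to accumulate one GiB of log data at the given rate."""
    bytes_per_sec = bytes_per_conn * conns_per_sec
    return (2 ** 30) / bytes_per_sec / 3600

# 347 bytes per aborted connection at 100 connections/second
print(round(hours_per_gib(347, 100), 1))  # 8.6
```

The same function gives the later figures in this thread: roughly 17 hours at half the data volume, and about 13 hours at 227 bytes per connection.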
I can confirm that the messages to /var/log/messages do not get generated when the loglevel is set to 'alert'.
Assigning to @jennifer-richards and @mrw42 for opinions on my question about whether generating 347 bytes per aborted connection is acceptable.
It strikes me as borderline tolerable.
I can trivially get rid of the messages in the main TR code (the ones with method name prefixes on the messages). Actually, #92 gets rid of the first one (moves it to debug instead of err priority).
The other messages (ReadBuffer..., ReadToken..., and Authenticate...) are from the gsscon code. These use a separate logging mechanism - there's already an issue to fix this.
My preference would be to reduce the priority of the last couple of messages now. That drops the data quantity by about half, putting the rate at roughly 17 hours per GB at 100 conn/sec, which is perhaps slow enough for an admin to notice. I would prefer to deal with the gsscon logging changes systematically.
I'd also like to schedule an audit of logging priorities throughout the system - it's somewhat haphazard at the moment.
Also, I'm puzzled - the last message ("incoming connection failed to auth") is output with "info" level and has been for a long time - I don't think it should show up at the "alert" level.
I just tried the alert loglevel again, and the "incoming connection failed to auth" line isn't there. Here's what I get instead:
==> /var/log/trust_router/default.log.201806011804 <==
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
That's now 227 bytes per aborted connection, which will take ~13.1 hours to fill up one gigabyte of disk space at a rate of 100 connections/second.
I'm going to mark this as tested and move it to verification. If this becomes a problem again at some other time, we can address it then.
The QA system ran out of disk space over the weekend, in large part due to excessive logging. We're getting around 500 of these messages a second: