meadmaker closed this issue 6 years ago
I can turn down the logging on those messages to "debug" instead of "err" and "notice".
I think the incoming connections are unlikely to come from a trust router - though it might be possible if the trp_connect_interval were set to 0. They default to 10 seconds or more, so this would require a configuration to explicitly set them to something ridiculous. I will go ahead and add validation to ensure that the various intervals are set to reasonable values.
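To illustrate the kind of validation I have in mind, here's a minimal sketch in Python (the actual trust router is C, and apart from `trp_connect_interval` the option names and the 5-second floor below are illustrative assumptions, not the real config code):

```python
# Hypothetical sketch of interval validation. Only trp_connect_interval
# is named in the discussion above; the other keys and the minimum are
# placeholders for "the various intervals".
MIN_INTERVAL = 5  # seconds; anything lower is treated as misconfigured

def validate_intervals(cfg):
    """Return warnings for any interval set to an unreasonable value."""
    warnings = []
    for key in ("trp_connect_interval", "trp_update_interval"):
        value = cfg.get(key)
        if value is not None and value < MIN_INTERVAL:
            warnings.append(
                "%s = %d is below the minimum of %ds" % (key, value, MIN_INTERVAL)
            )
    return warnings

print(validate_intervals({"trp_connect_interval": 0}))
print(validate_intervals({"trp_connect_interval": 10}))
```

With `trp_connect_interval = 0` this reports a warning; with the 10-second default it reports none, so a node can't be configured to hammer its peers with connection attempts.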
Sound reasonable?
I've implemented the above changes in #92.
I don't know if there's a great way to test this. The particular error above (errno = 24, i.e. EMFILE) happens when the process has hit its cap on open file descriptors. To reproduce it, we can probably lower that cap to a small number with ulimit -n (though that may interfere with trust router operation; it can probably be tweaked to work). The goal is that a failed accept() should not emit a log message unless logging is set to "debug".
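As a standalone illustration of hitting errno 24, assuming a POSIX system, the same effect as ulimit -n can be produced in-process by lowering RLIMIT_NOFILE and opening descriptors until the limit is hit:

```python
import errno
import os
import resource

# Lower the soft limit on open file descriptors, like `ulimit -n 32`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (32, hard))

fds = []
errno_seen = None
try:
    while True:
        fds.append(os.open(os.devnull, os.O_RDONLY))
except OSError as e:
    errno_seen = e.errno  # EMFILE: "Too many open files"
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(errno_seen == errno.EMFILE)
```

On Linux, EMFILE is numeric value 24, matching the "err = 24" in the log above; any syscall that allocates a descriptor, including accept(), fails the same way once the cap is reached.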
This at least tests that we don't add log entries for failed connections except in debug mode.
Log messages at debug level:
==> default.log.201805312043 <==
tr_sock_accept: Incoming connection on fd 19 from 23.30.188.246
==> /var/log/messages <==
May 31 20:58:32 ip-172-31-47-134 trust_router[1431]: tr_sock_accept: Incoming connection on fd 19 from 23.30.188.246
==> default.log.201805312043 <==
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
tr_trps_process_mq: incoming connection failed to auth.
==> /var/log/messages <==
May 31 20:58:33 ip-172-31-47-134 trust_router[1431]: trps_authorize_connection: failed to authorize connection
May 31 20:58:33 ip-172-31-47-134 trust_router[1431]: tr_trps_process_mq: incoming connection failed to auth.
Log messages at alert level:
==> default.log.201805312059 <==
tr_sock_accept: Incoming connection on fd 21 from 23.30.188.246
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
tr_trps_process_mq: incoming connection failed to auth.
Looking at the numbers for that: this one aborted connection generates 347 bytes of log data. If we get DoS'd by 100 connections a second, we're generating a GB of log data every 8.6 hours. Do we think that's acceptable?
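For reference, the arithmetic behind that estimate (the 8.6-hour figure comes out when "GB" is read as GiB, i.e. 2^30 bytes):

```python
def hours_per_gib(bytes_per_conn, conns_per_sec):
    """Hours to accumulate one GiB of log data at the given rate."""
    bytes_per_sec = bytes_per_conn * conns_per_sec
    return (2 ** 30) / bytes_per_sec / 3600

# 347 bytes per aborted connection at 100 connections/second
print(round(hours_per_gib(347, 100), 1))  # 8.6
```

The same function gives the later figures in this thread: roughly 17 hours at half the data volume, and about 13 hours at 227 bytes per connection.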
I can confirm that the messages to /var/log/messages do not get generated when the loglevel is set to 'alert'.
Assigning to @jennifer-richards and @mrw42 for opinions on my question about whether generating 347 bytes per aborted connection is acceptable.
It strikes me as borderline tolerable.
I can trivially get rid of the messages in the main TR code (the ones with method name prefixes on the messages). Actually, #92 gets rid of the first one (moves it to debug instead of err priority).
The other messages (ReadBuffer..., ReadToken..., and Authenticate...) are from the gsscon code. These use a separate logging mechanism - there's already an issue to fix this.
My preference would be to reduce the priority of the last couple of messages now. That drops the data quantity by about half, putting the rate at roughly 17 hours per GB at 100 conn/sec, which is perhaps slow enough for an admin to notice. I would prefer to deal with the gsscon logging changes systematically.
I'd also like to schedule an audit of logging priorities throughout the system - it's somewhat haphazard at the moment.
Also, I'm puzzled - the last message ("incoming connection failed to auth") is output with "info" level and has been for a long time - I don't think it should show up at the "alert" level.
I just tried the alert loglevel again, and the "incoming connection failed to auth" line isn't there. Here's what I get instead:
==> /var/log/trust_router/default.log.201806011804 <==
ReadBuffer failed: Connection reset by peer (err = 104)
ReadToken failed: Connection reset by peer (err = 104)
Authenticate failed: Connection reset by peer (err = 104)
trps_authorize_connection: failed to authorize connection
That's now 227 bytes per aborted connection, which will take ~13.1 hours to fill up one gigabyte of disk space at a rate of 100 connections/second.
I'm going to mark this as tested and move it to verification. If this becomes a problem again at some other time, we can address it then.
The QA system ran out of disk space over the weekend, in large part due to excessive logging. We're getting around 500 of these messages a second: