tagomoris / fluent-plugin-secure-forward

Secure input performance and limits #57

Open jonapich opened 7 years ago

jonapich commented 7 years ago

I've been troubleshooting an issue for the past few weeks and I'd like to share some conclusions/questions.

We wrap Fluentd in docker using fluent/fluentd:v0.12-latest-onbuild. The aggregator's configuration is rather simple:

<source>
  @type secure_forward
  self_hostname ...
  cert_path ...
  private_key_path ...
  private_key_passphrase ...
  shared_key ...
  port 24284
  secure yes
  skip_invalid_event yes
  @label @elasticsearch
</source>

<label @elasticsearch>
  <filter **>
    type tidyup
    target_index_key target_index
    time_key @timestamp
  </filter>
  <match **>
    @type elasticsearch
    (redacted for brevity - 10m max file buffers and flushes every 7s)
  </match>
</label>

The tidyup script is also very simple:

    # Filter method from the tidyup plugin (Fluentd v0.12 filter API).
    def filter(tag, time, record)
      timestamp = Time.at(time).to_datetime

      # Build the target index name from the tag plus a year-month-week
      # suffix, replacing dots and slashes with dashes.
      record[@target_index_key] = [tag, timestamp.strftime('%Y-%m-%W')].join("-").gsub(/[.\/]/, '-')

      # Normalize the time field to ISO 8601 with millisecond precision,
      # falling back to the event time when the field is missing.
      if record.has_key? @time_key
        record[@time_key] = parse_time(record[@time_key], time, tag).iso8601(3)
      else
        record[@time_key] = timestamp.iso8601(3)
      end

      return record
    end
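
For context, @target_index_key and @time_key presumably come from config_param declarations in the plugin class. The skeleton below is an assumption about how the real tidyup script is wired up (not its actual source), and parse_time is taken to be a helper defined elsewhere in that script:

    module Fluent
      class TidyupFilter < Filter
        # Registers the plugin under the name used by "type tidyup" above.
        Fluent::Plugin.register_filter('tidyup', self)

        # Configuration keys referenced by the filter method quoted above.
        config_param :target_index_key, :string, default: 'target_index'
        config_param :time_key, :string, default: '@timestamp'

        # def filter(tag, time, record) ... end  (as shown above)
      end
    end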

The server is dedicated to Fluentd. For my tests, I used an r4.large instance from AWS: 15 GB of RAM, a decent CPU (2 cores), and better-than-average network bandwidth.

Under normal operation, there are about 350 Fluentd threads and CPU usage sits around 30%.

I have stress-tested this server by creating tons of idle connections from my developer machine (and some through AWS Lambda too). When the thread count reaches 800+, clients start receiving timeout errors and are eventually unable to forward logs to the aggregator. Apparently, the Elasticsearch output plugin also halts, and the CPU hits and sustains 100% (on the main Fluentd process). To be extra clear: none of my idle connections actually send logs. They just perform the handshake and shared key exchange, and then they sit idle. The load balancer eventually kicks the instance out since it cannot reply to the health check in a timely manner.

I confirmed that the load balancer is not part of the problem. If anything, I was able to reduce the load on the server by configuring the load balancer to kill any idle connection after 10 seconds, but this forces clients to perform a lot more handshakes and is not ideal from my point of view.

I also isolated a server just for the tests (meaning the 350 real clients are not hitting it at all, and there is no load balancer either). I created about 2k idle connections, then tried sending logs with one of them, and received timeouts about 80% of the time (I tried 5- and 10-second timeouts; it didn't really affect the results).
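
For reference, here is roughly how the idle connections were created. This is a simplified sketch: the hostname is a placeholder, certificate verification is disabled, and the secure-forward shared-key exchange that the real test clients performed is omitted; it just opens plain TLS handshakes and leaves them idle.

    require 'socket'
    require 'openssl'

    HOST = 'fluentd-aggregator.example.com'   # placeholder endpoint
    PORT = 24284

    sockets = []
    2_000.times do
      tcp = TCPSocket.new(HOST, PORT)
      ctx = OpenSSL::SSL::SSLContext.new
      ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE   # test only: skip cert checks
      ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
      ssl.connect                                   # TLS handshake, then nothing
      sockets << ssl
    end

    sleep                                           # hold the connections open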

A few weeks prior, we isolated every Fluentd plugin in its own process; that's how I know the secure input plugin is the culprit. The Elasticsearch output was in its own process and never showed any sign of fatigue or elevated CPU usage, and it doesn't spawn any threads other than the ones created at init.

So, my conclusion is that a connection to the secure input plugin, even a completely idle one, is very expensive, and too many of them can bring a server down. Is this expected? Are you surprised by these performance issues, or is this about right?

I'd be interested to try this scenario outside of docker if you think the behavior I'm seeing is wrong...

I have graphed buffer size vs. thread count for the server during my tests; traffic went through the load balancer, which was configured to kill any idle connection after 5 minutes:

[graph: buffer size vs. thread count during the test]

The normal thread count is about 350, and that was true around 15:15 (not on the graph). Then I started flooding the server (the red spikes that disappear are AWS Lambdas hard-stopping after 5 minutes), so dead connections are correctly being cleaned up, which is great. At the same time, my local dev machine was creating connections like crazy and also trying to push logs. The green line shows the buffer size on disk. Typically it moves all the time, but you can see on the graph that at 15:40 it pretty much stopped moving: the Elasticsearch plugin stopped forwarding logs, and the secure input stopped accepting logs from clients too. When I killed all my tests around 15:50, Elasticsearch started working again, since the buffer size starts moving. The red line drops slowly; that's the 350 real clients trying to reconnect and send their backlogs! It then took more than 30 minutes to return to normal operation.

To reduce the load, we're changing our clients' exception handling to tolerate a certain number of timeouts before recreating a connection (they will also be rate-limited in how many connections they're allowed to create over time).
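
A rough sketch of that client-side policy (the class name and thresholds are made up for illustration): only rebuild the connection after several consecutive timeouts, and never create connections faster than a fixed rate.

    class TolerantForwarder
      MAX_TIMEOUTS        = 3    # consecutive timeouts tolerated before reconnecting
      MIN_RECONNECT_GAP_S = 60   # at most one new connection per minute

      def initialize(&connect)
        @connect = connect       # block that builds and returns a new connection
        @timeouts = 0
        @last_connect_at = Time.at(0)
        @conn = nil
      end

      def send_record(payload)
        @conn ||= reconnect
        @conn.write(payload)
        @timeouts = 0
      rescue Errno::ETIMEDOUT, IOError
        @timeouts += 1
        if @timeouts >= MAX_TIMEOUTS
          @conn = nil            # drop the connection; rebuild on the next call
          @timeouts = 0
        end
      end

      private

      def reconnect
        gap = Time.now - @last_connect_at
        sleep(MIN_RECONNECT_GAP_S - gap) if gap < MIN_RECONNECT_GAP_S
        @last_connect_at = Time.now
        @connect.call
      end
    end

Usage would be something like TolerantForwarder.new { build_connection }, where the block wraps whatever client library is actually in use.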

Thank you! And sorry for the long post :)

jonapich commented 7 years ago

This morning I set up a Fluentd server that uses in_forward instead of secure_forward and tried flooding it again with idle connections.

The first thing I noticed is that the thread count is stable, regardless of the number of connections.

The second thing I noticed: with secure_forward, creating 100 idle connections creates about 16k file descriptors on the host machine. In comparison, 100 idle connections using in_forward create about 700 file descriptors... Is that expected?
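
For anyone wanting to reproduce the file-descriptor numbers, they can be read straight from /proc on the host (assuming Linux; FLUENTD_PID is a placeholder for the aggregator's process id):

    # Count open file descriptors of the fluentd process before and after
    # opening 100 idle connections.
    pid = ENV.fetch('FLUENTD_PID')
    count_fds = -> { Dir.glob("/proc/#{pid}/fd/*").size }

    before = count_fds.call
    # ... open 100 idle connections here (see the earlier sketch) ...
    after = count_fds.call

    puts "file descriptors added by 100 connections: #{after - before}"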

As far as the SSL overhead goes, creating 100 connections takes me less than 4 seconds over TCP but close to 35 seconds over SSL. After some tracing, I can say that my hash/salt/etc. routines take up about 0.065 seconds. Receiving the pong response seems to be the longest operation (0.2 seconds). That's probably normal, but I'll let you be the judge :)
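
The TCP-vs-SSL timing can be reproduced with a small benchmark along these lines (hostname is a placeholder and certificate verification is disabled; only connection setup is measured, not the secure-forward handshake):

    require 'socket'
    require 'openssl'
    require 'benchmark'

    HOST = 'fluentd-aggregator.example.com'   # placeholder endpoint
    PORT = 24284
    N    = 100

    tcp_time = Benchmark.realtime do
      N.times { TCPSocket.new(HOST, PORT).close }
    end

    tls_time = Benchmark.realtime do
      N.times do
        tcp = TCPSocket.new(HOST, PORT)
        ctx = OpenSSL::SSL::SSLContext.new
        ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE   # test only
        ssl = OpenSSL::SSL::SSLSocket.new(tcp, ctx)
        ssl.connect    # the extra cost is the TLS handshake itself
        ssl.close
        tcp.close
      end
    end

    puts format('TCP: %.2fs, TLS: %.2fs for %d connections', tcp_time, tls_time, N)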

tagomoris commented 7 years ago

Thank you for benchmarking & detailed report!

> So, my conclusion is that a connection to the secure input plugin, even a completely idle one, is very expensive, and too many of them can bring a server down. Is this expected?

It is completely expected. The secure forward input plugin creates a thread for every connection, and those threads share just one CPU core because of CRuby's global VM lock. Because of this implementation, the secure input plugin has a limitation on scaling (especially in the number of connections).
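
A toy illustration of that model (not the plugin's actual code): each accepted client gets its own Ruby thread, and under CRuby's GVL all of those threads take turns on a single core, whether they are busy or idle.

    require 'socket'

    server = TCPServer.new(24284)
    loop do
      client = server.accept
      Thread.new(client) do |conn|
        # The thread lives for the whole lifetime of the connection,
        # even while the client sends nothing at all.
        while (line = conn.gets)
          # handshake / record processing would happen here (omitted)
        end
        conn.close
      end
    end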

On the other hand, Fluentd v0.14's TLS support uses asynchronous I/O instead of threading, so that implementation scales to far more connections than the secure forward input plugin.
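
For contrast, here is a toy single-threaded event loop using IO.select. Fluentd v0.14's server helpers are actually built on Cool.io rather than this exact code, but the idea is the same: an idle connection costs a socket and some buffer space, not a dedicated thread.

    require 'socket'

    server  = TCPServer.new(24284)
    clients = []

    loop do
      readable, = IO.select([server, *clients])
      readable.each do |io|
        if io == server
          clients << server.accept               # new connection, no new thread
        else
          data = io.read_nonblock(16_384, exception: false)
          case data
          when :wait_readable then next          # nothing to read right now
          when nil                               # peer closed the connection
            clients.delete(io)
            io.close
          else
            # process received bytes here (omitted)
          end
        end
      end
    end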

tagomoris commented 7 years ago

I didn't expect the increase in the number of file handles, but I'm not surprised by it... TLS transfer uses, as you know, many resources and different types of buffers internally.

The longer time to establish 100 connections over TLS than over TCP is also expected. I suppose that TLS connection establishment (which happens BEFORE the secure-forward handshake, outside of this plugin's Ruby code) takes much longer than plain TCP. Once a TLS connection is established, sending/receiving bytes over it doesn't take much longer than over a TCP connection.

jonapich commented 7 years ago

Thanks,

We're very eager to try out v0.14 once it's out of development. Until then, we'll modify our clients to be more tolerant of connection issues (i.e. limit the number of connections they can create per minute), and we'll spawn more servers to handle the load.