tagomoris / fluent-plugin-secure-forward

Multiple destinations with out_copy - cert verification fails #42

Open dstoy53 opened 8 years ago

dstoy53 commented 8 years ago

I couldn't find any previous posts about the issue I'm experiencing, so I'm hoping to find out if it's a bug or a PEBKAC situation.

I'm attempting to securely forward logs in the following manner:

  1. Logs hit internal server "syslog01" using in_syslog and get tagged
  2. Logs are forwarded from "syslog01" to "fluentd01" using out_secure
  3. On "fluentd01" I have a match statement using out_copy, with statements for the following destinations:
     a. One copy is being stored locally for debug purposes until everything works as intended
     b. One copy is being forwarded to "splunkfwd01" via out_secure
     c. One copy is supposed to be forwarded to "efk01" via out_secure
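
For readers without the attachment, the step 3 layout on "fluentd01" would look roughly like the sketch below. This is not the attached config: the tag pattern, file paths, and key values are placeholders chosen for illustration, and the parameter names follow the plugin's README.

```
# Sketch only: tag pattern, paths and secrets are placeholders (assumptions)
<match syslog.**>
  @type copy

  # a. local copy kept for debugging
  <store>
    @type file
    path /var/log/td-agent/debug/syslog
  </store>

  # b. forward to splunkfwd01, verified against its own CA cert
  <store>
    @type secure_forward
    shared_key SHARED_KEY_PLACEHOLDER
    self_hostname fluentd01
    secure yes
    ca_cert_path /etc/td-agent/certs/splunkfwd01/ca_cert.pem
    <server>
      host splunkfwd01
    </server>
  </store>

  # c. forward to efk01, verified against a different CA cert
  <store>
    @type secure_forward
    shared_key SHARED_KEY_PLACEHOLDER
    self_hostname fluentd01
    secure yes
    ca_cert_path /etc/td-agent/certs/efk01/ca_cert.pem
    <server>
      host efk01
    </server>
  </store>
</match>
```

The point of interest is that each secure_forward store names its own ca_cert_path, which is the combination that fails below.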

The issue that I'm running into is that if I have both of the out_secure destinations in the config file at step 3, only the first one is able to establish the SSL connection. The second one errors out by failing the SSL verification.

If I comment out one of the 2 out_secure destinations and forward the logs only to one destination at a time (either "splunkfwd01" or "efk01") the logs are forwarded successfully. This tells me my certs/shared secret/passphrase are accurate for either combination of fluentd01->splunkfwd01 or fluentd01->efk01. I am using a separate cert/key pair for each connection.

2016-07-20 15:44:12 -0400 [warn]: failed to establish SSL connection error_class=OpenSSL::SSL::SSLError error=#<OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=error: certificate verify failed> host="10.10.10.54" address="10.10.10.54" port=24285

(I tried using a different port for "efk01", 24285, as a troubleshooting step)

To sum it up, the behavior I'm seeing is that for a given out_copy match I can only use one out_secure store at a time.

Am I just missing something blatantly obvious?

Edit: "fluentd01" is running fluentd 0.12.20, while "splunkfwd01" and "efk01" are running fluentd 0.12.26. All servers are running 'fluent-plugin-secure-forward' version '0.4.2'.

Edit2: The certificates were generated using 'secure-forward-ca-generate' from this plugin. They work fine with a single connection from either fluentd01->splunkfwd01, or for a single connection from fluentd01->efk01. The issue only occurs when I try to forward traffic to both splunkfwd01 and efk01 under the same match statement with out_copy.

tagomoris commented 8 years ago

This is the first time I've heard of such behavior... I have no idea about the root cause for now. Could you paste your configuration from the "fluentd01" node? (Of course, without your secret values)

dstoy53 commented 8 years ago

I've attached the sanitized config for fluentd01. For fluentd01->splunkfwd01 the traffic is using the default 24284 port, and for fluentd01->efk01 I hard-set both sides to 24285 as a troubleshooting attempt. FirewallD is running on all VMs with the appropriate ports opened and listening. All of the VMs are running CentOS7 with selinux set to enforcing. All VMs are on the same subnet in a test environment, and I'm using the same shared secret and passphrase throughout the environment (the shared secret itself is a different value than the passphrase however).
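
As a rough illustration of "hard-set both sides to 24285", the receiving side on "efk01" and the matching server entry on "fluentd01" might look like the sketch below. It assumes the input uses the CA cert/key pair produced by secure-forward-ca-generate; the paths and secrets are placeholders, not values from the attachment.

```
# efk01, receiving side (sketch; paths and secrets are placeholders)
<source>
  @type secure_forward
  port 24285
  shared_key SHARED_KEY_PLACEHOLDER
  self_hostname efk01
  secure yes
  ca_cert_path /etc/td-agent/certs/efk01/ca_cert.pem
  ca_private_key_path /etc/td-agent/certs/efk01/ca_key.pem
  ca_private_key_passphrase PASSPHRASE_PLACEHOLDER
</source>

# fluentd01, inside the secure_forward output that targets efk01
<server>
  host 10.10.10.54
  port 24285
</server>
```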

Ultimately with this exact config, the relevant logs are received via in_secure at fluentd01, then stored locally and forwarded successfully to splunkfwd01 - but they are not being forwarded to efk01 due to the error I posted from the logs. If I comment out the store statement related to splunkfwd01, then my logs have no issues reaching efk01.

Thank you for looking into this.

fluentd01_sanitized.txt

tagomoris commented 8 years ago

Hmm, your configuration looks correct; I see no obvious problems, at least. OK, I'll investigate the next time I have enough time for it. It might take a while... sorry about that.

dstoy53 commented 8 years ago

I did some more testing and removed out_copy from the equation. Now I'm just using 2x match statements.
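
Roughly, the variant without out_copy looks like the sketch below. fluentd routes each event to the first <match> whose pattern fits, so the two blocks need distinct tag patterns. The patterns, paths, and keys here are placeholders, not the values from my environment.

```
# Sketch only: tag patterns, paths and secrets are placeholders (assumptions)
<match splunk.**>
  @type secure_forward
  shared_key SHARED_KEY_PLACEHOLDER
  self_hostname fluentd01
  secure yes
  ca_cert_path /etc/td-agent/certs/splunkfwd01/ca_cert.pem
  <server>
    host splunkfwd01
  </server>
</match>

<match efk.**>
  @type secure_forward
  shared_key SHARED_KEY_PLACEHOLDER
  self_hostname fluentd01
  secure yes
  ca_cert_path /etc/td-agent/certs/efk01/ca_cert.pem
  <server>
    host efk01
    port 24285
  </server>
</match>
```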

Scenario 1: fluentd01 -> splunkfwd01 - logs are forwarded, no errors

Scenario 2: fluentd01 -> efk01 - logs are forwarded, no errors

Scenario 3: fluentd01 -> splunkfwd01 AND efk01 - certificate verification for the second match statement fails (efk01 in this case)

Scenario 4: fluentd01 -> splunkfwd01 AND efk01 while using the certificate from efk01 on all 3 servers - everything works with no errors

Scenario 5: Same parameters as Scenario 4, except for the second match statement I set the 'ca_cert_path' to fluentd01's own cert file which is not present on either splunkfwd01 or efk01 - everything works fine with no errors

Based on these results the problem is occurring because I'm trying to use a different certificate for each connection. It seems that the plugin uses the same certificate from the first match statement on any subsequent match statements (at least that's what Scenario 5 leads me to believe).

So if I use the same certificate throughout my environment I would have no issues forwarding the traffic, at the expense of security.

tagomoris commented 8 years ago

I tested your situation with the configurations cert_copy_client, cert_copy_server_a and cert_copy_server_b in the branch of pull-request below: https://github.com/tagomoris/fluent-plugin-secure-forward/pull/45

As a result, 2 different CA certs in 2 <store> sections of a copy plugin work well in my environment.

2016-07-29 13:36:24 +0900 [info]: using configuration file: <ROOT>
  <source>
    @type forward
  </source>
  <match test.**>
    @type copy
    <store>
      @type "secure_forward"
      secure yes
      self_hostname "client"
      shared_key xxxxxx
      ca_cert_path "/Users/tagomoris/github/fluent-plugin-secure-forward/example/cacerts1/ca_cert.pem"
      enable_strict_verification yes
      flush_interval 1s
      <server>
        host "localhost"
        port 24284
        hostlabel "server_a.local"
      </server>
      <buffer tag>
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 1s
      </buffer>
    </store>
    <store>
      @type "secure_forward"
      secure yes
      self_hostname "client"
      shared_key xxxxxx
      ca_cert_path "/Users/tagomoris/github/fluent-plugin-secure-forward/example/cacerts2/ca_cert.pem"
      enable_strict_verification yes
      flush_interval 1s
      <server>
        host "localhost"
        port 24285
        hostlabel "server_a.local"
      </server>
      <buffer tag>
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 1s
      </buffer>
    </store>
  </match>
</ROOT>
2016-07-29 13:36:24 +0900 [info]: listening fluent socket on 0.0.0.0:24224
2016-07-29 13:36:24 +0900 [info]: connection established to localhost
2016-07-29 13:36:24 +0900 [info]: connection established to localhost

tagomoris commented 8 years ago

@l-53 Can you try these configurations and CA cert files in your environment?

dstoy53 commented 8 years ago

No luck with the new certificates either.

Here are my current destinations from fluentd01:

  1. 10.10.10.52 - influxdb01 (using your cacert1 cert/key/psk/passphrase)
  2. 10.10.10.54 - efk01 (using your cacert2 cert/key/psk/passphrase)

So far it seems that every time I reload td-agent, the ssl session is established to either influxdb01 or efk01, but only one server at a time while the other one fails with the same certificate verification error.

If I comment out the config for influxdb01 I have no issues connecting to efk01. If I comment out the config for efk01 I have no issues connecting to influxdb01.

This tells me I didn't typo/mismatch the cert/psk/passphrases anywhere. The only notable differences I can spot between your test and my environment are that you tested on the same host rather than across servers, and that your config uses strict verification.

I've attached my relevant sanitized config.

efk01_sanitized.txt fluentd01_sanitized.txt influxdb01_sanitized.txt

tagomoris commented 8 years ago

@l-53 Could you paste your logs of fluentd01? (debug(-v) or trace(-vv) logs if possible)

dstoy53 commented 8 years ago

I've attached the sanitized logs with -vv. The "SSLErrorWaitReadable" error for the successfully established connection only shows up with -vv.

fluentd01_logs_sanitized.txt

tagomoris commented 8 years ago

@l-53 Hmm, the log looks normal (and it does appear to report a certificate error). I'm very sorry to bother you, but could you upload 2 more logs: one with the 1st <store> section commented out, and one with the 2nd <store> section commented out?

dstoy53 commented 8 years ago

I've attached the logs from fluentd01 with one section commented out at a time.

fluentd_to_efk.txt fluentd_to_influx.txt

tagomoris commented 8 years ago

Super weird... Could you tell me your versions of Ruby, td-agent (if you are using it) and OpenSSL on fluentd01?

dstoy53 commented 8 years ago

fluentd01's version-manifest.txt:
  ruby: 2.1.8 (embedded, no ruby installed on the system itself)
  td-agent: 2.3.1
  openssl: 1.0.1r

influxdb01's version-manifest.txt:
  ruby: 2.1.10 (embedded, no ruby installed on the system itself)
  td-agent: 2.3.2
  openssl: 1.0.1t

efk01's version-manifest.txt:
  ruby: 2.1.10 (embedded, no ruby installed on the system itself)
  td-agent: 2.3.2
  openssl: 1.0.1t

I agree it's strange, that's why I initially thought I must've been doing something wrong along the way.