RELP not reconnecting automatically

saliceti commented 7 years ago

We use rsyslog to send Cloud foundry messages to logstash via RELP. This is done via the syslog and logsearch releases. The whole is deployed on AWS and our logstash servers are behind an ELB. We've noticed a problem when the logstash servers disconnect, for example by temporarily removing the ELB backends. When rsyslog reconnects, it tries to open a new RELP session and keeps getting rejected by logstash. This is what we keep seeing in the logstash log:

{:timestamp=>"2016-11-07T14:05:47.456000+0000", :message=>"Relp error: Relp::InappropriateCommand open expecting syslog", :level=>:warn}

It looks like it comes from this line.

From tcpdump:

3 open 86 relp_version=0
relp_software=librelp,1.2.12,http://librelp.adiscon.com
commands=syslog
4 open 86 relp_version=0
relp_software=librelp,1.2.12,http://librelp.adiscon.com
commands=syslog
5 syslog 533 <14>2016-11-07T13:30:48.176713+00:00 10.0.16.11 vcap.nats [job=nats index=0]  {"timestamp":1478525448.176448,"source":"NatsStreamForwarder","log_level":"info","message":"router.register","data":{"nats_message": "{\"host\":\"10.0.32.4\",\"port\":60064,\"uris\":[\"paas-dashboard.colin-oct.dev.cloudpipelineapps.digital\"],\"app\":\"9d74e745-cd32-4f9b-9952-63b6367eb012\",\"private_instance_id\":\"4c95375e-4328-4f2e-56d3-3a09fdeba95d\",\"private_instance_index\":\"0\",\"tags\":{\"component\":\"route-emitter\"}}","reply_inbox":null}}

rsyslog tries to open many connections and consumes a lot of CPU (up to 80%). Restarting rsyslog solves the issue. When we use TCP instead of RELP transport, there is no issue.

We're on logstash 2.3.3 and logstash-input-relp 2.0.5. Any idea?

jordansissel commented 7 years ago

I'm not well experienced with RELP, but after reviewing your report and reading the RELP spec, I think I see the problem --

It appears that your rsyslog server is sending two open commands for some reason. Can you attach a pcap for both network paths (rsyslog -> logstash, logstash -> rsyslog). The snapshot from tcpdump you sent appears to be only from rsyslog (the client sending logs).

It's unclear to me whether this is an rsyslog or logstash problem, but I believe it is a problem that rsyslog sends open twice, though I'm not sure why it's doing that.

flysen commented 7 years ago

I experience the same problem when sending RELP from rsyslog to logstash-->elastic. If I shut down elastic the pushback for tcp should kickin and start queue up on either logstash|rsyslog-server|client. So, if you don't have proper queue (mainqueues and actionqueues) in rsyslog you will end up in a really slow syslog (by design syslog slows down when not able to process messages). I thought that was my case, but having my queues reviewed by Adiscon GmbH (rsyslog company) and the problem continues.

saliceti commented 7 years ago

I'll try to reproduce the issue in a simpler setup than Cloud Foundry as it adds a lot of noise from all the VMs.

SuperFunks commented 7 years ago

I believe this is the same as my findings on trying to loadbalance logstash relp input behind HAProxy. Somehow the input module fails to renegotiate the connection - and we end up with no logflow.

logstash-plugins / logstash-input-relp

RELP not reconnecting automatically #18