mp911de / logstash-gelf

Graylog Extended Log Format (GELF) implementation in Java for all major logging frameworks: log4j, log4j2, java.util.logging, logback, JBossAS7 and WildFly 8-12
http://logging.paluch.biz
MIT License

Allow TCP and UDP sender creation for unresolvable hostnames #252

Closed: MarkLehmacherInvestify closed this issue 2 years ago

MarkLehmacherInvestify commented 3 years ago

I am running a bunch of microservices in a kubernetes cluster. The microservices are using the log4j2 logstash-gelf appender to emit log events to a remote logstash service.

When the microservices are started before the logstash service is available, the appender never actually tries to reconnect to logstash.

I get one error on stdout at service start: main ERROR Unknown GELF server hostname:tcp:shared-logstash?readTimeout=10s&connectionTimeout=2000ms&deliveryAttempts=5&keepAlive=true

From then on, I get the following error for each individual log event: Log4j2-TF-1-AsyncLoggerConfig-1 ERROR Could not send GELF message

When the logstash service is available within the cluster and I restart the microservice at that point, the appender connects and everything works as expected.

However, I would expect the appender to eventually retry the connection without having to restart the whole microservice container. What is the intended/specified behavior here?

I am using 1.14.1.

MarkLehmacherInvestify commented 3 years ago

Apparently at least one of the potential GELF sender providers performs a synchronous host lookup at creation time, which in turn results in an UnknownHostException that is reported from within GelfSenderFactory. The appender is subsequently left without a GELF sender and reports the "Could not send GELF message" error from then on.
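For illustration, here is a minimal sketch of the failure mode described above, assuming eager resolution via InetAddress.getByName (hypothetical code, not the library's actual implementation):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class EagerLookupSketch {

    public static void main(String[] args) {
        try {
            // Synchronous lookup at sender-creation time: if the DNS entry
            // (e.g. a Kubernetes service) does not exist yet, this throws
            // immediately and the appender is left without a sender.
            InetAddress address = InetAddress.getByName("shared-logstash");
            System.out.println("Resolved to " + address.getHostAddress());
        } catch (UnknownHostException e) {
            // Reported once at startup ("Unknown GELF server hostname ...")
            // and, in the behavior described above, never retried.
            System.err.println("Unknown GELF server hostname: " + e.getMessage());
        }
    }
}
```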

The code clearly did not anticipate modern cloud environments with very dynamic dns entries ;)

mp911de commented 3 years ago

DNS lookups are expensive; that's why we decided to do the lookup only once, during startup. I'm not sure that "modern" describes something that causes more problems than it solves.

Do you have a suggestion for how to enable dynamic DNS lookups without introducing a performance penalty for users who don't require such functionality?
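One possible direction, sketched purely as an illustration (the class and method names below are hypothetical, not part of logstash-gelf): resolve the host lazily, cache the result, and re-resolve only after a send failure, so users with stable DNS pay for at most one lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: caches the resolved address and refreshes it only on failure.
public class FailureTriggeredResolver {

    private final String hostname;
    private volatile InetAddress cached;

    public FailureTriggeredResolver(String hostname) {
        this.hostname = hostname;
    }

    /** Returns the cached address, resolving lazily on first use. */
    public InetAddress get() throws UnknownHostException {
        InetAddress address = cached;
        if (address == null) {
            address = InetAddress.getByName(hostname);
            cached = address;
        }
        return address;
    }

    /** Called by the sender after an I/O error: drop the cache so the next send re-resolves. */
    public void invalidate() {
        cached = null;
    }
}
```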

MarkLehmacherInvestify commented 3 years ago

Well, as far as I see it there is no going back with regard to cloud environments, no matter our personal opinions on that ;)

As far as I understand the code right now, there are actually several causes that can leave the appender without a sender, in which case it will never recover from the failure. The first decision is between two choices:

  1. avoid having the appender end up without a sender; this probably means deferring the creation of the socket (I am looking at the TCP code) until "later"
  2. make sure the appender has some way to recover from a missing sender (see the sketch after this list)
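As a sketch of the second option (hypothetical types, not the actual appender or GelfSender code): keep the sender reference nullable and retry its creation on the next log event instead of giving up permanently.

```java
// Hypothetical recovery wrapper: retries sender creation on the next log event.
public class RecoveringSenderHolder {

    /** Minimal stand-in for a GELF sender. */
    public interface Sender {
        void send(String message) throws Exception;
    }

    /** Factory that may fail, e.g. because the hostname is not resolvable yet. */
    public interface SenderFactory {
        Sender create() throws Exception;
    }

    private final SenderFactory factory;
    private volatile Sender sender;

    public RecoveringSenderHolder(SenderFactory factory) {
        this.factory = factory;
    }

    public void append(String message) {
        try {
            if (sender == null) {
                // Creation failed at startup (or after a previous error); try again now.
                sender = factory.create();
            }
            sender.send(message);
        } catch (Exception e) {
            // Drop the sender so the next append retries creation from scratch.
            sender = null;
            System.err.println("Could not send GELF message: " + e.getMessage());
        }
    }
}
```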

I am afraid I am not really qualified to come up with a quick solution however :(

hartman commented 3 years ago

I experienced a similar problem. We had an entire virtual machine cluster in one of our DCs go down, and apparently the app + GELF logger was started before some other elements had fully recovered. The name lookup failed and was never retried, and we got a continuous stream of "Could not send GELF message" errors instead, requiring the service to be restarted.

mp911de commented 2 years ago

We can turn the hostname lookup into a warning when it fails.
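A minimal sketch of what that could look like (assumed code, not the actual change): catch the resolution failure at startup, log a warning instead of treating it as fatal, and let the name be resolved again on the next connection attempt.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LenientStartupLookup {

    /**
     * Tries to resolve the configured host at startup. A failure is only logged
     * as a warning, so the appender can still be wired up and resolve the name
     * again when it actually needs to connect.
     */
    public static InetAddress tryResolve(String hostname) {
        try {
            return InetAddress.getByName(hostname);
        } catch (UnknownHostException e) {
            System.err.println("WARN Unknown GELF server hostname " + hostname
                    + ", will retry on the next connection attempt");
            return null;
        }
    }
}
```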