PrematureCloseException on HTTP requests from datacollectors

noi-techpark / bdp-core

Open Data Hub / Timeseries Core

https://opendatahub.com

PrematureCloseException on HTTP requests from datacollectors #262

Closed dulvui closed 1 year ago

dulvui commented 1 year ago

Some data collectors occasionally get this exception when pushing data or requesting the date of the last record from the writer endpoint: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response. It happens randomly, mostly when many requests are made in a short time, as with the traffic-a22-elaborations data collector.

After searching online, it seems that the cause could be a wrong configuration or missing parameters of the AWS load balancers, the Tomcat server (in this case the Tomcat Docker container), or Java options, as mentioned here: https://github.com/reactor/reactor-netty/issues/1764

clezag commented 1 year ago

duplicate https://github.com/noi-techpark/bdp-core/issues/258

clezag commented 1 year ago

After analyzing the log files, this correlates strongly with configuration reloads of our Apache proxy, triggered by the Let's Encrypt daemon, usually every day at around 01:03. The actual technical cause might be a race condition in which a keepalive connection is shut down during the reload while the client is already transmitting a request on it.

To solve this issue, we could take a few (non-exclusive) approaches:

- implement retry logic on the data collector side (as requested in #258)
- investigate why the daemon does multiple reloads and reduce it to a single one
- update Apache to a newer version that might resolve the issues (https://github.com/noi-techpark/infrastructure/issues/68)
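
For illustration, the first approach (retry on the data collector side) could look roughly like the sketch below. This is not the bdp-core or data collector implementation: the class RetryingCall, the method withRetry, and its parameters are hypothetical names, and it assumes the failure surfaces to the caller as an IOException. It also assumes the request is safe to repeat, which needs checking for data pushes.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of bdp-core.
final class RetryingCall {

    /**
     * Runs the call up to maxAttempts times. After each failed attempt it
     * sleeps, doubling the delay every time (initialDelay, 2x, 4x, ...).
     * Only IOExceptions are retried; any other exception propagates at once.
     */
    static <T> T withRetry(Callable<T> call, int maxAttempts, Duration initialDelay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMillis = initialDelay.toMillis();
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, surface the last error
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request: fails twice with an IOException, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection prematurely closed BEFORE response");
            }
            return "ok";
        }, 5, Duration.ofMillis(10));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a collector that already uses Reactor, the same idea can probably be expressed with Mono.retryWhen and Retry.backoff instead of a hand-written loop; which variant fits depends on how the collector issues its requests.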

clezag commented 1 year ago

The reason for the multiple reloads every night was out-of-date Let's Encrypt config files for sites that either do not exist anymore or are hosted by a different webserver. The Let's Encrypt daemon tried to renew all these certificates (~50) every night, leading to ~100 reloads, during which the errors occurred.

I have removed all the failing renewal configurations, so now reloads will be reduced to a necessary minimum (twice every time a certificate is about to expire and is renewed).

This should hopefully resolve the "premature close" errors, or at least make them very improbable until we've updated our proxy setup.

clezag commented 1 year ago

As expected, the incidence of this issue has decreased significantly, but it still happens once in a while: 5 times within the last week. Strangely, three of those instances happened outside of any Apache reload times (as far as I can tell). KQL query to verify: json.level : "ERROR" and json.message: *Premature*
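
The comparison of error times against reload times described above can be sketched as a small check. Everything here is an assumption for illustration: the class ReloadCorrelation, the method nearReload, the one-minute tolerance, and the timestamps are made up and do not come from the actual logs.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.List;

// Hypothetical helper for illustration; timestamps below are invented.
final class ReloadCorrelation {

    /** Returns true if the error lies within `tolerance` of any reload time. */
    static boolean nearReload(LocalDateTime error, List<LocalDateTime> reloads,
                              Duration tolerance) {
        for (LocalDateTime reload : reloads) {
            // abs() makes the check symmetric: shortly before or after a reload.
            if (Duration.between(reload, error).abs().compareTo(tolerance) <= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<LocalDateTime> reloads =
            List.of(LocalDateTime.parse("2023-01-10T01:03:00"));
        List<LocalDateTime> errors = List.of(
            LocalDateTime.parse("2023-01-10T01:03:20"),
            LocalDateTime.parse("2023-01-10T14:30:00"));
        for (LocalDateTime e : errors) {
            System.out.println(e + " near reload: "
                + nearReload(e, reloads, Duration.ofMinutes(1)));
        }
    }
}
```

Errors that come back as not near any reload would be the ones worth investigating for a second cause.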