When the lazymc status server errors, lazymc-docker-proxy becomes unhealthy and parks the logging thread forever

lynlevenick commented 1 month ago

I’m getting logs that read

ERROR hostname::lazymc             > Closing connection, error occurred
DEBUG lazymc-docker-proxy::health  Setting health status to: UNHEALTHY
ERROR lazymc-docker-proxy::health  Error detected. Pausing application...

which I believe to be coming from lazymc’s status server. It emits an error whenever there’s an error reading the initial packet from the client (code here). I believe that this is a completely recoverable error in lazymc, where the error serving this status can be safely ignored. Unfortunately, lazymc-docker-proxy dutifully reads that an error has occurred and immediately parks the logging thread, which prevents any further logs from being emitted and may or may not cause the failure to connect my users are seeing, which I have little visibility into because I don’t have logs.

I believe you could reproduce this by opening a connection to lazymc-docker-proxy’s lazymc server and closing it immediately or leaving it open without sending any packets until timeout, though I haven’t had the chance to personally check this.

joesturge commented 1 month ago

@lynlevenick thanks for the raising this issue. my assumption was if the container goes unhealthy docker would restart but it appears that this is not actually the case. What should happen in this case do you think?

lynlevenick commented 1 month ago

If you kill PID 1, the container will behave according to its restart policy. However, since the behavior is now to kill all running servers when the application starts, that could cause a chain of events like:

A user is connected to server A
A malformed connection is made to server B
lazymc-docker-proxy sees the error and restarts
As it restarts, lazymc-docker-proxy turns off all servers it manages, including server A with the live player

There’s a possibility of restarting the lazymc instance that the message is coming from, but I’m not sure if this error (or other recoverable errors) can occur while players are live on the server. If it can, I believe that could drop players as well.

It might be better to just ignore this particular error? If you do this though there’s always the possibility that another new one gets added or that this was insufficient. I think if I were doing it I would watch the lazymc process under the assumption that if an unrecoverable error occurs it would close itself, then restart it as needed, but I haven’t fully read through the lazymc source code.

joesturge commented 1 month ago

@lynlevenick ive got a pr to make it not park the thread here: https://github.com/joesturge/lazymc-docker-proxy/pull/112 please review, I think try it with this when it is released to see what the following logs occur afterwards

lynlevenick commented 1 month ago

Sounds good, if you publish an image I can get you some logging.

joesturge commented 1 month ago

@lynlevenick v2.5.2 is live with this change, let me know what happens

lynlevenick commented 1 month ago

Apologies for the delay - I have logs!

 TRACE a::lazymc::monitor                   > Fetching status for 172.18.0.4:25566 ... 
 TRACE b::lazymc::monitor                   > Fetching status for 172.18.0.3:25565 ... 
 TRACE a::mio::poll                         > registering event source with poller: token=Token(1761607681), interests=READABLE | WRITABLE
 TRACE b::mio::poll                         > registering event source with poller: token=Token(1761607681), interests=READABLE | WRITABLE
 ERROR b::lazymc                            > Closing connection, error occurred
 DEBUG lazymc-docker-proxy::health          > Setting health status to: UNHEALTHY
 ERROR lazymc-docker-proxy::health          > Application is unhealthy.
 TRACE b::mio::poll                         > deregistering event source from poller
 TRACE a::mio::poll                         > deregistering event source from poller
 TRACE a::lazymc::monitor                   > Fetching status for 172.18.0.4:25566 ... 
 TRACE a::mio::poll                         > registering event source with poller: token=Token(1778384897), interests=READABLE | WRITABLE
 TRACE b::mio::poll                         > deregistering event source from poller
 TRACE b::lazymc::monitor                   > Fetching status for 172.18.0.3:25565 ... 
 TRACE b::mio::poll                         > registering event source with poller: token=Token(1778384897), interests=READABLE | WRITABLE
 TRACE b::mio::poll                         > registering event source with poller: token=Token(318767109), interests=READABLE | WRITABLE
 TRACE b::mio::poll                         > deregistering event source from poller
 TRACE b::lazymc::monitor                   > Fetching status for 172.18.0.3:25565 ...

It seems like this error is recoverable - the lazymc instance continues working after logging that error. My players were able to join and have the server start as desired.

joesturge commented 1 month ago

thats, good, perhaps i should just stop it from setting unhealthy altogether if any error occurs, your right that we don't want to get into adding many special cases and run the risk of missing something, removing it will subvert this risk

joesturge commented 1 week ago

Im going to close this issue now, if you have any other problems please open a new issue, cheers!

joesturge / lazymc-docker-proxy

When the lazymc status server errors, lazymc-docker-proxy becomes unhealthy and parks the logging thread forever #110