Process leaks eventually resulting in OOM

tm512 commented 1 year ago

Since setting up lichess-bot, I've noticed that my server eventually becomes unresponsive, forcing me to hard reset, with OOM errors spammed into my system logs. Monitoring the system using top, I notice that lichess-bot will occasionally spawn but fail to kill instances of my chess engine, often in pairs with sequential PIDs which makes me think it's spawning multiple engines at once.

I'm not entirely sure what kind of situation causes this, if anything in particular, since I've failed to catch it happening in real-time to see what lichess-bot prints out to the console, but I can update this with additional info if I finally manage to catch it. I'll probably start piping the output through tee to log everything.

If it's relevant, I'm running lichess-bot on Python 3.9.14 on DragonflyBSD.

MarkZH commented 1 year ago

You can write lichess-bot log messages to a file with -l filename.txt and add verbosity with -v. I'm interested in seeing these logs since this is the first I've heard of this problem. Please upload the entire log from start to OOM error to https://gist.github.com/.

tm512 commented 1 year ago

Looking at the log it appears that lichess-bot is sending the stop UCI command but not following it with quit like it normally does. As soon as the engine gets stop it responds with bestmove, so it seems like it might have something to do with the game being ended mid-search? The log is huge and cluttered by output from a concurrent game so I'm restarting it with concurrency set to 1 so I can get a clean log snippet to paste if the additional log information is necessary.

MarkZH commented 1 year ago

As a first try at fixing this issue, we've added the isready/readyok query between the final stop and quit commands. If the problem is caused by the communication channel being shutdown before the engine can finish responding to the stop command, this may help.

tm512 commented 1 year ago

Still happens even after 73577de9bfe8501e23452570a2445f01721e939a. Here's a gist of the problem happening where it leaks two instances in succession: https://gist.github.com/tm512/7b9dc09872c12af9358afc005117afa8

It kinda seems like the connection to lichess gets interrupted, causing lichess-bot to spawn a new engine instance without sending quit to the previously spawned one, considering all of these are spawned for the exact same game ID.

AttackingOrDefending commented 1 year ago

Can you try #627? Also, can you try the master branch but with the backoff statement at the start of play_game removed (line 409) and post the log? I want to see where the error comes from. Making sure that the engine is closed is a good change which can prevent something similar from happening in the future, but your bot got this error regularly which shouldn't be happening.

tm512 commented 1 year ago

Took a while for the bug to occur with the backoff statement commented out but here it is: https://gist.github.com/tm512/f94c5910d93e195fcc3fd68f930d1ece

I'll give #627 a shot now and give it a few days.

MarkZH commented 1 year ago

Did your logs show excess memory usage or zombie processes before the error?

tm512 commented 1 year ago

PID 10428 featured here was the first one that got leaked after commenting out the backoff statement. Haven't had issues with excess memory consumption since I've been restarting lichess-bot frequently for the sake of testing, and all of these leaked processes do get killed along with lichess-bot.

AttackingOrDefending commented 1 year ago

Your engine didn't follow the searchtime command. In the logs it is told to search for 10 seconds but it searches for 20 until the stop command stops it (21:30:22-21:30:42), so I assume the game was aborted, which is why stop was sent, but I don't know if this caused the error. It doesn't look like it was caused by lichess-bot. Maybe we should open an issue in python-chess. @MarkZH What do you think caused the error?

MarkZH commented 1 year ago

I think that the first cause of the error was an asyncio.exceptions.TimeoutError. This was thrown from the python-chess package at this line. The timeout was set to 20 seconds in the self._timeout_for(limit) where limit is the game clock settings (10 seconds for a move in this case) plus a default 10 seconds from the SimpleEngine constructor. I might be wrong about this because the error message in the log says that the TimeoutError was caused by a previous exception: the asyncio.exceptions.CancelledError. I'm having trouble following the origin of that error in the communicate() function in python-chess.

In any case, the exception causes the play_game() function to exit without calling engine.quit(), causing the engine process to never end, which probably causes the game thread to never terminate.

I can see at least two options:

Set timeout=None argument in the engine constructors here and here. This would allow the engine to complete whatever it is doing no matter how long ago the clock ran out.
Catch the asyncio.exceptions.TimeoutError here. This would allow for timely reception of the next game stream event. The engine would be stopped 10 seconds after the game clock ran out.

AttackingOrDefending commented 1 year ago

I prefer the 2nd way, since the engine could have bugs preventing it from stopping.

The source of the error seems to be that the engine doesn't have searchtime implemented. tm512 has an opening book to avoid this but when the opponent plays uncommon first moves (g2g4 in the last log, or h2h4 in the first log) the opening book doesn't have a move and the engine enters an infinite search (as can be seen here, because max_time is 0 in uci.cpp). What @tm512 can do is open the book in a GUI (e.g. Scid vs PC) and add a response for all possible white moves, or implement a basic searchtime command (e.g. set max_time to searchtime, even though the engine would search for less than searchtime time).

MarkZH commented 1 year ago

I just added a PR that should directly solve this problem. @tm512, please test and report if it properly handles aborted games.

tm512 commented 1 year ago

I've run #627 for a few days and that does seem to prevent the process leaks. I can switch over to trying the changes implemented in #628 instead, but if I implement go movetime in my engine would that mess with testing that change? I can delay implementing it if necessary. Since the opening book handles things in most situations, it slipped past me that lichess-bot was relying on this parameter, so that's my bad. Most of the UCI code was written back in 2014 and only really covered what was necessary to have it playing through polyglot + icsdrone on FICS.

MarkZH commented 1 year ago

If you want to test #628, you don't have to delay long. Challenge your bot to a game with your own account and make the same opening move (g4) as the last game that resulted in timing out.

AttackingOrDefending commented 1 year ago

if I implement go movetime in my engine would that mess with testing that change?

Yes. You have to use the version without movetime.

tm512 commented 1 year ago

I can confirm that with the latest HEAD on the master branch, the process gets shut down properly in instances of timeout. I'll get around to implementing go movetime soon to prevent this from occurring in the first place.

lichess-bot-devs / lichess-bot

Process leaks eventually resulting in OOM #623