Closed tm512 closed 1 year ago
You can write lichess-bot log messages to a file with -l filename.txt
and add verbosity with -v
. I'm interested in seeing these logs since this is the first I've heard of this problem. Please upload the entire log from start to OOM error to https://gist.github.com/.
Looking at the log it appears that lichess-bot is sending the stop
UCI command but not following it with quit
like it normally does. As soon as the engine gets stop
it responds with bestmove
, so it seems like it might have something to do with the game being ended mid-search? The log is huge and cluttered by output from a concurrent game so I'm restarting it with concurrency set to 1 so I can get a clean log snippet to paste if the additional log information is necessary.
As a first try at fixing this issue, we've added the isready/readyok
query between the final stop
and quit
commands. If the problem is caused by the communication channel being shutdown before the engine can finish responding to the stop
command, this may help.
Still happens even after 73577de9bfe8501e23452570a2445f01721e939a. Here's a gist of the problem happening where it leaks two instances in succession: https://gist.github.com/tm512/7b9dc09872c12af9358afc005117afa8
It kinda seems like the connection to lichess gets interrupted, causing lichess-bot to spawn a new engine instance without sending quit
to the previously spawned one, considering all of these are spawned for the exact same game ID.
Can you try #627? Also, can you try the master branch but with the backoff
statement at the start of play_game
removed (line 409) and post the log? I want to see where the error comes from. Making sure that the engine is closed is a good change which can prevent something similar from happening in the future, but your bot got this error regularly which shouldn't be happening.
Took a while for the bug to occur with the backoff
statement commented out but here it is: https://gist.github.com/tm512/f94c5910d93e195fcc3fd68f930d1ece
I'll give #627 a shot now and give it a few days.
Did your logs show excess memory usage or zombie processes before the error?
PID 10428 featured here was the first one that got leaked after commenting out the backoff
statement. Haven't had issues with excess memory consumption since I've been restarting lichess-bot frequently for the sake of testing, and all of these leaked processes do get killed along with lichess-bot.
Your engine didn't follow the searchtime
command. In the logs it is told to search for 10 seconds but it searches for 20 until the stop
command stops it (21:30:22-21:30:42), so I assume the game was aborted, which is why stop was sent, but I don't know if this caused the error. It doesn't look like it was caused by lichess-bot. Maybe we should open an issue in python-chess
. @MarkZH What do you think caused the error?
I think that the first cause of the error was an asyncio.exceptions.TimeoutError
. This was thrown from the python-chess package at this line. The timeout was set to 20 seconds in the self._timeout_for(limit)
where limit
is the game clock settings (10 seconds for a move in this case) plus a default 10 seconds from the SimpleEngine
constructor. I might be wrong about this because the error message in the log says that the TimeoutError
was caused by a previous exception: the asyncio.exceptions.CancelledError
. I'm having trouble following the origin of that error in the communicate()
function in python-chess.
In any case, the exception causes the play_game()
function to exit without calling engine.quit()
, causing the engine process to never end, which probably causes the game thread to never terminate.
I can see at least two options:
timeout=None
argument in the engine constructors here and here. This would allow the engine to complete whatever it is doing no matter how long ago the clock ran out.asyncio.exceptions.TimeoutError
here. This would allow for timely reception of the next game stream event. The engine would be stopped 10 seconds after the game clock ran out.I prefer the 2nd way, since the engine could have bugs preventing it from stopping.
The source of the error seems to be that the engine doesn't have searchtime
implemented. tm512 has an opening book to avoid this but when the opponent plays uncommon first moves (g2g4
in the last log, or h2h4
in the first log) the opening book doesn't have a move and the engine enters an infinite search (as can be seen here, because max_time
is 0
in uci.cpp
). What @tm512 can do is open the book in a GUI (e.g. Scid vs PC) and add a response for all possible white moves, or implement a basic searchtime
command (e.g. set max_time
to searchtime
, even though the engine would search for less than searchtime
time).
I just added a PR that should directly solve this problem. @tm512, please test and report if it properly handles aborted games.
I've run #627 for a few days and that does seem to prevent the process leaks. I can switch over to trying the changes implemented in #628 instead, but if I implement go movetime
in my engine would that mess with testing that change? I can delay implementing it if necessary. Since the opening book handles things in most situations, it slipped past me that lichess-bot was relying on this parameter, so that's my bad. Most of the UCI code was written back in 2014 and only really covered what was necessary to have it playing through polyglot + icsdrone on FICS.
If you want to test #628, you don't have to delay long. Challenge your bot to a game with your own account and make the same opening move (g4) as the last game that resulted in timing out.
if I implement go movetime in my engine would that mess with testing that change?
Yes. You have to use the version without movetime.
I can confirm that with the latest HEAD on the master branch, the process gets shut down properly in instances of timeout. I'll get around to implementing go movetime
soon to prevent this from occurring in the first place.
Since setting up lichess-bot, I've noticed that my server eventually becomes unresponsive, forcing me to hard reset, with OOM errors spammed into my system logs. Monitoring the system using top, I notice that lichess-bot will occasionally spawn but fail to kill instances of my chess engine, often in pairs with sequential PIDs which makes me think it's spawning multiple engines at once.
I'm not entirely sure what kind of situation causes this, if anything in particular, since I've failed to catch it happening in real-time to see what lichess-bot prints out to the console, but I can update this with additional info if I finally manage to catch it. I'll probably start piping the output through tee to log everything.
If it's relevant, I'm running lichess-bot on Python 3.9.14 on DragonflyBSD.