GameModel.py <-> CECPEngine.py race condition

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Start a game against a CECPEngine
2. Make 10 or 20 moves
3. Hold down Ctrl-Z to do 10 or 20 undos in succession quickly

What is the expected output? What do you see instead?
If you do it against PyChess engine, something like:
08:47:04 ('Python', '08:02:48.901') Warning: illegal move: line="Illegal move 
c1g5", board=Black kq -
r n b q k b . r 
p p . p . p p . 
. . . P p . . . 
. . Q . P . B p 
. . . . . . . . 
P . N . . N . P 
. P P . . K P . 
R . . . . B . R 

usually followed by a crash like (condensed -- full log is attached as 
self.done-assertion-error.log):

08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     mvs, val 
= alphaBeta (board, depth-1, -beta, -alpha, ply+1)
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsearch.py", line 
144, in alphaBeta
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsearch.py", line 
144, in alphaBeta
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:     else: moves = 
[(-getMoveValue(board,table,depth,m),m) for m in genAllMoves(board)]
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     else: 
moves = [(-getMoveValue(board,table,depth,m),m) for m in genAllMoves(board)]
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsort.py", line 
65, in getMoveValue
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsort.py", line 
65, in getMoveValue
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:     score = 
positionValues[fpiece][board.color][tcord] - \
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     score = 
positionValues[fpiece][board.color][tcord] - \
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error: KeyError: 0
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error: KeyError: 0

08:47:05 ('Python', '08:02:48.901') Error:     mvs, val = quiescent(board, 
alpha, beta, ply)
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     mvs, val 
= quiescent(board, alpha, beta, ply)
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsearch.py", line 
255, in quiescent
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lsearch.py", line 
255, in quiescent
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:     for move in genCaptures (board):
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     for move 
in genCaptures (board):
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lmovegen.py", 
line 233, in genCaptures
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Utils/lutils/lmovegen.py", 
line 233, in genCaptures
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error:     attackBoard = attack45 
[cord][ray45 [cord] & blocker] | \
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error:     
attackBoard = attack45 [cord][ray45 [cord] & blocker] | \
08:47:05 stdout Debug: 
08:47:05 ('Python', '08:02:48.901') Error: KeyError: 9241421619736346624L
08:47:05 stdout Debug: 08:47:05 ('Python', '08:02:48.901') Error: KeyError: 
9241421619736346624L

08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Players/PyChess.py", line 333, 
in run
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Players/PyChess.py", line 333, 
in run
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:     self.__stopSearching()
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:     
self.__stopSearching()
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Players/PyChess.py", line 387, 
in __stopSearching
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/Players/PyChess.py", line 387, 
in __stopSearching
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:     self.worker.get()
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:     
self.worker.get()
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/System/GtkWorker.py", line 
162, in get
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:   File 
"/home/gatto/code/svn/pychess-bugfix/lib/pychess/System/GtkWorker.py", line 
162, in get
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error:     assert self.isDone()
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error:     assert 
self.isDone()
08:47:06 stdout Debug: 
08:47:06 ('Python', '08:02:48.901') Error: AssertionError
08:47:06 stdout Debug: 08:47:06 ('Python', '08:02:48.901') Error: AssertionError

This might have been a cause of issue 523 comment 10 and discussed in r1633 
r1634 and r1635.
The cause of this can be seen here (copy+pasted from the PyChess Information 
Window):

10:38:35
------------------------------------------------------------
got moves Qd8 Qe1 Nxf2 Qxf2 from depth 3
CECPEngine.undoMoves: moves=1 board=White k -
q . . . k b . r 
p . . b p p p . 
. . . p . . . p 
. . . . . . . . 
. . . . P . n . 
. . N . . . . . 
P P P . . P P P 
R . B Q . R K . 
force
?
PyChess.__go: move=3643
PyChess.__go: sanmove='Qd8'
move Qd8
CECPEngine.undoMoves: moves=1 board=White kq -
r . . q k b . r 
p . . b p p p . 
. . B p . . . p 
. . . . . . . . 
. . . . P . n . 
. . N . . . . . 
P P P . . P P P 
R . B Q . R K . 
force
?
undo
undo

Notice that CECPEngine.undoMoves() is called again before it has finished the 
first time.
Forcing the code to get the CECPEngine.changedLock in semisynced() doesn't help
here because the first call to undoMoves() has to wait for the engine to make a
move in _blockTillMove(), and in the process has to give up the changeLock() 
which
provides an opportunity for the second call to undoMoves() to run through before
the first call finished.

This ends up in the engine being sent its commands in the wrong order and 
either a hang,
crash, or simply in the engine not responding anymore.

The solution I've come up with which seems to work so far in the last few weeks
that I've been testing it, is mainly to move @inthread decorator out of
CECPEngine.undoMoves() to GameModel.undoMoves() and then queue the calls to
GameModel.undoMoves().

A patch (which I've haven't completely finished doing code cleanup for) is 
attached.
Please ignore everything in the diff other than changes to @undosync
and the *.undoMoves() which are other fixes/additions and some method and 
variable
renaming crap that I'm too tired to separate out right now or which need to be 
cleaned
up (I'm very much looking forward to the move to Mercurial: hooray for cheap 
local copies
for fixing separate bugs).

Original issue reported on code.google.com by mattgatto on 5 Apr 2010 at 2:27

Attachments:

GameModel-CECPEngine-race-conditions.diff

GoogleCodeExporter commented 9 years ago

Original comment by mattgatto on 5 Apr 2010 at 2:28

Attachments:

self.done-assertion-error.log

GoogleCodeExporter commented 9 years ago

Kinda tricky bug. Good work wrapping it up.
Would it work to add a new lock around undoMoves? Which is not released in 
blockTillMove?
It is mostly because it would be nice if CECPEngine could work independendly of
GameModel. But if I have overseen something - which I probably have - then I 
think
your patch is a good fix, until we redesign that mess. Or drop support for cecp.

Btw., what is that Offer.offerType -> Offer.type rename :p It takes up a thid 
of the
patch ;)

Original comment by lobais on 6 Apr 2010 at 4:37

GoogleCodeExporter commented 9 years ago

The offer.type rename is part of a different set of changes which fixes some 
simpler offer handling 
stuff. I'm sorry for not properly separating out the changes for bugs 548-551, 
that was lazy of me, but I 
wanted to keep everyone updated on what I was working on and had been putting 
it off for too long and 
properly separating out all the different changes I was working on which were 
done on the same code 
tree (svn/pychess-bugfixes) was prohibitively time-consuming. Now that were 
using Mercurial though I 
don't see this happening anymore... or at least, now I don't have an excuse 
with cheap local copies ;).

Back when I was actively working on this bug, I did try the approach of adding 
a separate lock to 
CECPEngine.undoMoves first, and as far as I remember, it ran into deadlocks 
where the 
CECPEngine.undoMoves thread hung at 
CECPEngine.__blockTillMove:self.movecon.wait() because 
the move never came in via CECPEngine.__parseLine because the engine subprocess 
couldn't get 
either the glock (threads_enter) or the python GIL (I never figured out which) 
it needed to emit the line 
with the move. I even tried altering CECPEngine.__blockTillMove to make it give 
up the glock while 
waiting for the move, but that didn't work either (in that the subprocess 
engine still couldn't emit the 
line), maybe because we got there via a bunch of emits, and during an emit from 
the MainThread 
Python acquires the GIL ( 
http://www.pygtk.org/docs/pygtk/gdk-functions.html#function-gdk--threads-
init ). Basically, I came to the conclusion that the only way to do undos 
correctly and without 
encountering deadlocks in the current PyChess codebase is to do them without 
holding the glock 
which means doing them asynchronously of the user telling the program to do 
them through the GUI 
via "Action -> Offer Undo", which is how I arrived at the design in the patch 
of doing undos in a 
separate thread (via @inthread) starting at GameModel.undoMoves. And I'm pretty 
sure that this is the 
only way it can work until PyChess is refactored so that all GUI updates are 
queued and processed 
_only_ by the MainThread, which as you pointed out before, will be the single 
most important change 
towards stability we could make in the future.

Regardless of not being able to get it to work that way, I don't see a reason 
why 
CECPEngine.undoMoves should work independently of GameModel.undoMoves. As you 
know, the 
simple synchronous approach is always to be preferred because it's much more 
simple, easy to read, 
easy to debug, etc., and should be chosen by default unless it can be proved 
that doing otherwise 
provides significant speed benefits, and I don't see that happening here 
because were not really 
waiting for the engine's reply, just asynchronously sending it some commands, 
and we can't really 
proceed with additional GameModel.undoMoves()s anyway until the current one 
completes. In fact, I'd 
argue that the method taken in the patch of putting GameModel.undoMoves in a 
thread should be 
removed too later on after we do an overall refactor, since it's impossible to 
know at the time of the 
user undo request (at BoardControl.actionActivate) that it will be possible to 
undo that many moves 
later on once our request comes up in the queue and is processed by 
GameModel.undoMoves.

In the patch you'll see I separated the CECPEngine.undoMoves (and UCIEngine) 
into 
CECPEngine.playerUndoMoves and CECPEngine.spectatorUndoMoves since their logic 
is completely 
different based on whether the engine is a player or not. Between that, all the 
"if self.mode in 
(ANALYZING, INVERSE_ANALYZING):" in the (CECP,UCI)Engine code, the different 
*.putMove() and 
*.makeMove() methods, and the different logic handling in 
CECPEngine.__parseLine based on whether 
were in spectator or player mode, it's starting to look like CECPEngine (and 
UCIEngine) should be 
broken into separate CECPEnginePlayer and CECPEngineSpectator classes which 
inherit from a 
common CECPEngine with all the common code.

Original comment by mattgatto on 8 May 2010 at 1:40

GoogleCodeExporter commented 9 years ago

Reading this again and again, I have to agree with you, that the current model 
simply is a dead end on fixing this in an elegant way.

I have made some sketches for a model to isolate these things properly. I hope 
to post it in the 1.0 brainstorm during the coming week. I believe it will be 
waterproof and solve the problems you mention. Hopefully you'll as well be able 
to remember some of the pitfalls, so we can get this thing done with.

A design fix will however have to wait until perhaps a 0.10.1 release.

Until then I highly support merging the patch. You may want to move some of the 
GameModel print statements to the log module, other than that I don't see it 
needs much cleaning.

Original comment by lobais on 22 Aug 2010 at 11:08

Changed state: Started

GoogleCodeExporter commented 9 years ago

Fixed in revision 8799a74c72.

Original comment by mattgatto on 15 Oct 2010 at 3:06

Changed state: Fixed

importsfromgooglecode / pychess

GameModel.py <-> CECPEngine.py race condition #548