lichess-bot-devs / lichess-bot

A bridge between Lichess bots and chess engines
GNU Affero General Public License v3.0
765 stars 447 forks source link

gate timeout error messages #125

Closed OhJayGee closed 6 years ago

OhJayGee commented 6 years ago

I am seeing a number of errors from lichess-bot following the same pattern (see below). All of them lead to a abort/game over or loss on time.

Does anybody else see that? is that actual connectivity, because the URLs seem correct, and the machines' internet connectivity was fine as well...

best regards, OJG

+++ https://lichess.org/bG43Aj81/black Bullet vs BOT F_Me_Im_Famous(1500?) ERROR:backoff:Backing off api_post(...) for 0.9s (requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://lichess.org/api/bot/game/bG43Aj81/move/d1e2) ERROR:backoff:Giving up api_post(...) after 2 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/bG43Aj81/move/d1e2) Abandoning game due to connection error Traceback (most recent call last): File "lichess-bot.py", line 150, in play_game li.make_move(game.id, best_move) File "/ChessEngines/lichess-bot/lichess.py", line 71, in make_move return self.api_post(ENDPOINTS["move"].format(game_id, move)) File "/ChessEngines/lichess-bot/.venv/lib/python3.6/site-packages/backoff/_sync.py", line 99, in retry ret = target(*args, **kwargs) File "/ChessEngines/lichess-bot/lichess.py", line 61, in api_post response.raise_for_status() File "/ChessEngines/lichess-bot/.venv/lib/python3.6/site-packages/requests/models.py", line 935, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/bG43Aj81/move/d1e2 --- https://lichess.org/bG43Aj81/black Game over

ddugovic commented 6 years ago

I've seen similar messages, but only the case where the opponent hastily aborts the game before the first move.

ShailChoksi commented 6 years ago

This happens because:

  1. Lichess sends game started notification
  2. Lichess-bot starts the engine. While engine is starting: a. Opponent aborts b. Lichess send game ended notification
  3. Engine boots up and tries to play a move

If anyone has a good idea of how to fix this, make a PR or I will get to it once I have some time.

OhJayGee commented 6 years ago

I am rather seeing this in the middle of a game, not at the beginning... I'll try to get some examples with actual timestamps. Is there a way I can turn more verbose logging?

ShailChoksi commented 6 years ago

@OhJayGee you can pass the -v flag: python lichess-bot.py -v for verbose logging

gbtami commented 6 years ago

This can happen if game is finishing by any reason while BOT is thinking and then wants to li.make_move().

@careless25 maybe I'm wrong but I think 2.b. (Lichess send game ended notification) is a non existing thing. If this statement is correct I can imagine a workaround like this:

If HTTPError raised by li.make_move() we can look at BOTs ongoing games (using "/api/account/playing" endpoint) and if our current game is not in the ongoing games list then we know it's over. Otherwise re raise the exception.

What do you think, is this a usable idea or there is something better?

ddugovic commented 6 years ago

It seems the workaround is necessary in any event, since Lichess is online and messages take time to propagate.

OhJayGee commented 6 years ago

newer observation: in the last week I did not observe any timeout flags caused by this error. Did anything change on the backend?

drbeco commented 6 years ago

Thank you @gbtami , for linking bug #128. Here it is a log and a picture that express exactly that:

Here a complete log, since BOOT until error. Very small, because it is in the first move.

# [DEBUG file:xadreco.c line:596]: Starting optarg loop...
# DEBUG MACRO compiled value: 1
# Debug verbose level set at: 1
# play random: no. seed: -r 0
# connection: -c lichess
# wait: -x wait_xboard
# book: -b livro.txt
# Xadreco version 5.84 build 180626.010155 (C) 1994-2018, by Dr. Beco
# Xadreco comes with ABSOLUTELY NO WARRANTY;
# This is free software, and you are welcome to redistribute it
# under certain conditions; Please, visit http://www.fsf.org/licenses/gpl.html
# for details.

# scanf: xboard
# 
# feature done=0
# feature myname="Xadreco 5.84"
# feature ping=0 setboard=1 playother=1 san=0 usermove=0 time=1 draw=1 sigint=0 sigterm=1 reuse=0 analyze=1 variants="normal" colors=0 ics=0 name=0 pause=0 nps=0 debug=1 memory=0 smp=0 exclude=0 setscore=0
# feature done=1
# scanf: protover
# xboard: ignoring protover
# scanf: 2
# xboard: ignoring 2
# scanf: post
# xboard: post. Xadreco will show what its thinking. (1)
# scanf: easy
# xboard: easy. Xadreco stop thinking. (1)
# scanf: ping
# scanf: 123
# pong 123
# scanf: force
# scanf: setboard
# xboard: setboard. Xadreco will set a board position.
# scanf: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
# xboard: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
# scanf: w
# xboard: w
# scanf: KQkq
# xboard: KQkq
# scanf: -
# xboard: -
# scanf: 0
# xboard: 0
# scanf: 1
# xboard: 1
# scanf: force
# xboard: force. Xadreco is in force mode.
# scanf: setboard
# xboard: setboard. Xadreco will set a board position.
# scanf: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
# xboard: rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
# scanf: w
# xboard: w
# scanf: KQkq
# xboard: KQkq
# scanf: -
# xboard: -
# scanf: 0
# xboard: 0
# scanf: 1
# xboard: 1
# scanf: level
# scanf: 0
# scanf: 0:10
# scanf: 0
# xadreco level: 10.0s+0.0s por 60 lances: ajustado para st 0.166667 s por lance
# scanf: go
# xboard: go. Xadreco is now white.
# xadreco : Error. I don't know what to play... Playing a random move (compjoga)!
# move f2f4
ERROR:backoff:Giving up api_post(...) after 1 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/QzqnWDGC/move/f2f4)
ERROR:backoff:Giving up play_game(...) after 12 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/QzqnWDGC/move/f2f4)
Traceback (most recent call last):
  File "/home/beco/Documents/fontes/python/lichess-bot/logging_pool.py", line 16, in __call__
    result = self.__callable(*args, **kwargs)
  File "/home/beco/Documents/fontes/python/lichess-bot/.venv/lib/python3.5/site-packages/backoff/_sync.py", line 99, in retry
    ret = target(*args, **kwargs)
  File "lichess-bot.py", line 126, in play_game
    play_first_move(game, engine, board, li)
  File "lichess-bot.py", line 173, in play_first_move
    li.make_move(game.id, best_move)
  File "/home/beco/Documents/fontes/python/lichess-bot/lichess.py", line 71, in make_move
    return self.api_post(ENDPOINTS["move"].format(game_id, move))
  File "/home/beco/Documents/fontes/python/lichess-bot/.venv/lib/python3.5/site-packages/backoff/_sync.py", line 99, in retry
    ret = target(*args, **kwargs)
  File "/home/beco/Documents/fontes/python/lichess-bot/lichess.py", line 61, in api_post
    response.raise_for_status()
  File "/home/beco/Documents/fontes/python/lichess-bot/.venv/lib/python3.5/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/QzqnWDGC/move/f2f4

You can see, it is the start position. Xadreco sent f2f4 as first move by the time opponent had already aborted the game:

image

lichess-bot tried to forward this move to the site and got a 400 error

400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/QzqnWDGC/move/f2f4

It seems that the bot needs to check if the game is still ongoing before trying to forward the move to the engine. The engine has no way to know that it don't need to move.

To the engine, just send a quit command and it will rest again.

ddugovic commented 6 years ago

Arguably HTTP 409 Conflict makes more sense than HTTP 400 Bad Request due to the opponent "editing" the game by aborting it.

Either way, 12 retries in the face of an HTTP 4xx response seems weird since those indicate Client Errors.

Hmm... maybe something like the following would help: https://github.com/litl/backoff/blob/a2a0132eed09d8e3e8329c09014e6689fd8ef4b1/README.rst#L95

drbeco commented 6 years ago

Here another twist:

1. Engine has only 1 second before losing on time
2. Engine spends 3 seconds thinking
3. Web page gives victory and a rematch button is enabled, clicked, and no response
4. Log shows engine finally decided to play, bot gave 400 error and hangs
5. Bot asks engine to quit after error (probably it was on buffer already)
6. Engine quits and say good bye
7. Bot tries something else I don't get, and get a 404 (accept, probably when I clicked rematch)

Here the complete log:

# xadreco : N.80. Pretas. Tempo 3.000000s. Acumulado: 3.000000s. Hora: 2018-06-29 13:33:03
# xadreco : offer draw (1) value: -3.15
# move c1e3
# offer draw
# xadreco : N.81. Brancas. Tempo 178.000000s. Acumulado: 178.000000s. Hora: 2018-06-29 13:36:01
DEBUG:chess.engine:<PopenProcess at 0x75624770 (pid=28212)> >> move c1e3
DEBUG:chess.engine:<PopenProcess at 0x75624770 (pid=28212)> >> offer draw
DEBUG:urllib3.connectionpool:Resetting dropped connection: lichess.org
DEBUG:urllib3.connectionpool:https://lichess.org:443 "POST /api/bot/game/eNJKQTr6/move/c1e3 HTTP/1.1" 400 None
ERROR:backoff:Giving up api_post(...) after 1 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/eNJKQTr6/move/c1e3)
Traceback (most recent call last):
  File "lichess-bot.py", line 151, in play_game
    li.make_move(game.id, best_move)
  File "/home/beco/Documents/fontes/python/lichess-bot/lichess.py", line 71, in make_move
    return self.api_post(ENDPOINTS["move"].format(game_id, move))
  File "/home/beco/Documents/fontes/python/lichess-bot/.venv/lib/python3.5/site-packages/backoff/_sync.py", line 99, in retry
    ret = target(*args, **kwargs)
  File "/home/beco/Documents/fontes/python/lichess-bot/lichess.py", line 61, in api_post
    response.raise_for_status()
  File "/home/beco/Documents/fontes/python/lichess-bot/.venv/lib/python3.5/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/eNJKQTr6/move/c1e3
DEBUG:chess.engine:<PopenProcess at 0x75624770 (pid=28212)> << quit
# scanf: quit
# xadreco : # Thanks for playing Xadreco.
# xadreco : sai ( 0 )
DEBUG:urllib3.connectionpool:Resetting dropped connection: lichess.org
DEBUG:urllib3.connectionpool:https://lichess.org:443 "POST /api/challenge/eNJKQTr6/accept HTTP/1.1" 404 None
ERROR:backoff:Giving up api_post(...) after 1 tries (requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://lichess.org/api/challenge/eNJKQTr6/accept)
DEBUG:urllib3.connectionpool:https://lichess.org:443 "POST /api/challenge/TgesCs9v/accept HTTP/1.1" 404 None
ERROR:backoff:Giving up api_post(...) after 1 tries (requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://lichess.org/api/challenge/TgesCs9v/accept)

So, it is kind of the same error, but instead of "abort" it was a game that ended due to normal reasons (timeout in this case).

Solution: Bot should see the game is ended, discard any output from engine and ask engine to quit.

ddugovic commented 6 years ago

@drbeco I agree, but isn't that more related to #131 ?

UCI specifies how much time the engine has remaining on the clock, and Stockfish already uses that clock time (via python-chess, via this library) to protect against a time loss.

drbeco commented 6 years ago

Hi @ddugovic , I think it is not related to protocol.

Both Xboard and UCI specifiees how much time the engine has left on the clock. Xboard actually have many ways to deliver time to the engine.

To tell that the engine author should use better the time control and never loses on time is to displace responsibilities from what each program have, mixing things.

Of course I understand engine authors that would not even "imagine" an engine that may lose on time, or that will resign without a fight, or whatever. Most of engine authors simply tries to deliver the most "powerful" move, whatever that means.

Xadreco is much more like a human, it does lose on time, it resigns, it offer draws, if sacrifices without sound continuation; man, it even play random moves every now and them (literally "random").

So, if a engine loses on time, the communication bot should command it (whatever protocol) to stop.

Also, if the engine reply with a move before it could process the command to stop, lichess-bot should just ignore it, instead of trying to post it and get an 404 error in a finished game.

ddugovic commented 6 years ago

@drbeco :

OhJayGee commented 6 years ago

Got a new incident: https://lichess.org/ovJEbL91/white#130 ERROR:backoff:Backing off api_post(...) for 0.6s (requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://lichess.org/api/bot/game/ovJEbL91/move/d5b4) ERROR:backoff:Giving up api_post(...) after 2 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/ovJEbL91/move/d5b4) The question here is: did the bot lose on time and didn't know (like you guys talked about earlier on this thread?) - or did the bot lose on time because of 504 Server Error - wherever that came from?

In a 3'+1 game the bot normally should never lose on time as it can move within 20msec, plus 37ms ping time to the lichess server, and I have configured a 500ms overhead time, so 1 second is plenty. Yet 0.6s penalty for server error could have triggered loss on time.

ddugovic commented 6 years ago

I too (running in OVH!) have observed a single HTTP 504 response:

$ sudo journalctl -u lichess-bot.service | grep -A 2 -B 10 504
Jul 04 15:36:32 vps178028 python3[27801]: +++ Process Free. Total Queued: 0. Total Used: 0
Jul 04 15:36:46 vps178028 python3[27801]:     Accept Bullet casual challenge from Purusa(2242)
Jul 04 15:36:46 vps178028 python3[27801]: --- Process Queue. Total Queued: 1. Total Used: 0
Jul 04 15:36:46 vps178028 python3[27801]: --- Process Used. Total Queued: 0. Total Used: 1
Jul 04 15:36:47 vps178028 python3[27801]: +++ https://lichess.org/LYwbnM4e/white Bullet vs Purusa(2242)
Jul 04 15:37:54 vps178028 python3[27801]: --- https://lichess.org/LYwbnM4e/white Game over
Jul 04 15:37:54 vps178028 python3[27801]: +++ Process Free. Total Queued: 0. Total Used: 0
Jul 04 16:15:00 vps178028 python3[27801]: Something went wrong. Game is starting and we don't have a queued process
Jul 04 16:15:00 vps178028 python3[27801]: --- Process Used. Total Queued: 0. Total Used: 1
Jul 04 16:15:00 vps178028 python3[27801]: +++ https://lichess.org/tfiUOVkq/white UltraBullet vs AI level 8
Jul 04 16:16:35 vps178028 python3[27801]: ERROR:backoff:Backing off api_post(...) for 0.6s (requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://lichess.org/api/bot/game/tfiUOVkq/move/g1g5)
Jul 04 16:16:37 vps178028 python3[27801]: ERROR:backoff:Giving up api_post(...) after 2 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://lichess.org/api/bot/game/tfiUOVkq/move/g1g5)
Jul 04 16:16:37 vps178028 python3[27801]: Abandoning game due to connection error
niklasf commented 6 years ago

This is very likely a problem of Lichess itself, not of the bot. One of the database servers is slower than usual, leading to timeouts every now and then. We're working on it ...

Edit: The issue I am talking about started on the 23rd of June. Timeouts should be much less frequent before that.