hquxmu / ai-contest

Automatically exported from code.google.com/p/ai-contest

Games lost in first turn. Unknown reason #202

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
http://ai-contest.com/visualizer.php?game_id=5869245
http://ai-contest.com/visualizer.php?game_id=5867241
http://ai-contest.com/visualizer.php?game_id=5865078
http://ai-contest.com/visualizer.php?game_id=5865072
http://ai-contest.com/visualizer.php?game_id=5863264
http://ai-contest.com/visualizer.php?game_id=5862250
http://ai-contest.com/visualizer.php?game_id=5860556

All of these games were lost in the first turn, for no known reason. The games 
replay locally with the same bot without issues.

Seems to be happening to other C# players as well, so may be related:

http://ai-contest.com/forum/viewtopic.php?f=18&t=806
http://ai-contest.com/forum/viewtopic.php?f=18&t=888 (Ben and Zilog users)

And many more samples in the forums.

L

Original issue reported on code.google.com by lord.loc...@gmail.com on 13 Oct 2010 at 5:28

GoogleCodeExporter commented 8 years ago
I had a similar problem with my Lua submission.
I uploaded a C++ Lua interpreter and a Lua bot.
It worked fine with the TCP server, but failed on the first turn of every game 
without any error message.
The first time it turned out to be my own bug: I called "MyBot.lua" from the 
C++ code, though the actual file name was "mybot.lua", so my C++ interpreter 
failed to find the script and terminated.
The second time all games worked fine except on the worker=25 computer, where 
it failed every time I played a game on that server.
So I re-uploaded my submission, and after that it worked fine.

My most pressing problem with bot games is still there, though: unexpected 
losses without an error message even though my bot seemed to be winning, like this: 
http://ai-contest.com/profile.php?user_id=9336

Original comment by buratin....@gmail.com on 15 Oct 2010 at 5:56

GoogleCodeExporter commented 8 years ago
I have the same problem with my latest submission (in C#): all games are lost 
in the first turn, and no error message is displayed in the game window.

My bot has been tested locally on dozens of maps, and I don't get any error on 
the first turn or later.

http://www.ai-contest.com/visualizer.php?game_id=5900455
http://www.ai-contest.com/visualizer.php?game_id=5899939
http://www.ai-contest.com/visualizer.php?game_id=5896566

Original comment by parvuval...@gmail.com on 15 Oct 2010 at 10:25

GoogleCodeExporter commented 8 years ago
[deleted comment]
GoogleCodeExporter commented 8 years ago
The issue happens even with the starter package (csharp_starter_package.zip).

http://ai-contest.com/visualizer.php?game_id=5960264

My guess: environment startup is too slow, so the program cannot always be 
started within 3 seconds. 
Possible solution: give the bot some time after startup and before sending 
commands. This doesn't require a change to the rules, because the time for the 
first turn will still be 3 seconds.

Original comment by dmitrisc...@gmail.com on 19 Oct 2010 at 3:35

GoogleCodeExporter commented 8 years ago
Could it be a "not enough memory" issue?

Original comment by buratin....@gmail.com on 19 Oct 2010 at 9:51

GoogleCodeExporter commented 8 years ago
I think the problem is a simple CPU scheduling problem.  It affects all bots 
regardless of language.  If I understand correctly, I think the game server 
works like this:

1. Start bot 1 as a new process
2. Start bot 2 as a new process
3. Send the game state to both bots using a blocking write
4. Until the time limit is reached, poll each bot using a non-blocking read

Now how much actual CPU time does each bot get?  It depends entirely on process 
scheduling, which is not under the control of the game engine.  I think 
everything is running inside a virtual server instance, so if for some reason 
both bots end up on the same "virtual CPU", they could end up competing for 
time slices.  If there are other processes besides the game engine and the bots 
running, the situation is even worse.

I suggest the game engine should be modified so that only one bot is active at 
a time.  That gives the active bot the best possible chance that it will 
actually get its fair share of CPU time.  It would work like this:

1. Start bot 1 as a new process
2. Start bot 2 as a new process
3. Send the game state to bot 1
4. Poll bot 1 for its moves.  Save them, but do not modify the game state yet
5. Send the game state to bot 2
6. Poll bot 2 for its moves.  Save them, but do not modify the game state yet
7. Update the game state with the moves read during the turn

Even though bot 1 moves first, bot 2 doesn't see what bot 1 did, so there is no 
gameplay advantage to the turn scheduling.  Assuming that bot 2 is blocked 
waiting for the game state while bot 1 is moving, the bots will not end up 
competing for CPU time.
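
The proposed one-bot-at-a-time schedule can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contest engine's actual code: the stand-in bot (`BOT_SOURCE`) and the helper names `start_bot`, `ask_bot`, `play_turn`, and `demo` are all hypothetical, the state and move formats are made up, and `select` on pipes assumes a POSIX system.

```python
import select
import subprocess
import sys

# Hypothetical stand-in bot: reads one state line, answers with one move line.
BOT_SOURCE = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write('move-for-' + line)\n"
    "    sys.stdout.flush()\n"
)


def start_bot():
    """Start one bot as a new process with unbuffered pipes."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", BOT_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        bufsize=0,  # raw pipes, so select() sees exactly what is readable
    )


def ask_bot(bot, state, timeout):
    """Send the state to one bot, then block until it answers or times out."""
    bot.stdin.write((state + "\n").encode())
    ready, _, _ = select.select([bot.stdout], [], [], timeout)
    if not ready:
        return None  # no answer within the time limit
    return bot.stdout.readline().decode().rstrip("\n")


def play_turn(bots, state, timeout):
    """Collect moves one bot at a time; the state is not modified in between."""
    moves = []
    for bot in bots:
        # Only this bot has input to work on; the others are blocked on stdin.
        moves.append(ask_bot(bot, state, timeout))
    return moves  # the caller applies all moves to the game state at once


def demo(timeout=10.0):
    bots = [start_bot(), start_bot()]
    try:
        return play_turn(bots, "turn-1", timeout)
    finally:
        for bot in bots:
            bot.stdin.close()  # EOF lets the stand-in bot exit
            bot.wait()
            bot.stdout.close()


if __name__ == "__main__":
    print(demo())
```

Because every bot receives the same `state` string and moves are only collected, not applied, inside `play_turn`, the second bot cannot observe the first bot's move.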

A disadvantage to this arrangement is that spare CPU cycles may go unused, and 
games may take longer on average to play.  However, consider that people may 
start deciding to re-submit their bot each time they lose due to bad scheduling 
luck.  That too will create churn on the game server.  I think it is better to 
play the game more fairly, even if that means playing it a little more slowly.

Original comment by jklan...@gmail.com on 22 Oct 2010 at 9:35

GoogleCodeExporter commented 8 years ago
I am troubleshooting the user sandbox code, and it looks like there is a 
serious problem with SSH input buffering.  The tournament manager communicates 
with the bots over SSH, but SSH is holding on to the input in its buffer.  It 
never reaches the bot.

Original comment by jklan...@gmail.com on 24 Oct 2010 at 8:19

GoogleCodeExporter commented 8 years ago
Just in case you haven't noticed, ssh is only used on the main server. Of 
course it's still a problem if it's not working correctly.

Original comment by janzert on 24 Oct 2010 at 9:49

GoogleCodeExporter commented 8 years ago
Sorry folks, what I thought was an SSH buffering issue was actually a buffering 
issue elsewhere, and it was not in the production code but in a piece of 
code that I had modified.  To make a long story short, in Python you _must_ use 
file.readline(), you _cannot_ do file.next(), if you want to read just one line 
from a pipe.
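
For context, Python 2's file.next() reads ahead into a hidden buffer, so it can consume more of the pipe than the one line it returns. A minimal sketch of the readline() approach is below; it is written for Python 3 with an unbuffered reader, uses an in-process `os.pipe()` instead of a real bot, and the function name `read_one_line_demo` is made up for illustration.

```python
import os


def read_one_line_demo():
    """Write two lines into a pipe, then consume exactly one of them."""
    read_fd, write_fd = os.pipe()
    # buffering=0 gives a raw reader, so readline() pulls no extra bytes.
    reader = os.fdopen(read_fd, "rb", buffering=0)
    try:
        os.write(write_fd, b"turn 1\nturn 2\n")
        first = reader.readline()           # stops right after the first newline
        leftover = os.read(read_fd, 1024)   # the second line is still in the pipe
        return first, leftover
    finally:
        reader.close()   # also closes read_fd
        os.close(write_fd)


if __name__ == "__main__":
    print(read_one_line_demo())
```

The second value shows that the remaining line was left in the pipe for a later read rather than being swallowed by a read-ahead buffer.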

Original comment by jklan...@gmail.com on 25 Oct 2010 at 2:55

GoogleCodeExporter commented 8 years ago
I can't be 100% sure, but after observing 22 games, looking at game_info.php to 
see on which worker they were played, it seems for my bot there's a two way 
implication: game played on worker 0 <-> game lost in first turn. I.e. all 
games played on worker 0 failed to start, all games on other workers played 
fine. http://ai-contest.com/profile.php?user_id=9786

buratin.barabanus also indicates that he did not receive any error message on 
the failed games. This also points to worker 0 - which did not have error 
reporting at that time.

Original comment by tjverw...@gmail.com on 25 Oct 2010 at 5:02

GoogleCodeExporter commented 8 years ago
For me and some other people on the forums, the bad worker was 55.  Perhaps 
it's not so much that there is a bad machine, but that a machine gets in a 
certain state and then it starts a failing streak.  So far I have run over a 
hundred games locally using the exact same game engine as is used in the cloud, 
and I have yet to reproduce the behavior.  I even tried re-testing some of the 
first-turn-failed maps using my bot playing against itself or other bots.  No 
timeouts.

I talked with janzert about server load the other day, and he said that load 
averages on the game servers are not particularly high.  Having thought about 
it further, those averages are per minute I think.  So even if the load average 
for a minute is low, it is still possible that during the second which a bot 
gets to move, the CPU is loaded and the bot can't run.  You can slam a core for 
several seconds and still have the overall average load be very low.
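
To put rough numbers on that, here is a small sketch with made-up figures: one core fully busy for 3 seconds out of a 60-second window. It uses a plain arithmetic mean for simplicity, whereas the kernel's load average is an exponentially damped figure, so treat it as an illustration of the effect only.

```python
def summarize(samples):
    """Return (mean, peak) utilization for per-second samples in [0.0, 1.0]."""
    mean = sum(samples) / len(samples)
    peak = max(samples)
    return mean, peak


# Hypothetical minute: the core is saturated for 3 seconds, idle for 57.
one_minute = [1.0] * 3 + [0.0] * 57

if __name__ == "__main__":
    mean, peak = summarize(one_minute)
    print("mean utilization: %.2f, peak utilization: %.2f" % (mean, peak))
```

A bot whose one-second turn falls inside those 3 busy seconds sees a fully loaded core, even though the minute as a whole looks almost idle.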

I am working on an experimental version of the Python engine which avoids 
polling and therefore uses less CPU time than the one used in production.  For 
some reason it uses more overall wallclock time to run games.  I don't 
understand what is going on.  Below are some example times for the same game.  
Perhaps they don't mean what I think they mean.

Production Engine:

real    0m3.875s
user    0m3.280s
sys     0m0.310s

Experimental Engine:

real    0m4.549s
user    0m0.230s
sys     0m0.080s
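
One way to read these figures: "real" is elapsed wall-clock time, while "user" and "sys" are CPU time spent in user and kernel mode. The sketch below computes the share of wall-clock time spent on the CPU for the two runs above. The helper names `parse_time` and `cpu_fraction` are made up for illustration, and the calculation does not by itself explain why the experimental run takes longer.

```python
def parse_time(text):
    """Convert a `time`-style value such as '0m3.875s' to seconds."""
    minutes, rest = text.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


def cpu_fraction(real, user, sys_time):
    """Share of wall-clock time that was spent using the CPU."""
    cpu_seconds = parse_time(user) + parse_time(sys_time)
    return cpu_seconds / parse_time(real)


# Figures copied from the two runs quoted above.
production = cpu_fraction("0m3.875s", "0m3.280s", "0m0.310s")
experimental = cpu_fraction("0m4.549s", "0m0.230s", "0m0.080s")

if __name__ == "__main__":
    print("production:   %.2f" % production)
    print("experimental: %.2f" % experimental)
```

By this measure the production run is on the CPU for roughly 93% of its wall-clock time and the experimental run for roughly 7%, so the experimental engine spends most of its time waiting rather than computing.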

Original comment by jklan...@gmail.com on 25 Oct 2010 at 5:30

GoogleCodeExporter commented 8 years ago
I hope that will improve things. Btw, it was my understanding that worker 0, 
the main server, starts games / bots a bit differently from the others. Perhaps 
it is just slower in starting everything up and my bot really does time out, 
but only on worker 0?

Here's an update on my data:
My bot has now played 40 games; 12 games failed to start.
All games on worker 0 failed (11x) + 1 game failed on worker 64.
Can't my bot unsubscribe from worker 0? :-P

Original comment by tjverw...@gmail.com on 26 Oct 2010 at 8:42

GoogleCodeExporter commented 8 years ago
Attached is an experimental new game engine.  It offers:

* approximately double the overall game throughput
* correct detection of multiple different types of bot errors
* pluggable I/O multiplexors
* a full test suite covering failure cases as well as compatibility with the 
existing engine
* a compatible API to the existing engine (play_game())

The README.txt file contains more information.  The README file also discusses 
two possible security issues in the existing code, so even if it is unlikely 
that a new engine will be adopted at this time, at least those issues should be 
considered.

Original comment by jklan...@gmail.com on 28 Oct 2010 at 12:25

Attachments:

GoogleCodeExporter commented 8 years ago
I discovered that the library I'm using tries to create a Java Thread. Since I 
don't need multi threading, I refactored the code and removed that 
dependency/requirement. My bot seems to be working on worker 0 now.

Original comment by tjverw...@gmail.com on 28 Oct 2010 at 11:27