riddlesio / issues

Bug reports and starter bot submissions for the Riddles.io platform
0 stars 0 forks source link

Unexplainable timeouts on server #37

Closed pepijno closed 7 years ago

pepijno commented 7 years ago

On the server, my bot suffers from a lot of timeouts which just seem to happen at random at a lot of games (sometimes, even after the first move as is the case with https://starapple.riddles.io/competitions/light-riders/matches/43551772-12be-4e23-8cf3-fceedce1db59/dump). However, when using the jar of the game engine provided on the github on my own machine, the bot never has a timeout. Am I correct in assuming this is an issue related with the server?

JimVanEeden commented 7 years ago

I'm actually not sure what causes this. Might be your bot as well, as I don't immediately see any other bots with this problem. On the other hand, I can imagine the server causing this as well. We'll investigate.

JimVanEeden commented 7 years ago

Are you using threading in your bot by any chance?

pepijno commented 7 years ago

No, I'm not using any threading whatsoever.

davidlindenbaum commented 7 years ago

I seem to be having the same problem. Before today, I hadn't updated my bot for a couple weeks. My bot hasn't had a match go over 20 rounds in the last 3 days. Looking at my match log, 3 days ago there's a very clear boundary where my bot goes from decently long matches to never more that 20. But, even before that boundary I had a lot of timeouts, and I don't remember seeing any timeouts when I was actively watching my bot a couple weeks ago. Something definitely changed, probably twice, and it's not my bot.

My bot is in Java, it's single threaded, and I'm getting the timebank from the engine and using System.currentTimeMillis() to measure time from the moment I get the timebank. I never see timeouts running the engine locally, and when there isn't a timeout online my bot seems to agree with the engine on how much has elapsed. https://starapple.riddles.io/competitions/light-riders/matches/06ee8d96-7fb0-42fe-8281-ea168f468be7/dump is an example where I have my bot printing the remaining timebank at a frequent interval, and it's checking the time remaining even more often than that. I am quite sure that there is no code path which fails to hit that print/check very frequently (definitely no more than a millisecond between checks).

davidlindenbaum commented 7 years ago

Alright, I think I found the problem. It may be timing out while the Java garbage collector runs. I'm using significant amounts of memory and the GC actually takes a substantial amount of time, so that's why I'm timing out before the program notices it's low on time. Calling System.gc() periodically seems to fix it, though that slows down my program by quite a lot.

So I'm still pretty sure I'm right that something changed a few days ago. Did you reduce the amount of memory that's available to us or something? Can you give us more details about JVM parameters and such?

Now I have to reconsider my whole architecture (and programming language choice) with GC/memory usage in mind :(.

rogster001 commented 7 years ago

Somebody has posted in discussions on the site saying it is not just java bots - I am not sure what their evidence is though. I am incilned to believe it is GC / jvm resource related and will only be java. Can we please have some info on the parameters or some guidance as to what is being done / what is occuring if not GC. We need to tune our code to accomodate or wait for a fix whilst losing points!

dhensen commented 7 years ago

How much total memory is your java bot using while running locally?

rogster001 commented 7 years ago

It isn't java related, there is evidence from eg C++ of the same thing, posted in the site discussion

JimVanEeden commented 7 years ago

So we have most likely found the problem, which started showing up when we made some changes on the servers that run the matches. Basically, bots the use a lot of memory go over the limit (which is unintentionally lower now) and they are shut down. This also explains why calling the java garbage collector might help.

As a very quick fix, we now run Java bots with the option -Xmx256m. This prevents them from using too much memory and then being shut down. As a general note: if you have this problem, try to lower your bot's memory usage. We can't enforce this in all languages like we did with Java. I'm going to try to get a warning in the bot's error log if it uses too much memory, so it doesn't seem like a random timeout anymore.

We're going to reconfigure the servers now in the way we originally intended, so bots have more memory available. We're also still working on creating separate environments for each bot to run in, instead off all bots + the engine in the same environment. We expect to finish this in about a month.

Hopefully I've explained it in a satisfying way to you guys. We're happy to hear any questions, concerns, and especially suggestions.

rogster001 commented 7 years ago

Hi Thanks for the update, could you please let me know what the Xmx was before you set to 256?

JimVanEeden commented 7 years ago

We hadn't set that option at all before. It just ran with java -jar, now with java -jar -Xmx256m

rogster001 commented 7 years ago

But… would it not have had like 25% of OS ram for heap without any Xmx flag set? So it is now less allocation at 256?

Subject: Re: [riddlesio/issues] Unexplainable timeouts on server (#37)

We hadn't set that option at all before. It just ran with java -jar, now with java -jar -Xmx256m

pepijno commented 7 years ago

For me, the mention of memory limits got me debugging my code and I discovered a nasty memory leak. So I fixed the memory leak and now I don't receive the timeouts anymore (only tested a few games so far).

However, if I can make some suggestions, maybe have a fixed memory limit per bot and log an out of memory error in the logs of a game instead of a simple timeout, that would help a lot.

JimVanEeden commented 7 years ago

@rogster001 I'm not sure about that, but what I do know is that the total for everything is 500Mb memory.

@pepijno Yes we're going try and get that in. Just timing out without any indication is kind of crappy.

baris-sim-ci commented 7 years ago

Ah that could explain my c++ bot's weird problem. My old version works without any timeouts but new optimized and faster version always timeouts around turn 20. I do not have any memory leaks in the code according to valgrind and my code doesn't get terminated by sigsegv or sigabort in the server, it just stops. The difference between the versions is that the new one evaluates 10x more nodes and with all the node caching i do it is possible that it is running out of memory. It would be really helpful to have a message saying your bot ran out of memory. I will try to test without caching to see if it solves my timeouts.

nikovanmeurs commented 7 years ago

@pepijno The fixed memory limit is on the todo list. However, we can only do so after we split up resource allocation for the separate processes. Right now it's shared between the engine and the bots. We hope to have the new sandbox with the separated resource allocation live at the beginning of September.

Regarding the memory limitations:

rogster001 commented 7 years ago

I have had a look at what my bot uses, it is very confusing through the IDE console as its shows much higher numbers, presumably as it is spawned from within the runtime of the IDE or something. Anyhow running from command line I get much, much lower figures. So by that token I can only imagine that some timeouts were occurring where the opponent bot was eating up a big chunk of the shared resource?? Its kind of hard to be certain due to the two different pictures I am getting but I have to be confident of the stand-alone from command line being correct I suppose.

nikovanmeurs commented 7 years ago

@rogster001 Yes, the shared resources can certainly be the cause of one bot consuming the resources at the cost of the second. This is exactly why we're currently working on improving the sandbox.

pepijno commented 7 years ago

What is the status of improving the sandbox? My bot still receives a lot of timeouts which doesn't happen on my own machine.

JimVanEeden commented 7 years ago

@pepijno You should get memory warnings now right? The improved sandbox turned out to be an even bigger project than we already anticipated. We've started working on it, but it might be a couple of months until it's finished. Sorry that it's taking long, but this is something we really want to do correctly.

pepijno commented 7 years ago

I did got the memory warnings, I resolved those warnings but it did not resolve the issues. I understand the sandbox takes some time to implement so do you have some suggestions for me to resolve the issues related to time with my bot?

JimVanEeden commented 7 years ago

Well that's probably very specific to your bot, so hard to say. In theory it shouldn't run anything differently than locally, so the only thing I can think of is that your computer is faster than our servers. Meaning that making your bot as efficient as possible might solve the problem.

rogster001 commented 7 years ago

I agree - with JimVanEden - Your local machine is probably a bit faster. I have seen this myself when uploading a bot which was just within limits when operating test matches on my computer. On the game server it timed out a lot. I would calibrate your timing expectations with this in mind.

There was an actual issue with timeout was when one or both bots was consuming all the runtime memory and you would suddenly see timeouts from nowhere. That's been sorted for some time.

pepijno commented 7 years ago

I believe making my bot more efficient wouldn't solve anything, as at the beginning of each turn the bot fixes an amount of time to run and then simply stops and returns a move after this time has elapsed.

Though the sudden time-outs are indeed solved, the problem is that the elapsed time reported by my bot and the game engine are not the same. For example, my bot reports an elapsed time of 200ms and the game engine deducts 800ms from the time bank. This mainly is a problem when the time bank is nearly empty. For example, if the time bank has 500ms left, even an allocation of 150ms could result in a time-out. Also the elapsed time is not even consistent within the same game, meaning that sometimes 200ms from the results in 400ms in the time bank and the next move 200ms results in 600ms in the time bank.

I do believe it is specific to my bot but I have no clue as how to fix it since I just use c++'s clock to get the time. Maybe I'm just missing something very obvious but right now I cannot pinpoint the cause of the problem.

JimVanEeden commented 7 years ago

It's very hard for me to find the problem as well. If you look at the match-wrapper implementation (on our Github page), you can see the way we measure the elapsed time. Basically: start timer -> ask response -> stop timer. So no clue where that huge time difference comes from.

rogster001 commented 7 years ago

How do you mean 'C++ clock?' Can you not use a System(..) call? presumably for the immediate future of this game it will continue to be hosted on the same OS

nikovanmeurs commented 7 years ago

The use of c++ clock explains the difference. It is important to note that clock and Java's System.currentTimeMillis are not the same at all. See this Stack Overflow thread for more info: https://stackoverflow.com/questions/35811957/time-difference-between-clock-time-in-c-and-system-currenttimemillis-in-ja.

JimVanEeden commented 7 years ago

Based on that it seems I should change the implementation in the match-wrapper as well, to System.nanoTime()

nikovanmeurs commented 7 years ago

The update to the match-wrapper has just been deployed, please let us know if this solved the timing inconsistencies.

pepijno commented 7 years ago

The last couple of matches didn't show any inconsistencies in the timing anymore, thanks for all the help. If the problem arises again in the future I will let you know.