Finding Crash Issue - Githubissues

toddw123 commented 7 years ago

So I've been trying to locate the cause of the random crash. It tends to happen within 15-25 minutes when running on Linux, but on Windows it can run a lot longer before it crashes.

I just switched my Visual Studio build configuration from "Release" to "Debug" and ran the program. It crashed fairly quickly, and the debugger claims the cause has something to do with lists.

To be more specific, instead of pointing to a particular source file from my project that caused the problem (as you would hope), it opened up "list" (the standard C++ header) and claims this is where it crashed:

    _Myiter& operator--()
        {   // predecrement
        _Ptr = _Mylist::_Prevnode(_Ptr);
        return (*this);
        }

It points to the _Ptr = _Mylist::_Prevnode(_Ptr); line as the line that caused the crash.

What I take away from this is that somewhere a list (or an array, or some other container) is being used without its size or bounds being checked properly.
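One common way to crash inside operator-- like this is decrementing an iterator that already equals begin(), or one that was invalidated by an erase elsewhere. That is only an assumption here, since the thread never pins down the actual call site. A minimal sketch of a guarded step back, where step_back is a hypothetical helper and not something from the project:

```cpp
#include <list>

// Hypothetical helper: move `it` one element back, but only if that is legal.
// Decrementing begin() is undefined behaviour, so the helper refuses instead.
template <typename T>
bool step_back(const std::list<T>& items,
               typename std::list<T>::const_iterator& it)
{
    if (it == items.begin())
        return false;  // nothing before the first element (or the list is empty)
    --it;
    return true;
}
```

A guard like this only covers the begin() case. It does nothing for an iterator that was invalidated because another thread modified the list.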

If anyone is able to help figure this out, that would be awesome.

VoOoLoX commented 7 years ago

I can run the bot for about an hour or so (haven't tried running it for longer) without any crashes (only 1 account). There's an occasional disconnect, but that's rare as well. Just started the bot in GDB, will leave it running for a few hours to see if it crashes.

toddw123 commented 7 years ago

I'm running 22 at once. 1 bot should be fine, I'd think. But the more accounts you use, the more it stresses everything, and anything that is set up incorrectly will most likely come to light. So hopefully I can figure out what is causing this.

Zeroeh commented 7 years ago

I ran 70 bots for 2 hours with only 1 disconnect. Granted, it was on an older release, but if it's an issue with lists I don't think the version makes a difference.

toddw123 commented 7 years ago

@Zeroeh it could be that it was an older version. Did it have the path-finding added to it like the current version? I believe the path-finding is part of the problem, so I might look for a better one.

Also, there has to be a reason for the disconnects... I know that the code has significantly reduced the frequency of the bots DC'ing, but something still has to be causing it. No idea what, but my reference point would be to open the normal game client and walk around the nexus for 60 minutes. You won't get disconnected. On the bots, though, you might. I say might because it's extremely random (from what I can tell) and there is no specific time they run for before getting DC'ed. I've had some bots run for 20 minutes before a DC, I've had others run for 300 minutes before a DC, and everything in between. Sometimes they go and go and go, other times something clearly happens that messes them up. Really wish I could figure out what it was.

Zeroeh commented 7 years ago

The release I was using was the one right before the packethandlers were converted to methods instead of the else-ifs, so it wouldn't have any of the path finding code.
With the random disconnects I'm fairly certain it's probably a timing issue with either the sockets or client itself. I've noticed that it happens more often when moving than not moving.

toddw123 commented 7 years ago

> With the random disconnects I'm fairly certain it's probably a timing issue with either the sockets or client itself.

Hmm, so I've been thinking about this. When you first consider something like that, it makes total sense that it could be a possibility, because timing is important. But once you look at how the sockets and TCP actually work, you realize that it makes 0 sense.

Because the connection is TCP, the server might have a queue of packets ready for the client, but until the client actually bothers to read them with recv(), they just sit in the socket buffers. As far as the client code is concerned, they haven't arrived yet. That is why the client is designed to reply to a packet before the next one is read. For example, each Update requires the client to send back an UpdateAck. I could read in an Update packet and then wait a few seconds before I send the UpdateAck. As long as I don't continue to read packets while I wait, it won't be a problem. But if I get an Update packet and read in another Update packet before sending an UpdateAck, then the server will kick the client for failing to reply to the Update.

So timing doesn't make a whole lot of sense, because all the packets that require an immediate response are set up to get that response before another packet is received.

Now maybe you're talking about the timing of the speed? As in, the "elapsed" time that is being used to calculate the move speed is wrong? This could be true, but I don't believe it is, as I have tested it and the bots run exactly as fast as a player with equal speed.

With all that taken into account, I guess I can't honestly say I know it's not a timing issue, but it doesn't make much sense if it is.
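The read-then-reply ordering described above can be sketched without any real sockets. Everything below is a hypothetical stand-in (FakeServer, PacketType, run_client are not the project's classes), and the kick rule is a simplified model of what the server is described as doing:

```cpp
#include <deque>
#include <utility>

// Hypothetical packet kinds, just enough for the illustration.
enum class PacketType { Update, UpdateAck, Other };

// Simplified model: the client gets kicked if it reads a second Update
// before it has acknowledged the previous one.
class FakeServer {
public:
    explicit FakeServer(std::deque<PacketType> outgoing)
        : outgoing_(std::move(outgoing)) {}

    bool has_packets() const { return !outgoing_.empty(); }
    bool kicked() const { return kicked_; }

    // Stand-in for the client calling recv(): hands over the next queued packet.
    PacketType read_next() {
        PacketType p = outgoing_.front();
        outgoing_.pop_front();
        if (p == PacketType::Update) {
            if (awaiting_ack_) kicked_ = true;  // previous Update never got its ack
            awaiting_ack_ = true;
        }
        return p;
    }

    // Stand-in for the client calling send().
    void receive_from_client(PacketType p) {
        if (p == PacketType::UpdateAck) awaiting_ack_ = false;
    }

private:
    std::deque<PacketType> outgoing_;
    bool awaiting_ack_ = false;
    bool kicked_ = false;
};

// Handle one packet completely, reply included, before reading the next one.
// Returns true if the client survived the whole queue.
inline bool run_client(FakeServer& server, bool reply_to_updates) {
    while (server.has_packets() && !server.kicked()) {
        PacketType p = server.read_next();
        if (p == PacketType::Update && reply_to_updates)
            server.receive_from_client(PacketType::UpdateAck);
    }
    return !server.kicked();
}
```

In this model a pause between read_next() and the reply changes nothing, because no other packet is read in the meantime. Only skipping the reply gets the client kicked.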

Also, if you update to the newest version, about 4 or 5 commits ago I added a logger for the last 2 packets sent. So anytime a bot is DC'ed, I have it output the last 2 packets the bot sent. If it was related to moving, the last packet sent would (or should) always be a Move packet. But I don't see the Move packet in the last two sent any more than I see any other packet.

So yeah, I'm still very confused about what is causing the bots to DC randomly. It's not too big a deal since most of the time they go for a while, but there is still something happening that causes that DC. And there is something that causes a crash on Windows after a long, long time (makes me think it's a memory issue), and it crashes on Linux pretty quickly. So there's definitely something wrong somewhere.
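A "last 2 packets sent" logger can be as small as a two-slot ring buffer. This is only a sketch of the idea. LastSentLog and its methods are made-up names, and the project's real logger may look nothing like this:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: remembers the names of the last two packets sent.
class LastSentLog {
public:
    // Call this every time a packet goes out.
    void record(const std::string& packet_name) {
        slots_[next_] = packet_name;            // overwrite the oldest slot
        next_ = (next_ + 1) % slots_.size();
        if (count_ < slots_.size()) ++count_;
    }

    // Returns the remembered packet names, oldest first, comma separated.
    std::string describe() const {
        std::string out;
        // Once the buffer is full, the oldest entry sits at next_; before that, at 0.
        std::size_t start = (count_ == slots_.size()) ? next_ : 0;
        for (std::size_t i = 0; i < count_; ++i) {
            if (i > 0) out += ", ";
            out += slots_[(start + i) % slots_.size()];
        }
        return out;
    }

private:
    std::array<std::string, 2> slots_;
    std::size_t next_ = 0;
    std::size_t count_ = 0;
};
```

On a disconnect, each bot thread would print describe() for its own log. If several threads shared one log, it would need a mutex.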

VoOoLoX commented 7 years ago

After running 30 bots for about an hour it crashed. I was running it in GDB and got the following error:

    Thread 1 "Clientless" received signal SIGSEGV, Segmentation fault.
    0x000000000042c0d2 in void __gnu_cxx::new_allocator<std::pair<int const, int> >::construct<std::pair<int const, int>, std::pair<int const, int> const&>(std::pair<int const, int>*, std::pair<int const, int> const&) ()

Segmentation faults occur when a program tries to access memory it's not allowed to. From the error message I found: https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html (I haven't read the whole thing, might do it later, way too lazy now)

Seems to be a problem with memory allocation, since the crash is inside the allocator (new_allocator::construct). Hope this is somewhat useful.

Zeroeh commented 7 years ago

A couple days ago I got access to a network with 150MBps up and down so I was able to test 200 bots. I had them sit still for an hour and nothing (just abnormally high cpu usage after the 160th bot loaded). I then loaded them while moving in a circle. Cpu jumped to max (not sure if this matters) and they all crashed after a minute of moving. What was really interesting was that the first ones to load in were the first ones to crash and it went in order.

toddw123 commented 7 years ago

> were the first ones to crash and it went in order.

Are you using the newest code? Also, that's extremely unusual, so I wouldn't be surprised if it just has something to do with running 200 at once. Could be that whatever you were running it on couldn't or wouldn't allow that many threads for one process, so it killed the threads off one by one, oldest first. I say this because for me, they DC in completely random order when running 22. Some will DC more than others, some won't DC at all.

Although I'm really looking for why the program crashes more than why the bots DC, my suspicion is the two are related.

Thanks for the info VoOoLoX, I've been assuming a memory issue (allocation) for a while now, and it's what Linux would spit out when the program crashed as well. I just don't know where the allocation that's causing it is.

edit: my guess is that it's possibly the way I'm storing mapTiles in the client class. Using std::unordered_map<int, std::unordered_map<int, int>> mapTiles; could be causing issues when setting a specific tile. It would kind of make sense, because the error you showed mentions std::pair<int const, int>, which is the key/value entry type that an unordered_map<int, int> stores. I'll probably change the tiles to work like I have them in the path finding, where it's a single array and tiles are stored as someArray[y * mapWidth + x].

Zeroeh commented 7 years ago

Ahh. I haven't had the program literally crash on me before O_O. Maybe it has to do with the pathfinding stuff you recently added?

toddw123 commented 7 years ago

> Ahh. I haven't had the program literally crash on me before O_O. Maybe it has to do with the pathfinding stuff you recently added?

Nah, it still occurred before that. Like I said though, on Windows it can be very random. I've had the program run for 2 days before it crashed on me. On the other hand, I've had it crash on me within an hour. It's very strange on Windows, and my guess is that Windows has some kind of leeway with the allocation problem.

Zeroeh commented 7 years ago

One thing I DID notice is a huge allocation of memory (completely random). Some days it would only use the usual 4.6ish MB while other days it went up to 460+MB. Never bothered debugging it, but if it's segmentation faults and random memory allocations then it's probably a loose pointer somewhere.

VoOoLoX commented 7 years ago

After some more debugging, it seems like the error I showed occurs in the main thread (not 100% sure about that). Also, it looks like not all client threads were running when the error occurred, which leads me to believe some of the client threads got destroyed.

toddw123 commented 7 years ago

I've never noticed a client thread destroyed, as in I can always log in to a server and see my bot on that server. Unless you are talking about when the program crashed, because I have noticed that I will get the "program failed" popup on Windows but will still see some output in the console from other threads.

My guess is it's most likely something to do with the way I'm storing the map in the client. I plan to change that, I just haven't gotten around to doing any work on the client for a week or so.

VoOoLoX commented 7 years ago

Take a look at this: https://pastebin.com/uKSpFZ7j It's a list of threads before and after the crash, and it shows the currently running function of each thread. So I guess I was wrong, none of the threads are getting destroyed, but in this case 2 of the threads are messed up.

toddw123 commented 7 years ago

Yeah, looks exactly like I said: some of them are having a problem with the unordered_map<int, unordered_map<int,int>> used to store the map tile types in the client.

toddw123 commented 7 years ago

I just modified the way the tiles are stored in the client class/threads.

Before, you would access a specific tile using this->mapTile[x][y]. Now, you access a specific tile using this->mapTile[y * this->mapWidth + x]. Instead of using std::unordered_map<int, std::unordered_map<int, int>> it is now just std::vector<int>. Hopefully this solves the problem. I'm about to pull the changes down to my Linux VM and give it a try.

edit: the code compiles on both Windows and Linux no problem. Since the bots would always crash faster on Linux than on Windows, I am testing it on Linux right now. So far so good! I'm going to let it run for 20-30 minutes on Linux and see if it crashes at all. It used to crash pretty quickly before, so I can already say it's looking better than before.

edit2: so I can't decide if it crashed due to an error... or if it crashed because I stopped paying attention to it and the OS went into one of those modes that's kinda like logging the user out but not quite. All I know is, it was running for about 20 minutes no problem, then I went to do other work and minimized the VM, came back to it just now and had to log the user back in, and once I did it showed the program had crashed. Lol oh well. Guess I'll try again later.
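For reference, the flat layout boils down to a row-major index into one vector. A minimal sketch, assuming the map width and height are known up front. TileGrid and its members are illustrative names, not the project's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around a flat, row-major tile array.
class TileGrid {
public:
    TileGrid(int width, int height)
        : width_(width), height_(height),
          tiles_(static_cast<std::size_t>(width) * static_cast<std::size_t>(height), 0) {}

    bool in_bounds(int x, int y) const {
        return x >= 0 && y >= 0 && x < width_ && y < height_;
    }

    // Row-major index: row y starts at y * width_, then x steps along the row.
    std::size_t index(int x, int y) const {
        assert(in_bounds(x, y));
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width_)
             + static_cast<std::size_t>(x);
    }

    int get(int x, int y) const { return tiles_[index(x, y)]; }
    void set(int x, int y, int type) { tiles_[index(x, y)] = type; }

private:
    int width_;
    int height_;
    std::vector<int> tiles_;  // all tiles allocated once, in the constructor
};
```

One difference from the nested map is that reading a tile never allocates. With std::unordered_map, operator[] inserts a default entry when the key is missing, so even a lookup can modify the container. Whether that was the actual cause of the crash is not established in this thread.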

VoOoLoX commented 7 years ago

I've been running it since you pushed the update and it's still running no problem. Seems good so far.

toddw123 commented 7 years ago

> I've been running it since you pushed the update and it's still running no problem. Seems good so far.

Awesome!

I'm pretty sure the reason mine crashed was because the user account got logged out/suspended like I said. I don't think it crashed on me because of a code problem lol. I'll have to run a test on Windows now and make sure it doesn't crash there either.

Thanks for all the help and testing so far.

DavidK1m commented 7 years ago

Works well on windows too without crashing!

toddw123 commented 7 years ago

Yeah, on Windows it can take a long, long time before it crashes. Well, depending on how you run it. Running in debug mode used to crash pretty quickly, but release mode wouldn't.

But thanks for checking, both of you. I haven't bothered to check if the mem usage is fixed now. I assume that was the cause of the massive spike in memory sometimes.

edit: I've only been monitoring it for a little bit, but the CPU/mem usage is definitely down. The CPU used to spike 50%+ on me, and now it's back down to anywhere from 0.5%-10%. Much, much better there. Memory might still have some leaks, but it looks better for now, on my system at least.

toddw123 commented 7 years ago

Closing this because the crash issue seems to be fixed. I'm opening another issue, though, for discussing the random disconnects.

toddw123 / RotMG_Clientless

Finding Crash Issue #47