coral_0.8.0-beta1: Random crash after changing config files

mrindar commented 7 years ago

I don't have enough information about this issue yet, but since I started working with coral_0.8.0-beta1, coralslave randomly crashes at startup after I have made changes to .execonf or .sysconf.

This is a typical scenario:

1. coralslaveprovider /path/to/fmus -o out/
2. coralmaster run .execonf .sysconf
   - coralslaves spawn, and the simulation continues and terminates normally
3. make a change in .execonf, e.g. stop time
4. coralmaster run .execonf .sysconf
   - a spawned coralslave crashes
   - which spawned coralslave crashes seems random (sorry about that)
5. close all spawned coralslave windows
6. close coralslaveprovider
7. coralslaveprovider --clean-cache
8. repeat 1 and 2
   - coralslaves spawn, and the simulation continues and terminates normally

It should be noted that running coralslaveprovider --clean-cache does not always work; sometimes I have to repeat the process a couple of times until the coralslaves stop crashing. Also, the crash doesn't always happen when a change is made to .execonf or .sysconf, but every time it has happened, it has been after making a change to one of them.

Some more investigation is definitely needed on this issue
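
To gather more data, the manual scenario above could be repeated automatically. The following is only a rough sketch: the helper name `run_repeatedly` is hypothetical, and it assumes that a crashed coralslave eventually makes `coralmaster` exit with a non-zero status, which has not been confirmed in this thread. It also assumes coralslaveprovider is already running.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run `cmd` `runs` times in sequence and return the list of exit codes."""
    codes = []
    for _ in range(runs):
        # subprocess.run waits for the process to finish before returning.
        result = subprocess.run(cmd)
        codes.append(result.returncode)
    return codes


# Hypothetical usage with the command from the scenario above:
#   codes = run_repeatedly(["coralmaster", "run", ".execonf", ".sysconf"], 20)
#   print(sum(1 for c in codes if c != 0), "of", len(codes), "runs failed")
```

On Windows, the crash dialog may block a run until it is dismissed, so the loop would likely still need someone watching it.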

kyllingstad commented 7 years ago

It's hard to say which changes since 0.7 could cause this. The first thing that comes to mind would be the timeout stuff, but that shouldn't be affected by changing configuration files. Actually, I can't think of anything that would cause some kind of memory effect wrt. the configuration files.

Some things you could do which could possibly narrow this down:

Try disabling all timeouts (i.e., set them to -1). If things start to run smoothly then, enable them one at a time and try to observe which one triggers the issue.
Try going back to 0.7.1 for a while and see if the problems occur there too.

mrindar commented 7 years ago

So it appears that changing the configuration files has nothing to do with it. I've run about 10 simulations and 2 of them crashed (with no changes to the configuration files). If I try to start a new simulation after the crash (without running --clean-cache on the slaveprovider), I get a "[warning] GetSlaveTypes request to slave provider 61926fa9-6de0-456c-afa5-69034f0c37c9 failed (timed out)" message followed by an "Error: Slave type not found: SomeFmu" error message. It seems to always be for the same FMU, but that's not necessarily the one that crashes. The coralslave.exe which crashes isn't always for the same FMU.

EDIT: I'm sorry to say, but this isn't consistent either :P

mrindar commented 7 years ago

When I pressed 'Debug' in the coralslave.exe crash window, Visual Studio gave me this message: Unhandled exception at 0x743B170C (mswsock.dll) in coralslave.exe: 0xC0000005: Access violation executing location 0x743B170C. I don't know if it means something.

mrindar commented 7 years ago

I'm curious whether this is similar to the ZeroMQ crash we've struggled with in the past, except that back then the crash took a proper choke hold on the CPU and a hard restart was required. However, that crash also seemed rather random, much like this one, and mswsock.dll sounds like some socket stuff, if I put on my Sherlock Holmes hat.

mrindar commented 7 years ago

Oh! This is interesting. So,

1. I run a simulation
2. A 'coralslave.exe has crashed' window shows up
3. I press 'Debug'
4. Visual Studio fires up and shows this exception: Unhandled exception at 0x743B170C (mswsock.dll) in coralslave.exe: 0xC0000005: Access violation executing location 0x743B170C.
5. I press 'Continue'
6. The coralslave continues, and the simulation starts and finishes normally

mrindar commented 7 years ago

Simulation number	Number of crashed coralslave.exe processes	Pressing 'Continue' after crash worked
1	1	yes
2	1	yes
3	0	N/A
4	0	N/A
5	1	yes
6	1	yes
7	0	N/A
8	1	yes
9	0	N/A
10	0	N/A
11	0	N/A
12	3	yes
13	0	N/A
14	0	N/A
15	0	N/A
16	0	N/A
17	0	N/A
18	0	N/A
19	0	N/A
20	0	N/A

So, not very scientific, but roughly 6 in 20 simulations crash. An interesting one is simulation 12, where 3 coralslaves crashed: the first one crashed, I pressed 'Continue' in the debugger, then another one crashed, and so on. Also, pressing 'Continue' in the debugger when a crash occurs seems to work every time.
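
For reference, the counts in the table can be tallied with a few lines of Python. This is just a sketch; the `results` list is copied by hand from the table above, and the variable names are arbitrary.

```python
# (simulation number, number of crashed coralslave.exe processes),
# copied from the table above.
results = [
    (1, 1), (2, 1), (3, 0), (4, 0), (5, 1),
    (6, 1), (7, 0), (8, 1), (9, 0), (10, 0),
    (11, 0), (12, 3), (13, 0), (14, 0), (15, 0),
    (16, 0), (17, 0), (18, 0), (19, 0), (20, 0),
]

runs_with_crash = sum(1 for _, crashes in results if crashes > 0)
total_crashes = sum(crashes for _, crashes in results)
fraction = runs_with_crash / len(results)

print(f"{runs_with_crash} of {len(results)} runs had at least one crash "
      f"({fraction:.0%}); {total_crashes} slave crashes in total")
```

That gives 6 runs with at least one crash out of 20, and 8 crashed slaves in total.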

kyllingstad commented 7 years ago

I've run your setup 50–60-ish times myself now, using coral 0.8.0-beta1, and it didn't crash once. So this is going to be hard to track down, I guess. You're still on Windows 7?
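
As a rough sanity check on how different the two machines behave: if each run crashed independently with the rate from the table above (6 of 20, an assumption that may well not hold), then 50 crash-free runs in a row would be extremely unlikely. The sketch below only does that arithmetic; the variable names are arbitrary.

```python
# Crash rate per run, taken from the table above (6 of 20 runs crashed).
p_crash = 6 / 20

# Lower end of the "50-60-ish" runs that did not crash.
n_runs = 50

# Probability of seeing zero crashes in n_runs independent runs.
p_zero_crashes = (1 - p_crash) ** n_runs

print(f"P(no crash in {n_runs} runs) = {p_zero_crashes:.1e}")
```

The result is on the order of 1e-8, which suggests the crash depends on something specific to the affected machine or setup rather than on chance alone.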

mrindar commented 7 years ago

Awesome :P Still on Windows 7, yes

viproma / coral

coral_0.8.0-beta1: Random crash after changing config files #29
