multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

"cec" component wrongly shown as "running" the whole time. #263

Closed by DavidPCoster 9 months ago

DavidPCoster commented 11 months ago

The following plot shows that "cec" is "running" the whole time:

[image: timeline plot]

In fact cec is triggered as needed by trigger_cec, as shown by:

[image: ets_expanded_5_auto]

and, in this simulation, it is not triggered at all.

The user, system, and elapsed times for cec are 0.04 s, 0.60 s, and 1437.30 s, respectively.

LourensVeen commented 11 months ago

That's interesting. There's a bug with instances showing as running while they've finished and are waiting to shut down, but I can't immediately think of why an instance that never gets triggered shows as running. I'd expect it to register at the beginning, then spend all its time waiting for a message on F_INIT (which would show as light grey), and then shut down.
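
For context, a minimal sketch of such a never-triggered component against the libmuscle Python API (the port names and the loop body are placeholders, not the actual cec code):

```python
from libmuscle import Instance, Message
from ymmsl import Operator


def cec() -> None:
    # Port names here are placeholders, not the real cec configuration.
    instance = Instance({
            Operator.F_INIT: ['trigger_in'],
            Operator.O_F: ['result_out']})

    # reuse_instance() waits until it either gets an F_INIT message or a
    # ClosePort from the peer; if the peer never triggers this component,
    # essentially the whole run is spent waiting inside this call.
    while instance.reuse_instance():
        msg = instance.receive('trigger_in')

        # ... the actual model step would go here ...

        instance.send('result_out', Message(msg.timestamp))


if __name__ == '__main__':
    cec()
```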

Did this simulation shut down cleanly, or did an instance crash and was it shut down by the manager?

DavidPCoster commented 11 months ago

Ran cleanly ...

DavidPCoster commented 11 months ago

In case it helps, here are the messages from stderr for a very similar case:

cec 2023-08-22 22:48:00.713 INFO: Registered with the manager
cec 2023-08-22 22:48:05.680 INFO: Received peer locations and base settings
cec 2023-08-22 22:48:05.683 INFO: Connecting to peer trigger_cec at [tcp:10.24.54.31:19065,10.28.54.31:19065,130.186.25.53:19065]
cec 2023-08-22 23:09:17.107 INFO: Deregistered from the manager
LourensVeen commented 9 months ago

I found the problem: the profiler doesn't record send and receive events for ClosePort messages, because those messages are invisible to the user. That makes sense, but it means that the RECEIVE_WAIT period for that ClosePort isn't in the database, and the time spent in it is shown as running. If an instance never receives an ordinary message, that wait is all there is, so the model appears to be running constantly, while really it's fast asleep.
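
To make the effect concrete, here is a toy illustration (made-up numbers, not the real plotting code) of how wall-clock time that isn't covered by any recorded event ends up displayed as running:

```python
# Each tuple is (event, start, stop) in seconds; the times are made up
# to roughly resemble the run above.
events = [
    ('REGISTER', 0.0, 0.5),
    # The RECEIVE_WAIT for the final ClosePort was never recorded, so
    # the span from 0.5 to 1437.0 is not covered by any event...
    ('DEREGISTER', 1437.0, 1437.3),
]

covered = sum(stop - start for _, start, stop in events)
total = events[-1][2] - events[0][1]
# ...and the timeline plot attributes it to the instance as running time.
print(f'shown as running: {total - covered:.1f} s')
```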

I added a SHUTDOWN_WAIT event, which records the time spent waiting for the information needed to determine that we're shutting down (if we run again then there will be an ordinary RECEIVE and RECEIVE_WAIT, so SHUTDOWN_WAIT is only at the end of the run when we don't reuse the instance again).

While I was at it, I also added a SHUTDOWN event, which covers shutting down the TCP server (that takes tens of ms and previously showed up as busy). A DISCONNECT_WAIT event had already been added earlier to fix #254; it sits between SHUTDOWN_WAIT and SHUTDOWN and covers the time spent waiting for the instances we send messages to to pick up their ClosePort messages. There are still a few small gaps in the graph, but with these changes the last time the model is shown as running is at the start of its final call to reuse_instance().
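
As a summary of that ordering (just an illustration of the sequence described above, not code from the library), the tail end of an instance's timeline now looks like this:

```python
from enum import Enum, auto


class ShutdownPhase(Enum):
    # Names follow the events described above; the enum itself is only
    # an illustration of their order.
    SHUTDOWN_WAIT = auto()    # waiting to learn that we are shutting down
    DISCONNECT_WAIT = auto()  # waiting for receivers to pick up ClosePort (#254)
    SHUTDOWN = auto()         # shutting down the TCP server (tens of ms)


# Order of the phases at the end of the final reuse_instance() call:
FINAL_SEQUENCE = [
        ShutdownPhase.SHUTDOWN_WAIT,
        ShutdownPhase.DISCONNECT_WAIT,
        ShutdownPhase.SHUTDOWN]
```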

DavidPCoster commented 9 months ago

Thanks!

LourensVeen commented 9 months ago

Fix released in version 0.7.1; please reopen if you encounter further issues.