Why does `init` show as active the whole time?

It basically has a `reuse_instance()` loop as its pattern: it reads some data, processes it, sends it to the main workflow, and then goes back to `reuse_instance()` (see the sketch below).

Or should I find a way of shutting it down once it has finished sending?
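For illustration, a minimal sketch of the pattern (the port name, the data loading, and the timestamp are placeholders, not my actual code):

```python
from libmuscle import Instance, Message
from ymmsl import Operator


def load_input_data():
    # Placeholder: in reality this reads and processes the input files.
    return {'profiles': [1.0, 2.0, 3.0]}


def init() -> None:
    # One O_F port to send the prepared data to the rest of the workflow.
    instance = Instance({Operator.O_F: ['final_state_out']})

    while instance.reuse_instance():
        data = load_input_data()
        instance.send('final_state_out', Message(0.0, data=data))
        # With no incoming F_INIT ports, reuse_instance() returns False
        # on the next call and the instance deregisters and shuts down.


if __name__ == '__main__':
    init()
```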
Ah, no, this is a plotting artifact. I just didn't consider this case.
What's happening is that the processing isn't done by libmuscle, so libmuscle doesn't know that it's happening. All I have is data about communication and waiting. So the plot is drawn by first drawing a dark grey bar covering the whole run time, and then drawing communication and waiting events on top of it. The bug is that I'm drawing every grey bar from the start of the whole simulation to the end, rather than from the start to the end of the run of that specific instance.
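Schematically, the drawing logic is something like this (a sketch, not libmuscle's actual plotting code; the instance attributes are hypothetical):

```python
import matplotlib.pyplot as plt


def plot_timeline(instances, sim_start, sim_stop):
    """instances: objects with .start, .stop and .events, where .events
    is a list of (start, stop, colour) tuples. All hypothetical."""
    _, ax = plt.subplots()
    for row, instance in enumerate(instances):
        # Buggy version: the grey background bar spans the whole simulation.
        ax.broken_barh([(sim_start, sim_stop - sim_start)], (row, 0.8),
                       facecolors='darkgrey')
        # Correct version: span only this instance's own run.
        # ax.broken_barh([(instance.start, instance.stop - instance.start)],
        #                (row, 0.8), facecolors='darkgrey')

        # Communication and waiting events are drawn on top.
        for start, stop, colour in instance.events:
            ax.broken_barh([(start, stop - start)], (row, 0.8),
                           facecolors=colour)
    plt.show()
```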
So what actually happens here is that `init` starts up, registers, runs, sends its result, and then deregisters and shuts down (assuming it has no incoming F_INIT ports), while the rest of the simulation continues. It just gets plotted incorrectly. Will fix, thanks!
Hold on, I'm wrong. It's not a plotting issue.
I've been looking into this a bit, and it's not a plotting problem; it's due to how the shutdown algorithm of MUSCLE3 works. What happens is that `init` starts up, sends its output, and then shuts down, and as part of shutting down it closes its outgoing ports. This entails sending a special `ClosePort` message on each port, signalling that the port will be closed and that no more messages will be coming. This lets the instances connected to those ports know that they won't be getting any more inputs, and can therefore shut down too.
However, this notification is really just a message, and messages are sent by putting them in a local queue and waiting for the recipient to pick them up. So that's what we do here too. The recipient, however, is busy calculating things, and won't be receiving that notification until the next time it calls `reuse_instance()`, which is all the way at the end of the simulation. And so `init` continues to be active until the instances it sends to have completed their current reuse iteration.
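Schematically, the effect is like this (a toy model using Python's standard `queue`, not libmuscle's actual messaging code):

```python
import queue
import threading
import time

outbox: queue.Queue = queue.Queue()


def sender() -> None:
    outbox.put('result')     # the actual output message
    outbox.put('ClosePort')  # the close notification is just another message
    outbox.join()            # cannot shut down until everything is picked up


def receiver() -> None:
    time.sleep(5)  # busy computing; not receiving anything yet
    while True:
        message = outbox.get()  # messages are only retrieved here
        outbox.task_done()
        if message == 'ClosePort':
            return


threading.Thread(target=receiver).start()
sender()  # returns only after ~5 s, although its own work was instant
```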
Shutting down a MUSCLE3 simulation cleanly is nontrivial, and I don't think the algorithm needs to be changed, but it may be possible to deregister from the manager before closing the ports, which would fix the statistics. Or we could add a separate profiling event type to record the moment we start shutting down, and use that in the statistics; that would also fix it.
If it helps, this is what I find in `muscle3_manager.log`:
```
<g2dpc@s53 ~/GIT/ets_paf>egrep 'instance init|^init' $RUN/muscle3_manager.log
muscle_manager 2023-07-06 22:39:10,374 INFO libmuscle.manager.mmp_server: Registered instance init
init 2023-07-06 22:39:15,695 DEBUG init: Sending message on final_equilibrium_out
init 2023-07-06 22:39:19,669 DEBUG init: Sending message on final_core_profiles_out
init 2023-07-06 22:39:36,996 DEBUG init: Sending message on final_core_transport_out
init 2023-07-06 22:39:43,832 DEBUG init: Sending message on final_core_sources_out
init 2023-07-06 22:39:47,220 DEBUG init: Sending message on final_transport_solver_numerics_out
init 2023-07-06 22:39:47,222 DEBUG init: Closing port final_equilibrium_out
init 2023-07-06 22:39:47,222 DEBUG init: Sending message on final_equilibrium_out
init 2023-07-06 22:39:47,222 DEBUG init: Closing port final_core_profiles_out
init 2023-07-06 22:39:47,223 DEBUG init: Sending message on final_core_profiles_out
init 2023-07-06 22:39:47,223 DEBUG init: Closing port final_core_transport_out
init 2023-07-06 22:39:47,224 DEBUG init: Sending message on final_core_transport_out
init 2023-07-06 22:39:47,224 DEBUG init: Closing port final_core_sources_out
init 2023-07-06 22:39:47,224 DEBUG init: Sending message on final_core_sources_out
init 2023-07-06 22:39:47,225 DEBUG init: Closing port final_transport_solver_numerics_out
init 2023-07-06 22:39:47,225 DEBUG init: Sending message on final_transport_solver_numerics_out
muscle_manager 2023-07-07 00:40:54,757 INFO libmuscle.manager.mmp_server: Deregistered instance init
<g2dpc@s53 ~/GIT/ets_paf>
```
Yeah, that makes sense: it's done at 22:39:47,222, then it sends all the `ClosePort` messages, and at that point it's really done; it just needs to wait for those messages to be retrieved by their recipients. So I think I'll add a `DISCONNECT_WAIT` event that measures from the point where it starts shutting down at 22:39:47,222 to the point where all the outboxes are empty and we can really shut down at 00:40:54. Then I can take the start timestamp of that event as the end of the run for profiling purposes, and everything will make sense.
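On the plotting side that would look something like this (a hypothetical sketch; `DISCONNECT_WAIT` is the proposed event type and the record layout is made up):

```python
# Hypothetical per-instance profiling records: (event_type, start, stop).
Event = tuple[str, float, float]


def run_end_time(events: list[Event], deregister_time: float) -> float:
    """End of an instance's run, for drawing the grey background bar."""
    for event_type, start, _stop in events:
        if event_type == 'DISCONNECT_WAIT':
            # The instance was effectively done the moment it started
            # waiting for its outboxes to drain.
            return start
    # No DISCONNECT_WAIT recorded: fall back to deregistration time.
    return deregister_time
```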
Fix released with version 0.7.1; please reopen if you encounter further issues.