DavidPCoster opened this issue 1 year ago (status: Open)
Oh wow, I haven't seen this one before. The load balancer is receiving a message and tries to log that, and that log message is sent to the manager (I guess the remote log level is set to DEBUG?). That's a simple RPC call, and the manager should return a simple "everything okay" acknowledgement in return, but evidently there's extra data being sent and msgpack is getting confused over that.
I'll have a look at the source code and see if I can figure out how that could happen. Some kind of race condition sounds like a possibility, if somehow there's another thread communicating with the manager at the same time over the same socket. It could be interference with the profiling information being uploaded, but I thought I'd put some mutexes in there to keep this from happening. I'll get back to it.
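To make that failure mode concrete: if a second thread sends its own request on the same socket before the first thread has read its reply, a single read can return two encoded replies back to back, and msgpack then refuses to decode the buffer as one object. A minimal sketch of just the decoding step (the dict payload here is made up, it's not the actual MMP reply format):

    import msgpack

    # Two separately encoded replies, standing in for two RPC acknowledgements
    # (illustrative payload, not the real MMP reply format).
    reply_1 = msgpack.packb({'status': 'ok'})
    reply_2 = msgpack.packb({'status': 'ok'})

    # If both replies are sitting in the receive buffer, one read can return
    # them concatenated, and decoding the buffer as a single object fails.
    buffer = reply_1 + reply_2

    try:
        msgpack.unpackb(buffer, raw=False)
    except msgpack.exceptions.ExtraData as exc:
        print(exc)  # unpack(b) received extra data.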
Well, it seems like I forgot the mutex! And the tests produce very little data, so this never occurred there. I need to think about how to test these kinds of things better.
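For the record, the locking I had in mind is nothing fancy: hold a mutex for the whole request/response round trip on the shared manager connection, so the reply that gets read always belongs to the request that was just sent. A rough sketch (class and method names are illustrative, not the actual libmuscle code):

    import threading

    class SharedManagerConnection:
        """Illustrative only: a manager connection that may be used from
        several threads (e.g. logging and profiling) at the same time."""

        def __init__(self, sock):
            self._sock = sock
            self._mutex = threading.Lock()

        def call(self, request: bytes) -> bytes:
            # Holding the lock across both the send and the receive keeps
            # request/response pairs from different threads from interleaving.
            with self._mutex:
                self._sock.sendall(request)
                return self._receive_reply()

        def _receive_reply(self) -> bytes:
            # Placeholder; the real protocol would read a complete,
            # length-delimited reply rather than a single recv().
            return self._sock.recv(65536)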
I've pushed a fix to the issue_247_mmp_race_condition branch; could you check whether that fixes the issue?
It is going to take a few days to get to that -- other things that I delayed during the hackathon now have pressing deadlines.
I'm in the same boat a bit; I'll try to reproduce this when I have time, because I need a test for this to avoid regressions. Some kind of stress test, I guess.
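Roughly what I have in mind, as a sketch only: many threads firing log messages at the manager through the same client at once, so that a missing lock shows up as a decode error instead of going unnoticed. The client is assumed to expose submit_log_message() like the MMP client in the traceback, and make_debug_message() is a hypothetical helper that builds a DEBUG-level log record.

    import threading

    def stress_log_rpc(client, n_threads=16, n_messages=1000):
        # Hammer the manager's log RPC from many threads over the shared
        # connection; collect any exceptions the workers raise.
        errors = []

        def worker():
            try:
                for i in range(n_messages):
                    client.submit_log_message(make_debug_message(i))
            except Exception as exc:
                errors.append(exc)

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Without proper locking in the client, some workers should hit
        # msgpack.exceptions.ExtraData or get a reply meant for another thread.
        assert not errors, errors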
OK -- while attending some Zoom presentations I was able to install the issue_247_mmp_race_condition branch and have run some tests. All of the tests ran without a problem:
run_FUSION_MUSCLE3_20230629_131907/configuration.ymmsl: multiplicity: 10
run_FUSION_MUSCLE3_20230629_131907/instances/easyvvuq/stdout.txt:Time for phase 3 = 90.157
run_FUSION_MUSCLE3_20230629_135213/configuration.ymmsl: multiplicity: 15
run_FUSION_MUSCLE3_20230629_135213/instances/easyvvuq/stdout.txt:Time for phase 3 = 75.249
run_FUSION_MUSCLE3_20230629_132910/configuration.ymmsl: multiplicity: 20
run_FUSION_MUSCLE3_20230629_132910/instances/easyvvuq/stdout.txt:Time for phase 3 = 70.627
run_FUSION_MUSCLE3_20230629_135954/configuration.ymmsl: multiplicity: 25
run_FUSION_MUSCLE3_20230629_135954/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.026
run_FUSION_MUSCLE3_20230629_134428/configuration.ymmsl: multiplicity: 30
run_FUSION_MUSCLE3_20230629_134428/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.638
run_FUSION_MUSCLE3_20230629_133730/configuration.ymmsl: multiplicity: 40
run_FUSION_MUSCLE3_20230629_133730/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.874
So -- I think you have fixed the problem. Thanks!
Many thanks!
Ah, nice! I still can't believe I missed this, but I guess that's what being in a hurry does. Let's keep this issue open until I've made a stress test with a reasonable chance of catching this. Thanks for reporting!
Fix released with 0.7.1, but I haven't got that stress test yet, so I'd like to leave this open for a bit longer until that's ready. It's not really fixed until we have a process for making sure it doesn't return.
After increasing the number of copies of my physics code from 10 to 20, I have started seeing errors ...
Traceback (most recent call last):
  File "/afs/eufus.eu/user/g/g2dpc/GIT/ets_paf/UQ/load_balancer.py", line 60, in <module>
    main()
  File "/afs/eufus.eu/user/g/g2dpc/GIT/ets_paf/UQ/load_balancer.py", line 34, in main
    while instance.reuse_instance():
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 232, in reuse_instance
    do_reuse = self._decide_reuse_instance()
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 851, in _decide_reuse_instance
    got_f_init_messages = self._pre_receive()
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1018, in _pre_receive
    self.__pre_receive_f_init()
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1084, in __pre_receive_f_init
    pre_receive(port_name, slot)
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1061, in pre_receive
    msg, saved_until = self._communicator.receive_message(
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/communicator.py", line 386, in receive_message
    _logger.debug('Received message on {}'.format(port_and_slot))
  File "/gw/switm/python/3.10/lib/python3.10/logging/__init__.py", line 1465, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/gw/switm/python/3.10/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/gw/switm/python/3.10/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/gw/switm/python/3.10/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/gw/switm/python/3.10/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/logging_handler.py", line 31, in emit
    self._manager.submit_log_message(message)
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/mmp_client.py", line 137, in submit_log_message
    self._call_manager(request)
  File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/mmp_client.py", line 285, in _call_manager
    return msgpack.unpackb(response, raw=False)
  File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
This is in a workflow consisting of a UQ component, a load balancer component, and then multiple copies of the physics code.
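For reference, the relevant part of the load balancer is essentially the standard reuse loop that appears in the traceback above. A much simplified sketch of the pattern, not the actual load_balancer.py (port names are made up and the multi-slot handling is omitted):

    from libmuscle import Instance, Message
    from ymmsl import Operator

    def main():
        # Simplified: one F_INIT input and one O_F output instead of the
        # vector ports a real load balancer would use.
        instance = Instance({
            Operator.F_INIT: ['task_in'],
            Operator.O_F: ['result_out']})

        # reuse_instance() is where the crash happens: it pre-receives the
        # F_INIT messages, and logging that receive triggers the manager RPC.
        while instance.reuse_instance():
            msg = instance.receive('task_in')
            # ... dispatch the work and collect the result ...
            instance.send('result_out', Message(msg.timestamp, data=msg.data))

    if __name__ == '__main__':
        main()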
I have also seen a case that just hung, where I guess everybody was waiting for everybody else.
Is there a possible race condition?