multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0
25 stars 13 forks source link

After increasing the number of copies of my physics code from 10 to 20, I have started on seeing errors ... #247

Open DavidPCoster opened 1 year ago

DavidPCoster commented 1 year ago

After increasing the number of copies of my physics code from 10 to 20, I have started on seeing errors ...

Traceback (most recent call last): File "/afs/eufus.eu/user/g/g2dpc/GIT/ets_paf/UQ/load_balancer.py", line 60, in main() File "/afs/eufus.eu/user/g/g2dpc/GIT/ets_paf/UQ/load_balancer.py", line 34, in main while instance.reuse_instance(): File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 232, in reuse_instance do_reuse = self._decide_reuse_instance() File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 851, in _decide_reuse_instance got_f_init_messages = self._pre_receive() File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1018, in _pre_receive self.pre_receive_f_init() File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1084, in __pre_receive_f_init pre_receive(port_name, slot) File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/instance.py", line 1061, in pre_receive msg, saved_until = self._communicator.receive_message( File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/communicator.py", line 386, in receive_message _logger.debug('Received message on {}'.format(port_and_slot)) File "/gw/switm/python/3.10/lib/python3.10/logging/init.py", line 1465, in debug self._log(DEBUG, msg, args, **kwargs) File "/gw/switm/python/3.10/lib/python3.10/logging/init.py", line 1624, in _log self.handle(record) File "/gw/switm/python/3.10/lib/python3.10/logging/init.py", line 1634, in handle self.callHandlers(record) File "/gw/switm/python/3.10/lib/python3.10/logging/init.py", line 1696, in callHandlers hdlr.handle(record) File "/gw/switm/python/3.10/lib/python3.10/logging/init__.py", line 968, in handle self.emit(record) File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/logging_handler.py", line 31, in emit self._manager.submit_log_message(message) File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/mmp_client.py", line 137, in submit_log_message self._call_manager(request) File "/pfs/scratch/g2dpc/GIT/ets_paf/UQ/fusion_env/lib/python3.10/site-packages/libmuscle/mmp_client.py", line 285, in _call_manager return msgpack.unpackb(response, raw=False) File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb msgpack.exceptions.ExtraData: unpack(b) received extra data.

This is in a workflow consisting of an UQ component, a load balancer component, and then multiple copies of the physics code.

I have also seen a case that just hung where, I guess, everybody was waiting.

Is there a possible race condition?

LourensVeen commented 1 year ago

Oh wow, I haven't seen this one before. The load balancer is receiving a message and tries to log that, and that log message is sent to the manager (I guess the remote log level is set to DEBUG?). That's a simple RPC call, and the manager should return a simple "everything okay" acknowledgement in return, but evidently there's extra data being sent and msgpack is getting confused over that.

I'll have a look at the source code and see if I can figure out how that could happen. Some kind of race condition sounds like a possibility, if somehow there's another thread communicating with the manager at the same time over the same socket. It could be interference with the profiling information being uploaded, but I thought I'd put some mutexes in there to keep this from happening. I'll get back to it.

LourensVeen commented 1 year ago

Well, it seems like I forgot the mutex! And the tests produce very little data, so this never occurred there. I need to think about how to test these kinds of things better.

I've pushed a fix to the issue_247_mmp_race_condition branch, could you check to see if that fixes the issue?

DavidPCoster commented 1 year ago

It is going to take a few days to get to that -- other things that I delayed during the hackathon now have pressing deadlines.

LourensVeen commented 1 year ago

I'm in the same boat a bit, I'll try to reproduce this when I have time, because I need a test for this to avoid regressions. Some kind of stress test I guess.

DavidPCoster commented 1 year ago

OK -- while attending some zoom presentations I was able to install the issue_247_mmp_race_condition branch and have run some tests. All of the tests ran without a problem:

run_FUSION_MUSCLE3_20230629_131907/configuration.ymmsl: multiplicity: 10 run_FUSION_MUSCLE3_20230629_131907/instances/easyvvuq/stdout.txt:Time for phase 3 = 90.157 run_FUSION_MUSCLE3_20230629_135213/configuration.ymmsl: multiplicity: 15 run_FUSION_MUSCLE3_20230629_135213/instances/easyvvuq/stdout.txt:Time for phase 3 = 75.249 run_FUSION_MUSCLE3_20230629_132910/configuration.ymmsl: multiplicity: 20 run_FUSION_MUSCLE3_20230629_132910/instances/easyvvuq/stdout.txt:Time for phase 3 = 70.627 run_FUSION_MUSCLE3_20230629_135954/configuration.ymmsl: multiplicity: 25 run_FUSION_MUSCLE3_20230629_135954/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.026 run_FUSION_MUSCLE3_20230629_134428/configuration.ymmsl: multiplicity: 30 run_FUSION_MUSCLE3_20230629_134428/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.638 run_FUSION_MUSCLE3_20230629_133730/configuration.ymmsl: multiplicity: 40 run_FUSION_MUSCLE3_20230629_133730/instances/easyvvuq/stdout.txt:Time for phase 3 = 71.874

So -- I think you have fixed the problem. Thanks!

Many thanks!

LourensVeen commented 1 year ago

Ah, nice! I still can't believe I missed this, but I guess that's what being in a hurry does. Let's keep this issue open until I've made a stress test with a reasonable chance of catching this. Thanks for reporting!

LourensVeen commented 1 year ago

Fix released with 0.7.1, but I haven't got that stress test yet, so I'd like to leave this open for a bit longer until that's ready. It's not really fixed until we have a process for making sure it doesn't return.