multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Output final lines of log files on error for easier debugging #221

Closed. LourensVeen closed this 1 year ago

LourensVeen commented 1 year ago

This PR writes the final lines of a crashed instance's stderr to the manager log, and the final lines of the manager log to the console. In simple cases you can then see the problem right away, without digging through the files. Example output:

An error occurred during execution, and the simulation was
shut down. The manager log should tell you what happened.
Here are the final lines of the manager log:

--------------------------------------------------------------------------------

      File "docs/source/examples/python/build/venv/lib/python3.10/site-packages/libmuscle/mcp/tcp_util.py", line 25, in recv_all
        received_now = socket.recv_into(
    ConnectionResetError: [Errno 104] Connection reset by peer

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "docs/source/examples/python/diffusion.py", line 104, in <module>
        diffusion()
      File "docs/source/examples/python/diffusion.py", line 63, in diffusion
        msg = instance.receive('state_in', default=cur_state_msg)
      File "docs/source/examples/python/build/venv/lib/python3.10/site-packages/libmuscle/instance.py", line 461, in receive
        return self.__receive_message(port_name, slot, default, False)
      File "docs/source/examples/python/build/venv/lib/python3.10/site-packages/libmuscle/instance.py", line 877, in __receive_message
        msg, saved_until = self._communicator.receive_message(
      File "docs/source/examples/python/build/venv/lib/python3.10/site-packages/libmuscle/communicator.py", line 321, in receive_message
        raise RuntimeError(
    RuntimeError: Error while receiving a message: connection with peer 'micro' was lost. Did the peer crash?

muscle_manager 2023-04-15 20:27:35,753 ERROR   libmuscle.manager.instance_manager: More output may be found in docs/source/examples/run_reaction_diffusion_python_20230415_202735/instances/macro

muscle_manager 2023-04-15 20:27:35,753 ERROR   libmuscle.manager.instance_manager: Instance micro quit with error 2
muscle_manager 2023-04-15 20:27:35,753 ERROR   libmuscle.manager.instance_manager: The last error output of this instance was:
muscle_manager 2023-04-15 20:27:35,753 ERROR   libmuscle.manager.instance_manager: 

    INFO:libmuscle.instance:Received peer locations and base settings
    INFO:libmuscle.communicator:Connecting to peer macro at ['tcp:192.168.178.24:37937,[2a02:a46e:fb:1:ec17:81ff:642a:2541]:37937,[2a02:a46e:fb:1:81c2:7bd9:bb99:840c]:37937,192.168.122.1:37937,172.17.0.1:37937']

muscle_manager 2023-04-15 20:27:35,753 ERROR   libmuscle.manager.instance_manager: More output may be found in docs/source/examples/run_reaction_diffusion_python_20230415_202735/instances/micro

--------------------------------------------------------------------------------

You can find the full log at
docs/source/examples/run_reaction_diffusion_python_20230415_202735/muscle3_manager.log

(I sabotaged the example by adding an exit(2), so there's no useful error output, but if there were, it would have been printed!)
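
For readers curious how this kind of "show the tail of the log" reporting can work, here is a minimal Python sketch. It is not the code from this PR: the helper names `last_lines` and `format_crash_report` and the default of 20 lines are assumptions made for illustration, and only the message wording is borrowed from the example output above.

```python
from collections import deque
from pathlib import Path
from typing import List


def last_lines(path: Path, count: int = 20) -> List[str]:
    """Return up to `count` final lines of a text file (hypothetical helper).

    A deque with maxlen keeps only the newest `count` lines while the
    file is read once, so large log files are not held in memory.
    """
    try:
        with path.open('r', errors='replace') as f:
            return [line.rstrip('\n') for line in deque(f, maxlen=count)]
    except OSError:
        # A missing or unreadable file just means there is no tail to show.
        return []


def format_crash_report(
        name: str, exit_code: int, stderr_file: Path, count: int = 20
        ) -> str:
    """Build a report like the one in the example output (hypothetical)."""
    report = [f'Instance {name} quit with error {exit_code}']
    tail = last_lines(stderr_file, count)
    if tail:
        report.append('The last error output of this instance was:')
        # Indent the quoted output so it stands out from the log messages.
        report.extend('    ' + line for line in tail)
    report.append(f'More output may be found in {stderr_file.parent}')
    return '\n'.join(report)
```

Calling `format_crash_report('micro', 2, Path('instances/micro/stderr.txt'))` (the path here is made up) would return a short multi-line message that a manager could log at ERROR level. The same `last_lines` helper could then be reused to print the end of the manager log itself to the console.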