multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Tutorial example with no graphic output? #28

Closed diregoblin closed 4 years ago

diregoblin commented 4 years ago

The current Python example is nice, but it doesn't run correctly without an X display (e.g. when using ssh with no X forwarding).

For the record, in that particular case it produced the following (rather ugly) log:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/runner.py", line 109, in implementation_process
    implementation()
  File "./reaction_diffusion.py", line 106, in diffusion
    plt.figure()
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/matplotlib/pyplot.py", line 525, in figure
    **kwargs)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/matplotlib/backend_bases.py", line 3218, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/matplotlib/backends/_backend_tk.py", line 1008, in new_figure_manager_given_figure
    window = Tk.Tk(className="matplotlib")
  File "/usr/lib/python3.5/tkinter/__init__.py", line 1880, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: couldn't connect to display "node-10n:10.0"
Process Instance-micro:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/runner.py", line 109, in implementation_process
    implementation()
  File "./reaction_diffusion.py", line 20, in reaction
    while instance.reuse_instance():
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/instance.py", line 107, in reuse_instance
    self.__pre_receive_f_init(apply_overlay)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/instance.py", line 586, in __pre_receive_f_init
    pre_receive(port_name, None)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/instance.py", line 572, in pre_receive
    msg = self._communicator.receive_message(port_name, slot)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/communicator.py", line 287, in receive_message
    mcp_message_bytes = client.receive(recv_endpoint.ref())
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/mcp/pipe_client.py", line 63, in receive
    return cast(bytes, self._conn.recv())
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Traceback (most recent call last):
  File "./reaction_diffusion.py", line 135, in <module>
    run_simulation(configuration, implementations)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/runner.py", line 266, in run_simulation
    run_instances(instances)
  File "/home/asvitenkov/muscle3_venv/lib/python3.5/site-packages/libmuscle/runner.py", line 225, in run_instances
    ', '.join(failed_names)))
RuntimeError: Instances Instance-micro, Instance-macro failed to shut down cleanly, please check the logs to see what went wrong.

Enabling X11 forwarding fixes this issue, of course. However, having a basic command-line-only example for testing on remote machines would be nice, I guess.

LourensVeen commented 4 years ago

That sounds good. I'm not sure whether it would be better to just remove the graphical output (it's kind of a distraction from the MUSCLE part that we're trying to demonstrate anyway) or to make it optional, so that at least things don't crash if python-tk or an X server is missing. The latter would add more distraction still, so I'm tempted to remove it, but on the other hand, having a nice picture on your screen is also nice...

About the backtrace, while it's not perfect, I'm not that unhappy with it actually:

Did you look at the log files? Are they helpful here?

diregoblin commented 4 years ago

Not really, I simply ran ssh -X when I saw that :)

I've reproduced the issue, and the logs aren't very helpful here. The manager log only contains the basic startup messages (similar to a normal run, I think), and the macro and micro logs are empty.

muscle3_manager.log

diregoblin commented 4 years ago

Maybe you can keep two examples, one with graphic output and one without?

LourensVeen commented 4 years ago

Right, those logs are useless. I need to catch that exception in Instance.reuse_instance(), and log an error before quitting. Or even better, try to reconnect, in case it's just a dropped connection. But since those are rare on HPC machines, I haven't implemented that yet.
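
For illustration, catching the exception and logging before quitting could look roughly like the sketch below. It is not libmuscle code: `receive_or_log`, `demo` and the logger name are hypothetical, and a plain `multiprocessing` pipe stands in for the connection to the peer. The only behaviour relied on is the one visible in the traceback above, namely that `recv()` raises `EOFError` once the other end has gone away.

```python
import logging
from multiprocessing import Pipe

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("peer_example")


def receive_or_log(conn, peer_name):
    """Receive one message, logging a clear error if the peer has gone away."""
    try:
        return conn.recv()
    except EOFError:
        # The other end was closed and nothing is left to read,
        # which usually means the peer process has crashed.
        logger.error(
            "Peer %s closed the connection before sending a message; "
            "check its output for the original error.", peer_name)
        raise


def demo():
    """Simulate a disappearing peer by closing the sending end first."""
    recv_end, send_end = Pipe(duplex=False)
    send_end.close()
    try:
        receive_or_log(recv_end, "macro")
    except EOFError:
        return "peer lost"
    return "received"
```

The exception is re-raised after logging, so the instance still shuts down; the log file just gains a line pointing at the peer that actually failed.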

I like the idea of having two examples, and maybe leaving it as an exercise to the reader to swap out the implementation. Although I guess most readers would just comment out the plotting lines in the original...

LourensVeen commented 4 years ago

Okay, I've actually kept a single example, but made it possible to disable plotting by defining an environment variable. That should make things work on text-only machines. See commit 1c72f98.
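
A minimal sketch of that kind of switch is shown below. The variable name `NO_PLOT` and the helper functions are assumptions made for illustration; the name actually used in the example is the one in the commit. Matplotlib is imported only inside the plotting branch, so a machine without python-tk or an X server never tries to open a window.

```python
import os


def plotting_enabled(environ=os.environ):
    """Return True unless the (hypothetical) NO_PLOT variable is defined."""
    return "NO_PLOT" not in environ


def report(values, environ=os.environ):
    """Plot the values if plotting is enabled, else print a text summary."""
    if plotting_enabled(environ):
        # Imported lazily so text-only machines never load a GUI backend.
        from matplotlib import pyplot as plt
        plt.figure()
        plt.plot(values)
        plt.show()
        return "plotted"
    # Text-only fallback: enough to see that the simulation produced data.
    print("min={:.3f} max={:.3f}".format(min(values), max(values)))
    return "printed"
```

With this sketch, running `NO_PLOT=1 python3 reaction_diffusion.py` on a remote machine would skip the figure and print the summary instead. Note that merely defining the variable is enough, even with an empty value.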

I'll make a separate issue for better error messages when a peer disappears. I'll probably pick that up when I implement automatic start-up of components via a pilot job framework, as I need to consider life cycle then anyway, and right now I need to get Fortran support released.

LourensVeen commented 4 years ago

Better errors when a peer disappears is now #31. Closing this one, as the X issue is fixed. Please reopen if the fix does not work for you.