After @mkendler previously reported this problem, I reproduced the bug on acluster (on an allocated node with the local Runner, proFit 0.5.dev22+g119ccff for Python 3.9.2). Using `interface: zeromq` results in a `ConnectionError`:
Traceback (most recent call last):
File "/home/oswell/venv/bin/profit", line 33, in <module>
sys.exit(load_entry_point('profit', 'console_scripts', 'profit')())
File "/home/oswell/profit/profit/main.py", line 88, in main
runner.spawn_array(tqdm(params_array), blocking=True)
File "/home/oswell/profit/profit/run/default.py", line 63, in spawn_array
self.spawn_run(params)
File "/home/oswell/profit/profit/run/default.py", line 48, in spawn_run
worker = Worker.from_config(self.run_config, self.next_run_id)
File "/home/oswell/profit/profit/run/worker.py", line 185, in from_config
return cls[config['worker']](config, interface, pre, post, run_id)
File "/home/oswell/profit/profit/run/worker.py", line 178, in __init__
self.interface: Interface = interface_class(config['interface'], run_id, logger_parent=self.logger)
File "/home/oswell/profit/profit/run/zeromq.py", line 92, in __init__
self.request('READY') # self.input, self.output
File "/home/oswell/profit/profit/run/zeromq.py", line 159, in request
raise ConnectionError('could not connect to RunnerInterface')
ConnectionError: could not connect to RunnerInterface
log/run_000.log:
2021-11-07 15:48:44,147 INFO Interface: connected to tcp://localhost:9100
2021-11-07 15:48:46,651 WARNING Interface: READY: no response
2021-11-07 15:48:47,652 INFO Interface: connected to tcp://localhost:9100
2021-11-07 15:48:50,155 WARNING Interface: READY: no response
2021-11-07 15:48:51,156 INFO Interface: connected to tcp://localhost:9100
2021-11-07 15:48:53,659 WARNING Interface: READY: no response
2021-11-07 15:48:54,660 INFO Interface: connected to tcp://localhost:9100
2021-11-07 15:48:57,163 WARNING Interface: READY: no response
2021-11-07 15:48:58,164 ERROR Interface: READY: 4 requests unsuccessful, abandoning
Setting `PROFIT_RUNNER_ADDRESS` to the hostname manually doesn’t help.
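A possible diagnostic step, sketched here as an assumption rather than as part of proFit: ZeroMQ's `connect` is asynchronous, so the `connected to tcp://localhost:9100` lines in the log do not by themselves show that the Runner is actually listening. A plain TCP probe from the node can help separate a network or address problem from a ZeroMQ-level one. The helper name `tcp_port_reachable` is hypothetical; port 9100 is taken from the log above.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds.

    Hypothetical helper for debugging only; it is not part of proFit.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage while the Runner is up (9100 is the port from the log):
#   tcp_port_reachable("localhost", 9100)
#   tcp_port_reachable(socket.gethostname(), 9100)
```

If the probe succeeds for the address the Worker uses while `READY` still gets no response, the problem is more likely above the TCP layer, which would be consistent with the observations below.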
Interestingly, `test_zeromq` (`tests/run/test_components.py`) passes. The only obvious difference is that the test uses a thread while the run uses a subprocess.
Workaround: using `fork: false` for the local Runner works without error.
Using the ZeroMQ Interface with the Slurm Runner also works fine. I therefore don’t think the problem lies in the ZeroMQ Interface itself, but rather in the forked Workers.