logicalclocks / maggy

Distribution transparent Machine Learning experiments on Apache Spark
https://maggy.ai
Apache License 2.0
89 stars, 14 forks

Consecutive experiments on Databricks failing #103

Closed. moritzmeister closed this issue 3 years ago

moritzmeister commented 3 years ago

Find out why the sleep is needed

RiccardoGrigoletto commented 3 years ago

Here's the error log I get when running the second experiment right after the first one. If I rerun the same cell after a few seconds, it works.

OSError                                   Traceback (most recent call last)
in
----> 1 experiment.lagom(training_function, training_config)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/experiment.py in lagom(train_fn, config)
     74         APP_ID, RUN_ID = util.register_environment(APP_ID, RUN_ID)
     75         driver = lagom_driver(config, APP_ID, RUN_ID)
---> 76         return driver.run_experiment(train_fn)
     77     except:  # noqa: E722
     78         _exception_handler(util.seconds_to_milliseconds(time.time() - job_start))

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/experiment_driver/driver.py in run_experiment(self, train_fn)
    134             return result
    135         except Exception as exc:  # pylint: disable=broad-except
--> 136             self._exp_exception_callback(exc)
    137         finally:
    138             # Grace period to send last logs to sparkmagic.

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/experiment_driver/tf_distributed_training_driver.py in _exp_exception_callback(self, exc)
     81                 automatically on the workers for you."""
     82             ) from exc
---> 83         raise exc
     84
     85     def _patching_fn(self, train_fn: Callable) -> Callable:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/experiment_driver/driver.py in run_experiment(self, train_fn)
    116                 )
    117             )
--> 118             self.init(job_start)
    119             # Create a spark rdd partitioned into single integers, one for each executor. Allows
    120             # execution of functions on each executor node.

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/experiment_driver/driver.py in init(self, job_start)
    178         :param job_start: Time of the job start.
    179         """
--> 180         self.server_addr = self.server.start(self)
    181         self.job_start = job_start
    182         self._start_worker()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/rpc.py in start(self, exp_driver)
    254         server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    255         server_sock, SERVER_HOST_PORT = EnvSing.get_instance().connect_host(
--> 256             server_sock, SERVER_HOST_PORT, exp_driver
    257         )
    258

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5e352dcb-3395-4fa6-afe7-3a2fb7be0753/lib/python3.7/site-packages/maggy/core/environment/base.py in connect_host(self, server_sock, server_host_port, exp_driver)
    143
    144         else:
--> 145             server_sock.bind(server_host_port)
    146
    147         server_sock.listen(10)

OSError: [Errno 98] Address already in use
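
For context, Errno 98 (EADDRINUSE) is what `bind` raises when another socket still holds the requested address. The sketch below is not maggy's code and not the change made in #106. It uses a hypothetical helper, `bind_with_retry`, and assumes the port from the first experiment is still held when the second one starts. Under that assumption it illustrates why waiting a few seconds before rerunning the cell can help: the bind succeeds once the earlier listener has been released.

```python
import errno
import socket
import time


def bind_with_retry(host, port, attempts=5, delay=0.5):
    """Bind a listening TCP socket, retrying while the address is in use.

    Hypothetical helper for illustration only; not part of maggy.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Same option the traceback shows being set in rpc.py.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            sock.listen(10)
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # unrelated failure, do not hide it
            last_error = exc
            time.sleep(delay)  # give the previous owner time to release the port
    raise last_error
```

Calling `bind_with_retry("127.0.0.1", port)` while another listener owns `port` fails with the same `OSError` after the last attempt; once that listener is closed, the call returns a bound socket. A retry loop like this only hides the symptom, so the question of why the previous server socket is still open remains worth answering.
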
moritzmeister commented 3 years ago

Closed by #106.