intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

TF2Estimator on Databricks cannot start ray #70

Closed jenniew closed 2 years ago

jenniew commented 2 years ago

Running TF2Estimator on Databricks fails to start Ray with the following error:

```
Exception                                 Traceback (most recent call last)
in
     96 # Create Instance of Model
     97 #model = tm117_model.tm117_4_tfoptimizer()
---> 98 est = Estimator.from_keras(model_creator=tm117_model)
     99 est.set_tensorboard(log_dir, app_name)
    100

/databricks/python/lib/python3.8/site-packages/zoo/orca/learn/tf2/estimator.py in from_keras(model_creator, config, verbose, workers_per_node, compile_args_creator, backend, cpu_binding)
     64     :param cpu_binding: (bool) Whether to binds threads to specific CPUs. Default: False
     65     """
---> 66     return TensorFlow2Estimator(model_creator=model_creator, config=config,
     67                                 verbose=verbose, workers_per_node=workers_per_node,
     68                                 backend=backend, compile_args_creator=compile_args_creator,

/databricks/python/lib/python3.8/site-packages/zoo/orca/learn/tf2/estimator.py in __init__(self, model_creator, compile_args_creator, config, verbose, backend, workers_per_node, cpu_binding)
     99         self.verbose = verbose
    100
--> 101         ray_ctx = RayContext.get()
    102         if "batch_size" in self.config:
    103             raise Exception("Please do not specify batch_size in config. Input batch_size in the"

/databricks/python/lib/python3.8/site-packages/zoo/ray/raycontext.py in get(cls, initialize)
    452             ray_ctx = RayContext._active_ray_context
    453             if initialize and not ray_ctx.initialized:
--> 454                 ray_ctx.init()
    455             return ray_ctx
    456         else:

/databricks/python/lib/python3.8/site-packages/zoo/ray/raycontext.py in init(self, driver_cores)
    539         self.cluster_ips = self._gather_cluster_ips()
    540         redis_address = self._start_cluster()
--> 541         self._address_info = self._start_driver(num_cores=driver_cores,
    542                                                 redis_address=redis_address)
    543

/databricks/python/lib/python3.8/site-packages/zoo/ray/raycontext.py in _start_driver(self, num_cores, redis_address)
    612         import ray._private.services
    613         node_ip = ray._private.services.get_node_ip_address(redis_address)
--> 614         self._start_restricted_worker(num_cores=num_cores,
    615                                       node_ip_address=node_ip,
    616                                       redis_address=redis_address)

/databricks/python/lib/python3.8/site-packages/zoo/ray/raycontext.py in _start_restricted_worker(self, num_cores, node_ip_address, redis_address)
    602         modified_env = self.ray_service._prepare_env()
    603         print("Executing command: {}".format(command))
--> 604         process_info = session_execute(command=command, env=modified_env,
    605                                        tag="raylet", fail_fast=True)
    606         RayServiceFuncGenerator.start_ray_daemon("python",

/databricks/python/lib/python3.8/site-packages/zoo/ray/process.py in session_execute(command, env, tag, fail_fast, timeout)
     72     if errorcode != 0:
     73         if fail_fast:
---> 74             raise Exception(err)
     75         print(err)
     76     else:

Exception: Python path configuration:
  PYTHONHOME = '/databricks/python'
  PYTHONPATH = '/databricks/spark/python:/databricks/spark/python/lib/py4j-0.10.9-src.zip:/databricks/jars/spark--driver--driver-spark_3.1_2.12_deploy.jar:/databricks/spark/python:/databricks/jars/spark--maven-trees--ml--9.x--graphframes--org.graphframes--graphframes_2.12--org.graphframesgraphframes_2.12__0.8.1-db2-spark3.1.jar:/databricks/python_shell'
  program name = '/databricks/python3/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/databricks/python3/bin/python'
  sys.base_prefix = '/databricks/python'
  sys.base_exec_prefix = '/databricks/python'
  sys.executable = '/databricks/python3/bin/python'
  sys.prefix = '/databricks/python'
  sys.exec_prefix = '/databricks/python'
  sys.path = [
    '/databricks/spark/python',
    '/databricks/spark/python/lib/py4j-0.10.9-src.zip',
    '/databricks/jars/spark--driver--driver-spark_3.1_2.12_deploy.jar',
    '/databricks/spark/python',
    '/databricks/jars/spark--maven-trees--ml--9.x--graphframes--org.graphframes--graphframes_2.12--org.graphframesgraphframes_2.12__0.8.1-db2-spark3.1.jar',
    '/databricks/python_shell',
    '/databricks/python/lib/python38.zip',
    '/databricks/python/lib/python3.8',
    '/databricks/python/lib/python3.8/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007fd05c5fd740 (most recent call first):
<no Python frame>
```
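Note that in the final error the interpreter being launched (`/databricks/python3/bin/python`) does not live under the inherited `PYTHONHOME` (`/databricks/python`). As a general illustration (my own minimal repro, not Databricks-specific), any CPython aborts with this exact `encodings` failure when `PYTHONHOME` points at a directory that does not contain its standard library:

```shell
# Minimal repro sketch of the fatal error above (an assumption on my part,
# not taken from the notebook): point PYTHONHOME at a directory with no
# Python standard library in it, and the interpreter dies during startup
# before any user code runs.
PYTHONHOME=/tmp/nonexistent_pythonhome python3 -c 'print("never reached")'
# The interpreter aborts with output similar to:
#   Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
#   ModuleNotFoundError: No module named 'encodings'
```

This suggests the Ray worker spawned by `_start_restricted_worker` inherits a `PYTHONHOME` that does not match the interpreter it executes.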

louie-tsai commented 2 years ago

Hi Jennie, could you share how we can reproduce the issue?

Louie

jenniew commented 2 years ago

I shared a notebook with you in Teams.

jenniew commented 2 years ago

Reproduced on a Databricks cluster. This appears to be a Python environment issue on Databricks.
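One quick way to confirm the environment mismatch on the driver (a hypothetical diagnostic snippet, not taken from the shared notebook) is to compare the interpreter path with the `PYTHONHOME`/`PYTHONPATH` that child processes such as the Ray worker will inherit:

```python
# Hypothetical diagnostic (illustrative only, not from the issue): the
# traceback shows sys.executable under /databricks/python3 while PYTHONHOME
# points at /databricks/python. A mismatch like that is what triggers the
# 'encodings' startup failure in any subprocess that inherits these vars.
import os
import sys

home = os.environ.get("PYTHONHOME")
path = os.environ.get("PYTHONPATH")
print("sys.executable:", sys.executable)
print("PYTHONHOME:", home)
print("PYTHONPATH:", path)

if home is not None and not sys.executable.startswith(home):
    print("warning: PYTHONHOME does not match the interpreter location")
```

Running this in the notebook before creating the `TF2Estimator` would show whether the driver environment already carries the conflicting `PYTHONHOME`.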

jenniew commented 2 years ago

Mastercard has asked Databricks to fix the Ray environment issue.