fatenlouati opened this issue 1 year ago
Update: this is the traceback:
```
--> 314 est = Estimator.from_keras(model_creator=model, workers_per_node=5)

/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/estimator.py in from_keras(model_creator, config, verbose, workers_per_node, compile_args_creator, backend, cpu_binding, log_to_driver, model_dir, **kwargs)
     69     if backend in {"ray", "horovod"}:
     70         from bigdl.orca.learn.tf2.ray_estimator import TensorFlow2Estimator
---> 71         return TensorFlow2Estimator(model_creator=model_creator, config=config,
     72                                     verbose=verbose, workers_per_node=workers_per_node,
     73                                     backend=backend, compile_args_creator=compile_args_creator,

/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/ray_estimator.py in __init__(self, model_creator, compile_args_creator, config, verbose, backend, workers_per_node, cpu_binding)
    116         urls = ["{ip}:{port}".format(ip=ips[i], port=ports[i])
    117                 for i in range(len(self.remote_workers))]
--> 118         ray.get([worker.setup.remote() for worker in self.remote_workers])
    119         # Get setup tasks in order to throw errors on failure
    120         ray.get([

/databricks/python/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103         if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104             return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106
    107     return wrapper

/databricks/python/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1711                 worker.core_worker.dump_object_store_memory_usage()
   1712             if isinstance(value, RayTaskError):
-> 1713                 raise value.as_instanceof_cause()
   1714             else:
   1715                 raise value

RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>)
  File "/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/tf_runner.py", line 271, in setup
    tf.config.threading.set_inter_op_parallelism_threads(self.inter_op_parallelism)
  File "/databricks/python/lib/python3.8/site-packages/tensorflow/python/framework/config.py", line 144, in set_inter_op_parallelism_threads
    context.context().inter_op_parallelism_threads = num_threads
  File "/databricks/python/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 1841, in inter_op_parallelism_threads
    raise RuntimeError(
RuntimeError: Inter op parallelism cannot be modified after initialization.
```
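Reading the traceback, `Worker.setup()` fails because TensorFlow's eager context is already initialized by the time it tries to set inter-op parallelism. One plausible trigger (an assumption; the call at line 314 passes a variable named `model`, whose type we cannot see) is passing an already-built Keras model instead of a creator function: with the Ray backend, `model_creator` should be a function that builds and compiles the model inside each worker, roughly like this minimal sketch (the model itself is illustrative):

```python
from bigdl.orca.learn.tf2 import Estimator

def model_creator(config):
    # Import and build inside the worker process so TensorFlow is not
    # initialized before Worker.setup() configures threading.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

est = Estimator.from_keras(model_creator=model_creator,
                           workers_per_node=5,
                           backend="ray")
```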
Hi @fatenlouati, I tried to reproduce the error with your code but was not successful; I was able to run a TensorFlow2 Estimator application with the Ray backend on Databricks. Could you please share your Databricks cluster configuration and some sample code? That would be very helpful for us to analyze and resolve your issue 😄.
Thank you @sgwhat, this is my cluster configuration. For the code, I train my model (RL) with multiple iterations.
When I run with `backend="spark"`, after some iterations it stops with this error:
```
org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
```
What could be the problem? Maybe because I use the 14-day free trial, there are some limits?
Hi @fatenlouati,
I believe this error is caused by either cluster resource limits or non-uniform input data, possibly related to the free trial. Could you provide a portion of your code to help us pinpoint the root cause of the issue?
By the way, this error may also be related to your configuration: you may refer to our configuration guide (https://bigdl.readthedocs.io/en/latest/doc/UserGuide/databricks.html#set-spark-configuration) to restart the cluster, and also refer to this known issue to solve the error you met with the Ray estimator.
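For reference, the barrier check fails when the Spark backend asks for more worker slots than the cluster can provide at once; keeping the requested workers within the available task slots (or repartitioning the input, as the error message suggests) avoids it. A hedged sketch with illustrative numbers, reusing a `model_creator` function as sketched earlier:

```python
# Illustrative numbers only: with, say, 2 executors x 4 cores each and
# spark.task.cpus=1, the cluster has 8 barrier task slots. The Spark backend
# launches all Orca workers in one barrier stage, so the requested worker
# count (roughly num_executors * workers_per_node) must fit in those slots;
# a small trial cluster may only have room for workers_per_node=1.
est = Estimator.from_keras(model_creator=model_creator,
                           workers_per_node=1,
                           backend="spark")
```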
When running my code in Databricks using bigdl-orca, I got this error; the log:
```
RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>)
```
Any help to fix this issue, please? When using `backend="spark"`, it requires more resources; it seems that the `spark` backend does not distribute workloads across multiple nodes as is done with `ray`. Thank you.
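For anyone landing here later, a rough end-to-end skeleton of the Ray-backend path on Databricks. This is a hedged sketch, not confirmed working code: `cluster_mode="spark-submit"` follows the BigDL Databricks guide linked above, and the data-creator signature, worker count, and toy data are illustrative.

```python
from bigdl.orca import init_orca_context, stop_orca_context
from bigdl.orca.learn.tf2 import Estimator

def model_creator(config):
    # Build and compile the model inside each worker (see the sketch above).
    import tensorflow as tf
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")
    return model

def train_data_creator(config, batch_size):
    import tensorflow as tf
    # Toy data standing in for the real training set.
    ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal((128, 10)), tf.random.normal((128, 1))))
    return ds.batch(batch_size)

# The BigDL Databricks guide uses cluster_mode="spark-submit"; Ray workers
# are then launched on the Spark executors, spreading training across nodes.
init_orca_context(cluster_mode="spark-submit")

est = Estimator.from_keras(model_creator=model_creator,
                           workers_per_node=2,  # illustrative value
                           backend="ray")
est.fit(train_data_creator, epochs=2, batch_size=32)
est.shutdown()
stop_orca_context()
```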