intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Ray serialization cannot find my project packages. #494

Open GitEasonXu opened 3 years ago

GitEasonXu commented 3 years ago

The Ray workers cannot find my project modules when Ray deserializes tasks.

JavaGatewayServer has been successfully launched on executors
Start to launch ray on cluster
Start to launch ray driver on local
Executing command: ray start --address 192.168.1.1:65319 --redis-password 123456 --num-cpus 0 --node-ip-address 192.168.1.2
2020-12-23 09:08:35,438 WARNING worker.py:792 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.

2020-12-23 09:08:35,345 INFO scripts.py:429 -- Using IP address 192.168.1.2 for this node.
2020-12-23 09:08:35,349 INFO resource_spec.py:212 -- Starting Ray with 161.13 GiB memory available for workers and up to 69.08 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-12-23 09:08:35,350 WARNING services.py:1470 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-12-23 09:08:35,371 INFO scripts.py:438 -- 
Started Ray on this node. If you wish to terminate the processes that have been started, run

    ray stop

{'node_ip_address': '192.168.1.2', 'redis_address': '192.168.1.1:65319', 'object_store_address': '/tmp/ray/session_2020-12-23_04-08-27_331738_61806/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-12-23_04-08-27_331738_61806/sockets/raylet', 'webui_url': 'localhost:8302', 'session_dir': '/tmp/ray/session_2020-12-23_04-08-27_331738_61806'}
(pid=61956, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000005/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61956, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000005/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=62085, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=62085, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61835, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000003/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61835, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000003/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61834, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000003/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61887, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000004/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61887, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000004/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61886, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000004/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61886, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000004/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61955, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000005/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=61955, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000005/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=62086, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=62086, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61834, ip=192.168.1.1) Prepending /hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000003/python_env/lib/python3.6/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=61956, ip=192.168.1.1) 2020-12-23 04:08:36.069998: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61956, ip=192.168.1.1) 2020-12-23 04:08:36.070065: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=62085, ip=192.168.1.1) 2020-12-23 04:08:36.064976: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=62085, ip=192.168.1.1) 2020-12-23 04:08:36.065026: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=61955, ip=192.168.1.1) 2020-12-23 04:08:36.194650: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61955, ip=192.168.1.1) 2020-12-23 04:08:36.194692: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=61835, ip=192.168.1.1) 2020-12-23 04:08:36.238568: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61835, ip=192.168.1.1) 2020-12-23 04:08:36.238613: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=61834, ip=192.168.1.1) 2020-12-23 04:08:36.268510: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61834, ip=192.168.1.1) 2020-12-23 04:08:36.268566: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=61887, ip=192.168.1.1) 2020-12-23 04:08:36.238568: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61887, ip=192.168.1.1) 2020-12-23 04:08:36.238617: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=61886, ip=192.168.1.1) 2020-12-23 04:08:36.238568: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=61886, ip=192.168.1.1) 2020-12-23 04:08:36.238617: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=62086, ip=192.168.1.1) 2020-12-23 04:08:36.238568: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/jdk64/java/jre/lib/amd64/server:/usr/hdp/3.0.1.0-187/usr/lib/
(pid=62086, ip=192.168.1.1) 2020-12-23 04:08:36.238620: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
---------------------------------------------------------------------------
RayTaskError(ModuleNotFoundError)         Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-6-7e45289d08d3> in main(max_epoch)
      6     }
      7 
----> 8     est = Estimator.from_keras(model_creator, config=config, workers_per_node=2)
      9 
     10     # callback

~/anaconda3/envs/zoo-tf2.3/lib/python3.6/site-packages/zoo/orca/learn/tf2/tf_ray_estimator.py in from_keras(cls, model_creator, config, verbose, workers_per_node, compile_args_creator, backend)
    189         return cls(model_creator, config=config,
    190                    verbose=verbose, workers_per_node=workers_per_node,
--> 191                    backend=backend, compile_args_creator=compile_args_creator)
    192 
    193     def fit(self, data_creator, epochs=1, verbose=1,

~/anaconda3/envs/zoo-tf2.3/lib/python3.6/site-packages/zoo/orca/learn/tf2/tf_ray_estimator.py in __init__(self, model_creator, compile_args_creator, config, verbose, backend, workers_per_node)
    151                                    for i in range(0, num_nodes)]
    152             ips = ray.get(
--> 153                 [worker.get_node_ip.remote() for worker in self.remote_workers])
    154             ports = ray.get(
    155                 [worker.find_free_port.remote() for worker in self.remote_workers])

~/anaconda3/envs/zoo-tf2.3/lib/python3.6/site-packages/ray/worker.py in get(object_ids, timeout)
   1511                     worker.core_worker.dump_object_store_memory_usage()
   1512                 if isinstance(value, RayTaskError):
-> 1513                     raise value.as_instanceof_cause()
   1514                 else:
   1515                     raise value

RayTaskError(ModuleNotFoundError): ray::IDLE (pid=62085, ip=192.168.1.1)
  File "python/ray/_raylet.pyx", line 414, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "/hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/ray/serialization.py", line 323, in deserialize_objects
    self._deserialize_object(data, metadata, object_id))
  File "/hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/ray/serialization.py", line 271, in _deserialize_object
    return self._deserialize_pickle5_data(data)
  File "/hadoop/yarn/local/usercache/hdfs/appcache/application_1608014057528_0041/container_e15_1608014057528_0041_01_000002/python_env/lib/python3.6/site-packages/ray/serialization.py", line 262, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'object_detection'
yangw1234 commented 3 years ago

You can try init_orca_context(..., extra_python_lib='your/local/package'). @hkvision Am I right?

GitEasonXu commented 3 years ago

Adding the extra_python_lib argument to init_orca_context does not work. When I submit the task to the YARN resource manager, does zoo only pack the conda environment packages and not my project files? If so, how are the YARN workers supposed to load my project files?

yangw1234 commented 3 years ago

@GitEasonXu The argument passed to extra_python_lib will be passed to --py-files in the spark-submit command. The value should be a comma-separated list of .py, .zip or .egg files.
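
For reference, a minimal sketch of that usage (the file names below are placeholders, not taken from this issue):

from zoo.orca import init_orca_context

# Hypothetical example: ship your own modules to the YARN containers as a
# comma-separated list of .py / .zip / .egg files, the same way
# spark-submit --py-files does.
sc = init_orca_context(cluster_mode="yarn-client", num_nodes=4, cores=10,
                       init_ray_on_spark=True,
                       extra_python_lib="utils.py,object_detection.zip")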

How do you use init_orca_context with extra_python_lib? Could you provide your code?

yangw1234 commented 3 years ago

@GitEasonXu did you try the suggestions above?

GitEasonXu commented 3 years ago

I've just tried your suggestion, but it does not solve the problem. Here are some code snippets.

import time
from zoo.orca import init_orca_context  # import needed for the calls below

cluster_mode = "yarn"
extra_python_package = 'my_project_root_path'
if cluster_mode == "local":
    init_orca_context(cluster_mode="local", cores=4, init_ray_on_spark=True)
elif cluster_mode == "yarn":
    init_orca_context(cluster_mode="yarn-client", num_nodes=4, cores=10,
                      init_ray_on_spark=True, memory="16g", driver_memory="16g",
                      hadoop_user_name='hdfs', hadoop_conf="/etc/hadoop/3.0.1.0-187/0/",
                      extra_python_lib=extra_python_package)
yangw1234 commented 3 years ago

Hi @GitEasonXu, as I said, extra_python_lib should be a path to .py, .zip or .egg files; a directory is not supported. extra_python_lib has the same semantics as Spark's --py-files option, so you can try it with spark-submit first. If it works on Spark, it should work on analytics-zoo.
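
Since a directory is not accepted, one possible workaround (a sketch of my own, not something verified in this thread) is to zip the project package first and pass the resulting archive; the directory and package names below are placeholders:

import shutil
from zoo.orca import init_orca_context

# Archive the package directory so it can be shipped via extra_python_lib.
# "my_project_root_path/object_detection" stands in for the package that
# fails to import on the workers; the zip root will contain object_detection/.
archive = shutil.make_archive("object_detection", "zip",
                              root_dir="my_project_root_path",
                              base_dir="object_detection")

sc = init_orca_context(cluster_mode="yarn-client", num_nodes=4, cores=10,
                       init_ray_on_spark=True, memory="16g", driver_memory="16g",
                       hadoop_user_name="hdfs", hadoop_conf="/etc/hadoop/3.0.1.0-187/0/",
                       extra_python_lib=archive)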

yangw1234 commented 3 years ago

@hkvision Hi Kai, do you think it is possible to automatically pack the Python packages in the current directory? Say, packing all .py files and all directories containing __init__.py files in the current directory. A rough sketch of the idea follows.
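
Purely illustrative (not an existing analytics-zoo feature): collect top-level .py files and every directory that has an __init__.py into a single zip that could then be handed to --py-files / extra_python_lib.

import os
import zipfile

def pack_working_dir_packages(archive_path="auto_packed.zip", root="."):
    """Zip top-level .py files and every package directory (one containing
    an __init__.py) found directly under `root`."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for entry in os.listdir(root):
            path = os.path.join(root, entry)
            if os.path.isfile(path) and entry.endswith(".py"):
                zf.write(path, arcname=entry)
            elif os.path.isdir(path) and \
                    os.path.exists(os.path.join(path, "__init__.py")):
                for dirpath, _, filenames in os.walk(path):
                    for name in filenames:
                        if name.endswith(".py"):
                            full = os.path.join(dirpath, name)
                            zf.write(full, arcname=os.path.relpath(full, root))
    return archive_path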

hkvision commented 3 years ago

with modulepickle?

yangw1234 commented 3 years ago

with modulepickle?

Wouldn't it be a little dangerous to override all the cloudpickle usage in both Spark and Ray?