ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.14k stars 5.8k forks source link

[kubernetes] ModuleNotFoundError when executing a task on a remote cluster #15668

Open ebr opened 3 years ago

ebr commented 3 years ago

What is the problem?

I am trying to run the Streaming MapReduce example on a remote cluster, but the wikipedia module can not be found, even though it's installed both locally and on all cluster nodes.

Reproduction (REQUIRED)

I built a custom docker image using the following Dockerfile:

FROM rayproject/ray:nightly-py38
RUN conda install -c conda-forge wikipedia
  1. The cluster is deployed on Kubernetes using Ray Operator, and confirmed working by running a simple task with no dependencies.
  2. All cluster nodes are using the above custom image.
  3. I confirmed that the wikipedia module is installed by running a container from this image, and importing the module in the Python shell
  4. I am using a virtual environment locally, created/managed by poetry. the wikipedia module is confirmed to be installed in the virtualenv.
  5. I modified the https://github.com/ray-project/ray/blob/master/doc/examples/streaming/streaming.py to connect to the cluster using port-forwarding, replacing ray.init() with ray.util.connect(....)

When trying to run the above example, I get the following stack trace:

Traceback (most recent call last):
  File "streaming.py", line 92, in <module>
    mappers = [Mapper.remote(stream) for stream in streams]
  File "streaming.py", line 92, in <listcomp>
    mappers = [Mapper.remote(stream) for stream in streams]
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/actor.py", line 418, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 316, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/actor.py", line 572, in _remote
    return client_mode_convert_actor(
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 134, in client_mode_convert_actor
    return client_actor._remote(in_args, in_kwargs, **kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/common.py", line 232, in _remote
    return self.options(**option_args).remote(*args, **kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/common.py", line 345, in remote
    ref_ids = ray.call_remote(self, *args, **kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/api.py", line 103, in call_remote
    return self.worker.call_remote(instance, *args, **kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/worker.py", line 302, in call_remote
    task = instance._prepare_client_task()
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/common.py", line 338, in _prepare_client_task
    task = self.remote_stub._prepare_client_task()
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/common.py", line 243, in _prepare_client_task
    self._ensure_ref()
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/common.py", line 215, in _ensure_ref
    self._ref = ray.put(
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/api.py", line 52, in put
    return self.worker.put(*args, **kwargs)
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/worker.py", line 240, in put
    out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/worker.py", line 240, in <listcomp>
    out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/home/eugene/.cache/pypoetry/virtualenvs/ray-play-Watq55vL-py3.8/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in _put
    raise cloudpickle.loads(resp.error)
ModuleNotFoundError: No module named 'wikipedia'

The example does run locally as expected, if I leave ray.init() alone and don't connect to a remote cluster. Which leads me to believe this isn't an environment issue, but I'm not certain about that.

Is there anything else I need to do to make this module available on the remote nodes? It's also unclear to me whether both the head and worker nodes need to be running the custom image, or just the workers.

richardliaw commented 3 years ago

@ebr both the worker and head need to be running the custom image. Is that currently the case?

richardliaw commented 3 years ago

cc @DmitriGekhtman

trunghlt commented 3 years ago

We also got this issue frequently

sronen1 commented 3 years ago

I think I got a similar issue, on AWS, no k8s.