ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] std::bad_alloc error using ray.init() #33525

Open Yue-Li-atBain opened 1 year ago

Yue-Li-atBain commented 1 year ago

What happened + What you expected to happen

I was trying to run some code with Ray inside a Docker image, but ray.init() throws a std::bad_alloc error. The error persists even if I set the object store memory or _memory to less than 1 GB.

Versions / Dependencies

The Docker image was built with the following packages, starting from continuumio/miniconda3:4.10.3

channels:

Reproduction script

The error occurs with the first call of ray.init()
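
A minimal sketch of what such a reproduction script might look like, assuming the memory limits mentioned above were passed through the `object_store_memory` and `_memory` keyword arguments of `ray.init()`. The helper names `build_init_kwargs` and `reproduce` and the 512 MiB values are hypothetical and not taken from the original report.

```python
MIB = 1024 * 1024


def build_init_kwargs(object_store_mib=512, heap_mib=512):
    """Return ray.init() keyword arguments with explicit memory caps, in bytes.

    The 512 MiB defaults are illustrative values below 1 GB, not the
    reporter's actual settings.
    """
    return {
        "object_store_memory": object_store_mib * MIB,
        "_memory": heap_mib * MIB,
    }


def reproduce():
    """Start a local Ray instance with the capped memory settings."""
    import ray  # imported here so the helper above works without Ray installed

    ray.init(**build_init_kwargs())
    try:
        # Reaching this line means ray.init() did not fail.
        print(ray.cluster_resources())
    finally:
        ray.shutdown()
```

Calling `reproduce()` inside the container would show whether the first `ray.init()` call fails with these limits.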

Issue Severity

High: It blocks me from completing my task.

cadedaniel commented 1 year ago

Hi @Yue-Li-atBain , do you have a Dockerfile that installs these dependencies in a way that reproduces the issue?

Yue-Li-atBain commented 1 year ago

The Dockerfile is as follows. The environment.yaml contains the package list above.

FROM continuumio/miniconda3:4.10.3 AS main

RUN apt-get -y --allow-releaseinfo-change update && \
    apt-get -y install build-essential && \
    apt-get -y install dos2unix # required to execute ops\clean_files.sh on windows
RUN conda config --set ssl_verify false
RUN conda update -n base -c defaults conda

WORKDIR /opt/app

COPY --chown=1000:100 environment.yaml .
RUN conda env update -q --name base --file environment.yaml
RUN pip install flask-session==0.4.0

COPY src src
COPY setup.py .
RUN pip install -e .

cadedaniel commented 1 year ago

Can you share environment.yaml as well? Want to make sure we can reproduce exactly what you're seeing :). Also, how is Ray installed? I don't see it in the packages listed above or this dockerfile.

Lastly, what is pip install -e . installing?

Yue-Li-atBain commented 1 year ago

environment.yaml file is as follows:

channels: 
  - conda-forge
dependencies: 
  #core
  - ray-core = 2.3.0
  - tensorflow = 2.9
  - tensorflow-probability
  - pandas=1.3.4
  - u8darts-all=0.19.0
  - jupyterlab=3.2.4
  - ipython=8.3.0
  - numpy=1.22.3
  - pillow=9.1.1
  - ujson=5.4.0
  - jinja2=3.1.2
  - jupyter_server=1.17.1
  - notebook=6.4.12
  - hydra-core=1.2
  - openpyxl=3.0.9
  - mlflow=1.26.0
  - libstdcxx-ng
  - plotly=5.8.2
  #dev
  - pytest=6.2.5
  - pytest-cov=3.0.0
  - sphinx=4.3.0
  - black=22.3.0
  - pytest-helpers-namespace=2021.12.29
  - pip
  - pip:
    - streamlit==1.11.1
    - streamlit-aggrid==0.3.4.post3
    - rsconnect-python==1.15.0
    - statsforecast==0.7.1 # 1.0.0 does not work with the current version of DARTS
    - pytest-regtest==1.5.0

Yue-Li-atBain commented 1 year ago

So sorry, I pasted an older version of the environment.yaml file before. It's now updated, and I also updated the Dockerfile. Before

 pip install -e .

there is a very simple setup.py file and src folder being copied. The setup.py file is like this:

from setuptools import setup, find_packages

setup(
    name="src",
    version="0.0.0",
    packages=["src"],
    python_requires=">=3.9",
    # install_requires=["peppercorn"],  # Optional
)

Yue-Li-atBain commented 1 year ago

Were you able to reproduce?

mattip commented 1 year ago

Does the base environment as defined by the environment.yml work (ray.init() runs without failing)? I mean, if you use only this part of the Dockerfile, does it still segfault? Also: which version of python does this run with?

FROM continuumio/miniconda3:4.10.3 AS main

RUN apt-get -y --allow-releaseinfo-change update && \
    apt-get -y install build-essential && \
    apt-get -y install dos2unix # required to execute ops\clean_files.sh on windows
RUN conda config --set ssl_verify false
RUN conda update -n base -c defaults conda

WORKDIR /opt/app

COPY --chown=1000:100 environment.yaml .
RUN conda env update -q --name base --file environment.yaml

Yue-Li-atBain commented 1 year ago

Yes, it still segfaults.

mattip commented 1 year ago

Weird. When I use that Dockerfile and environment.yaml, the conda env update -q ... line hangs for me and cannot get past solving the environment:

...
Step 5/7 : WORKDIR /opt/app
 ---> Running in 1db0dfd64f96
Removing intermediate container 1db0dfd64f96
 ---> 50458d8acf24
Step 6/7 : COPY --chown=1000:100 environment.yaml .
 ---> c63175c89352
Step 7/7 : RUN conda env update -q --name base --file environment.yaml
 ---> Running in aa5841578ef8
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working...   

mattip commented 1 year ago

Whew. It built. And for me ray.init() works, although I do see a warning about /tmp being full. Perhaps running docker with a "-v" directive to map /tmp to a filesystem outside the docker image would help?

$ docker build -f Dockerfile .
...
$ docker run -it --rm 07243a46c39a /bin/bash
(base) root@9aa20e3a2307:/opt/app# python
Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init()
2023-04-28 13:16:52,511 WARNING services.py:1780 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-04-28 13:16:52,627 INFO worker.py:1553 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.9.5', ray_version='2.3.0', ray_commit='{{RAY_COMMIT_SHA}}', address_info={'node_ip_address': '172.17.0.2', 'raylet_ip_address': '172.17.0.2', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-04-28_13-16-50_982369_283/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-04-28_13-16-50_982369_283/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2023-04-28_13-16-50_982369_283', 'metrics_export_port': 50146, 'gcs_address': '172.17.0.2:55455', 'address': '172.17.0.2:55455', 'dashboard_agent_listen_port': 52365, 'node_id': '1c205d542df5dd250303c70b186db6cf23aa24ccb4acc7a3a3b58829'})
>>> (raylet) [2023-04-28 13:17:02,527 E 407 425] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-28_13-16-50_982369_283 is over 95% full, available space: 9684275200; capacity: 502921633792. Object creation will fail if spilling is required.
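
The "over 95% full" message can be checked by hand from the two numbers in the log. The sketch below assumes the raylet compares used space against capacity; the 95% threshold and the byte counts come from the log line, while the helper names `used_fraction` and `path_used_fraction` are hypothetical.

```python
import shutil


def used_fraction(available_bytes, capacity_bytes):
    """Fraction of the filesystem in use, given free space and total capacity."""
    return 1.0 - available_bytes / capacity_bytes


def path_used_fraction(path="/tmp"):
    """Measure the same fraction for a live path using the standard library."""
    usage = shutil.disk_usage(path)
    return used_fraction(usage.free, usage.total)


# Values copied from the raylet log line above.
LOG_AVAILABLE = 9684275200
LOG_CAPACITY = 502921633792

log_fraction = used_fraction(LOG_AVAILABLE, LOG_CAPACITY)
print(f"used: {log_fraction:.1%}, over 95% threshold: {log_fraction > 0.95}")
```

So although roughly 9.7 GB is still free, it is under 2% of the 503 GB filesystem, which is why the warning fires.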

mattip commented 1 year ago

And of course, you can increase the /dev/shm space by following the warning:

you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run...
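
As a rough sketch of how to pick a value for that flag, the warning asks for more than 30% of available RAM, so the function below computes that lower bound. The helper names are hypothetical, the RAM lookup assumes Linux, and the flag string simply mirrors the format used in the warning text.

```python
import math
import os

GIB = 1024 ** 3


def recommended_shm_bytes(total_ram_bytes, fraction=0.30):
    """Lower bound for /dev/shm: the given fraction of RAM, rounded up."""
    return math.ceil(total_ram_bytes * fraction)


def shm_flag(total_ram_bytes, fraction=0.30):
    """Format the bound the way the Ray warning does, e.g. '--shm-size=4.80gb'."""
    gib = recommended_shm_bytes(total_ram_bytes, fraction) / GIB
    return f"--shm-size={gib:.2f}gb"


def host_ram_bytes():
    """Total physical RAM on Linux (assumes these sysconf names are available)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


# The 67108864 bytes reported in the warning above is 64 MiB.
REPORTED_SHM_BYTES = 67108864
print(REPORTED_SHM_BYTES // (1024 * 1024), "MiB currently in /dev/shm")
print(shm_flag(16 * GIB))  # flag for a hypothetical 16 GiB host
```

Since the warning says "more than 30%", treat the result as a minimum and round up when passing it to `docker run`.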