ray-project / ray

[tune] AMD-Instinct-MI250X falsely shown as unused #45684

Open gregordecristoforo opened 4 months ago

gregordecristoforo commented 4 months ago

What happened + What you expected to happen

I am using Ray Tune on the LUMI supercomputer on one full GPU node. The node contains four AMD MI250X GPUs with two GPU dies each, so Ray sees 8 logical GPUs.

The output of the script contains the following:

Logical resource usage: 56.0/56 CPUs, 8.0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI250X)

which shows that all the GPUs are utilized as intended (checking with rocm-smi gives the same result). However, the statement 0.0/1.0 accelerator_type:AMD-Instinct-MI250X is clearly incorrect. Shouldn't it show 1/1 or even 8/8 accelerator_type:AMD-Instinct-MI250X? Please let me know if any additional information is required.

Versions / Dependencies

Ray: 2.12.0 Python: 3.11.9 OS: Linux 5.14.21-150400.24.81_12.0.75-cray_shasta_c x86_64

The whole conda environment looks as follows:

    # packages in environment at /opt/conda/envs/conda_container_env:
    #
    # Name Version Build Channel
    _libgcc_mutex 0.1 conda_forge conda-forge
    _openmp_mutex 4.5 2_gnu conda-forge
    accelerate 0.29.0 pypi_0 pypi
    aiohttp 3.9.5 py311h459d7ec_0 conda-forge
    aiohttp-cors 0.7.0 py_0 conda-forge
    aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge
    annotated-types 0.7.0 pyhd8ed1ab_0 conda-forge
    async-timeout 4.0.3 pyhd8ed1ab_0 conda-forge
    attrs 23.2.0 pyh71513ae_0 conda-forge
    aws-c-auth 0.7.11 h0b4cabd_1 conda-forge
    aws-c-cal 0.6.9 h14ec70c_3 conda-forge
    aws-c-common 0.9.12 hd590300_0 conda-forge
    aws-c-compression 0.2.17 h572eabf_8 conda-forge
    aws-c-event-stream 0.4.1 h97bb272_2 conda-forge
    aws-c-http 0.8.0 h9129f04_2 conda-forge
    aws-c-io 0.14.0 hf8f278a_1 conda-forge
    aws-c-mqtt 0.10.1 h2b97f5f_0 conda-forge
    aws-c-s3 0.4.9 hca09fc5_0 conda-forge
    aws-c-sdkutils 0.1.13 h572eabf_1 conda-forge
    aws-checksums 0.1.17 h572eabf_7 conda-forge
    aws-crt-cpp 0.26.0 h04327c0_8 conda-forge
    aws-sdk-cpp 1.11.210 hba3e011_10 conda-forge
    brotli-python 1.1.0 py311hb755f60_1 conda-forge
    bzip2 1.0.8 hd590300_5 conda-forge
    c-ares 1.28.1 hd590300_0 conda-forge
    ca-certificates 2024.2.2 hbcca054_0 conda-forge
    cachetools 5.3.3 pyhd8ed1ab_0 conda-forge
    certifi 2024.2.2 pyhd8ed1ab_0 conda-forge
    cffi 1.16.0 py311hb3a22ac_0 conda-forge
    charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
    click 8.1.7 unix_pyh707e725_0 conda-forge
    colorama 0.4.6 pyhd8ed1ab_0 conda-forge
    colorful 0.5.6 pyhd8ed1ab_0 conda-forge
    cryptography 42.0.7 py311h4a61cc7_0 conda-forge
    datasets 2.18.0 pyhd8ed1ab_0 conda-forge
    dill 0.3.8 pyhd8ed1ab_0 conda-forge
    distlib 0.3.8 pyhd8ed1ab_0 conda-forge
    filelock 3.13.4 pyhd8ed1ab_0 conda-forge
    freetype 2.12.1 h267a509_2 conda-forge
    frozenlist 1.4.1 py311h459d7ec_0 conda-forge
    fsspec 2024.2.0 pyhca7485f_0 conda-forge
    gflags 2.2.2 he1b5a44_1004 conda-forge
    glog 0.6.0 h6f12383_0 conda-forge
    gmp 6.3.0 h59595ed_1 conda-forge
    gmpy2 2.1.5 py311hc4f1f91_1 conda-forge
    google-api-core 2.19.0 pyhd8ed1ab_0 conda-forge
    google-auth 2.29.0 pyhca7485f_0 conda-forge
    googleapis-common-protos 1.63.0 pyhd8ed1ab_0 conda-forge
    grpcio 1.59.3 py311ha6695c7_0 conda-forge
    huggingface_hub 0.22.2 pyhd8ed1ab_0 conda-forge
    icu 73.2 h59595ed_0 conda-forge
    idna 3.7 pyhd8ed1ab_0 conda-forge
    importlib-metadata 7.1.0 pyha770c72_0 conda-forge
    importlib_resources 6.4.0 pyhd8ed1ab_0 conda-forge
    jinja2 3.1.3 pyhd8ed1ab_0 conda-forge
    jsonschema 4.22.0 pyhd8ed1ab_0 conda-forge
    jsonschema-specifications 2023.12.1 pyhd8ed1ab_0 conda-forge
    keyutils 1.6.1 h166bdaf_0 conda-forge
    krb5 1.21.2 h659d440_0 conda-forge
    lcms2 2.16 hb7c19ff_0 conda-forge
    ld_impl_linux-64 2.40 h55db66e_0 conda-forge
    lerc 4.0.0 h27087fc_0 conda-forge
    libabseil 20230802.1 cxx17_h59595ed_0 conda-forge
    libarrow 15.0.0 h84dd17c_0_cpu conda-forge
    libarrow-acero 15.0.0 h59595ed_0_cpu conda-forge
    libarrow-dataset 15.0.0 h59595ed_0_cpu conda-forge
    libarrow-flight 15.0.0 h120cb0d_0_cpu conda-forge
    libarrow-flight-sql 15.0.0 h61ff412_0_cpu conda-forge
    libarrow-gandiva 15.0.0 hacb8726_0_cpu conda-forge
    libarrow-substrait 15.0.0 h61ff412_0_cpu conda-forge
    libblas 3.9.0 22_linux64_openblas conda-forge
    libbrotlicommon 1.1.0 hd590300_1 conda-forge
    libbrotlidec 1.1.0 hd590300_1 conda-forge
    libbrotlienc 1.1.0 hd590300_1 conda-forge
    libcblas 3.9.0 22_linux64_openblas conda-forge
    libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
    libcurl 8.7.1 hca28451_0 conda-forge
    libdeflate 1.20 hd590300_0 conda-forge
    libedit 3.1.20191231 he28a2e2_2 conda-forge
    libev 4.33 hd590300_2 conda-forge
    libevent 2.1.12 hf998b51_1 conda-forge
    libexpat 2.6.2 h59595ed_0 conda-forge
    libffi 3.4.2 h7f98852_5 conda-forge
    libgcc-ng 13.2.0 h77fa898_7 conda-forge
    libgfortran-ng 13.2.0 h69a702a_7 conda-forge
    libgfortran5 13.2.0 hca663fb_7 conda-forge
    libgomp 13.2.0 h77fa898_7 conda-forge
    libgoogle-cloud 2.12.0 h5206363_4 conda-forge
    libgrpc 1.59.3 hd6c4280_0 conda-forge
    libiconv 1.17 hd590300_2 conda-forge
    libjpeg-turbo 3.0.0 hd590300_1 conda-forge
    liblapack 3.9.0 22_linux64_openblas conda-forge
    libllvm15 15.0.7 hb3ce162_4 conda-forge
    libnghttp2 1.58.0 h47da74e_1 conda-forge
    libnl 3.9.0 hd590300_0 conda-forge
    libnsl 2.0.1 hd590300_0 conda-forge
    libopenblas 0.3.27 pthreads_h413a1c8_0 conda-forge
    libparquet 15.0.0 h352af49_0_cpu conda-forge
    libpng 1.6.43 h2797004_0 conda-forge
    libprotobuf 4.24.4 hf27288f_0 conda-forge
    libre2-11 2023.09.01 h7a70373_1 conda-forge
    libsqlite 3.45.3 h2797004_0 conda-forge
    libssh2 1.11.0 h0841786_0 conda-forge
    libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
    libthrift 0.19.0 hb90f79a_1 conda-forge
    libtiff 4.6.0 h1dd3fc0_3 conda-forge
    libunwind 1.6.2 h9c3ff4c_0 conda-forge
    libutf8proc 2.8.0 h166bdaf_0 conda-forge
    libuuid 2.38.1 h0b41bf4_0 conda-forge
    libuv 1.48.0 hd590300_0 conda-forge
    libwebp-base 1.4.0 hd590300_0 conda-forge
    libxcb 1.15 h0b41bf4_0 conda-forge
    libxcrypt 4.4.36 hd590300_1 conda-forge
    libxml2 2.12.7 hc051c1a_0 conda-forge
    libzlib 1.2.13 hd590300_5 conda-forge
    lz4-c 1.9.4 hcb278e6_0 conda-forge
    markdown-it-py 3.0.0 pyhd8ed1ab_0 conda-forge
    markupsafe 2.1.5 py311h459d7ec_0 conda-forge
    mdurl 0.1.2 pyhd8ed1ab_0 conda-forge
    memray 1.12.0 py311h259950f_0 conda-forge
    mpc 1.3.1 hfe3b2da_0 conda-forge
    mpfr 4.2.1 h9458935_1 conda-forge
    mpmath 1.3.0 pyhd8ed1ab_0 conda-forge
    msgpack-python 1.0.8 py311h52f7536_0 conda-forge
    multidict 6.0.5 py311h459d7ec_0 conda-forge
    multiprocess 0.70.16 py311h459d7ec_0 conda-forge
    ncurses 6.5 h59595ed_0 conda-forge
    networkx 3.3 pyhd8ed1ab_1 conda-forge
    nodejs 20.12.2 hb753e55_0 conda-forge
    numpy 1.26.4 py311h64a7726_0 conda-forge
    opencensus 0.11.3 pyhd8ed1ab_0 conda-forge
    opencensus-context 0.1.3 py311h38be061_2 conda-forge
    openjpeg 2.5.2 h488ebb8_0 conda-forge
    openssl 3.3.0 h4ab18f5_2 conda-forge
    orc 1.9.2 h4b38347_0 conda-forge
    packaging 24.0 pyhd8ed1ab_0 conda-forge
    pandas 2.2.2 py311h14de704_1 conda-forge
    pillow 10.3.0 py311h18e6fac_0 conda-forge
    pip 24.0 pyhd8ed1ab_0 conda-forge
    pkgutil-resolve-name 1.3.10 pyhd8ed1ab_1 conda-forge
    platformdirs 3.11.0 pyhd8ed1ab_0 conda-forge
    prometheus_client 0.20.0 pyhd8ed1ab_0 conda-forge
    proto-plus 1.23.0 pyhd8ed1ab_0 conda-forge
    protobuf 4.24.4 py311h46cbc50_0 conda-forge
    psutil 5.9.8 py311h459d7ec_0 conda-forge
    pthread-stubs 0.4 h36c2ea0_1001 conda-forge
    py-spy 0.3.14 h87a5ac0_0 conda-forge
    pyarrow 15.0.0 py311h39c9aba_0_cpu conda-forge
    pyarrow-hotfix 0.6 pyhd8ed1ab_0 conda-forge
    pyasn1 0.6.0 pyhd8ed1ab_0 conda-forge
    pyasn1-modules 0.4.0 pyhd8ed1ab_0 conda-forge
    pycparser 2.22 pyhd8ed1ab_0 conda-forge
    pydantic 2.7.1 pyhd8ed1ab_0 conda-forge
    pydantic-core 2.18.2 py311h5ecf98a_0 conda-forge
    pygments 2.18.0 pyhd8ed1ab_0 conda-forge
    pyopenssl 24.0.0 pyhd8ed1ab_0 conda-forge
    pysocks 1.7.1 pyha2e5f31_6 conda-forge
    python 3.11.9 hb806964_0_cpython conda-forge
    python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
    python-tzdata 2024.1 pyhd8ed1ab_0 conda-forge
    python-xxhash 3.4.1 py311h459d7ec_0 conda-forge
    python_abi 3.11 4_cp311 conda-forge
    pytorch-triton-rocm 2.2.0 pypi_0 pypi
    pytz 2024.1 pyhd8ed1ab_0 conda-forge
    pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
    pyyaml 6.0.1 py311h459d7ec_1 conda-forge
    ray-core 2.12.0 py311h3a73429_0 conda-forge
    ray-default 2.12.0 py311h48098de_0 conda-forge
    ray-tune 2.12.0 py311h38be061_0 conda-forge
    rdma-core 51.0 hd3aeb46_0 conda-forge
    re2 2023.09.01 h7f4b329_1 conda-forge
    readline 8.2 h8228510_1 conda-forge
    referencing 0.35.1 pyhd8ed1ab_0 conda-forge
    regex 2024.5.15 py311h331c9d8_0 conda-forge
    requests 2.32.1 pyhd8ed1ab_0 conda-forge
    rich 13.7.1 pyhd8ed1ab_0 conda-forge
    rpds-py 0.18.1 py311h5ecf98a_0 conda-forge
    rsa 4.9 pyhd8ed1ab_0 conda-forge
    s2n 1.4.1 h06160fa_0 conda-forge
    safetensors 0.4.3 py311h46250e7_0 conda-forge
    setproctitle 1.3.3 py311h459d7ec_0 conda-forge
    setuptools 69.5.1 pyhd8ed1ab_0 conda-forge
    six 1.16.0 pyh6c4a22f_0 conda-forge
    smart_open 7.0.4 pyhd8ed1ab_0 conda-forge
    snappy 1.1.10 hdb0a2a9_1 conda-forge
    sympy 1.12 pypyh9d50eac_103 conda-forge
    tensorboardx 2.6.2.2 pyhd8ed1ab_0 conda-forge
    textual 0.62.0 pyhd8ed1ab_0 conda-forge
    tk 8.6.13 noxft_h4845f30_101 conda-forge
    tokenizers 0.19.1 py311h6640629_0 conda-forge
    torch 2.2.2+rocm5.6 pypi_0 pypi
    torchaudio 2.2.2+rocm5.6 pypi_0 pypi
    torchvision 0.17.2+rocm5.6 pypi_0 pypi
    tqdm 4.66.4 pyhd8ed1ab_0 conda-forge
    transformers 4.40.2 pyhd8ed1ab_0 conda-forge
    typing-extensions 4.11.0 hd8ed1ab_0 conda-forge
    typing_extensions 4.11.0 pyha770c72_0 conda-forge
    tzdata 2024a h0c530f3_0 conda-forge
    ucx 1.15.0 ha691c75_8 conda-forge
    urllib3 2.2.1 pyhd8ed1ab_0 conda-forge
    virtualenv 20.21.0 pyhd8ed1ab_0 conda-forge
    wheel 0.43.0 pyhd8ed1ab_1 conda-forge
    wrapt 1.16.0 py311h459d7ec_0 conda-forge
    xorg-libxau 1.0.11 hd590300_0 conda-forge
    xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
    xxhash 0.8.2 hd590300_0 conda-forge
    xz 5.2.6 h166bdaf_0 conda-forge
    yaml 0.2.5 h7f98852_2 conda-forge
    yarl 1.9.4 py311h459d7ec_0 conda-forge
    zipp 3.17.0 pyhd8ed1ab_0 conda-forge
    zlib 1.2.13 hd590300_5 conda-forge
    zstd 1.5.6 ha6fb4c9_0 conda-forge

Reproduction script

This example fine-tunes an LLM for 8 different learning rates. If required, I can provide the whole Python script, which contains the trainable, and the run.sh script that specifies the SLURM parameters (even though I am pretty sure that SLURM has nothing to do with the problem).

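    import ray
    from ray import tune

    # `trainable` is defined in the full script (not shown here).
    ray.init(num_cpus=56, num_gpus=8, log_to_driver=False)

    # Sample the learning rate uniformly from [1e-6, 1e-3]
    config = {"learning_rate": tune.uniform(1e-6, 1e-3)}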

    # Create a Tuner object
    tuner = tune.Tuner(
        tune.with_resources(
            trainable,
            resources={"cpu": 7, "gpu": 1},  # Set resources for every trial run
        ),
        param_space=config,
        tune_config=tune.TuneConfig(
            num_samples=8,  # Number of samples
            metric="perplexity",  # Metric to optimize
            mode="min",  # Minimize the metric
        ),
    )
    # Run the tuning process
    results = tuner.fit()

Issue Severity

Low: It annoys or frustrates me.

jjyao commented 3 months ago

Hi @gregordecristoforo, this is just a side effect of how we implement accelerator type as a custom resource. We will re-implement accelerator type as node labels so that you no longer see 0.0/1.0.
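
The accounting behind that status line can be illustrated with a toy model. This is only a sketch and not Ray's scheduler: it assumes the node advertises the accelerator type as one extra custom resource with capacity 1.0 (matching the 0.0/1.0 in the output above), and that usage is simply the sum of what each trial requests. The helper `logical_usage` and the dictionary names are hypothetical.

```python
# Toy model of logical resource accounting; NOT Ray's actual implementation.
# Assumption: the accelerator type is a separate custom resource with
# capacity 1.0, independent of the GPU count.
ACCEL = "accelerator_type:AMD-Instinct-MI250X"

# Resources the node advertises (values taken from the status line in the issue).
node_total = {"CPU": 56.0, "GPU": 8.0, ACCEL: 1.0}

# Each trial requests only CPUs and GPUs, as in the reproduction script.
trial_request = {"CPU": 7.0, "GPU": 1.0}
num_trials = 8


def logical_usage(total, request, count):
    """Return {resource: (used, capacity)}; unrequested resources stay at 0."""
    return {
        name: (request.get(name, 0.0) * count, capacity)
        for name, capacity in total.items()
    }


usage = logical_usage(node_total, trial_request, num_trials)
for name, (used, capacity) in usage.items():
    print(f"{used}/{capacity} {name}")
```

Because no trial requests the `accelerator_type:...` resource by name, its used amount stays at 0.0 even though all 8 GPUs are busy, which is consistent with the output reported in the issue.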

gregordecristoforo commented 3 months ago

OK, good to know. Thank you for the explanation!