nebari-dev / nebari


[BUG] - Why doesn't my code recognize the GPU on Nebari ? #2392

Status: Open · opened by rsignell 3 months ago

rsignell commented 3 months ago

Describe the bug

I purposely gave this issue the same name as this missing page: https://www.nebari.dev/docs/how-tos/faq#why-doesnt-my-code-recognize-the-gpus-on-nebari 😄

We deployed Nebari 2024.03.03 on AWS and successfully fired up a GPU server (g4dn.xlarge).

We built an environment following these excellent instructions: https://www.nebari.dev/docs/how-tos/pytorch-best-practices/ (although this page contains the broken link above)

When we run conda list in the environment, it looks good:

08:38 $ conda activate global-pangeo-ml
(global-pangeo-ml) rsignell:~ 
08:38 $ conda list cuda
# packages in environment at /home/conda/global/envs/global-pangeo-ml:
#
# Name                    Version                   Build  Channel
cuda-cudart               11.8.89                       0    nvidia
cuda-cupti                11.8.87                       0    nvidia
cuda-libraries            11.8.0                        0    nvidia
cuda-nvrtc                11.8.89                       0    nvidia
cuda-nvtx                 11.8.86                       0    nvidia
cuda-runtime              11.8.0                        0    nvidia
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
pytorch-cuda              11.8                 h7e8668a_5    pytorch

but when we run:

torch.cuda.is_available()

it returns False.

Is it clear what we did wrong?
Or what we should do to debug?

Expected behavior

See above

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

See above

Command output

No response

Versions and dependencies used.

conda 23.3.1, kubernetes 1.29, nebari 2024.03.03

Compute environment

AWS

Integrations

No response

Anything else?

No response

kcpevey commented 3 months ago

Not sure if this is the answer, but have you tried setting

variables:
  CONDA_OVERRIDE_CUDA: "12.0"

in your environment spec?

Adam-D-Lewis commented 3 months ago

We built an environment following these excellent instructions: nebari.dev/docs/how-tos/pytorch-best-practices (although this page contains the broken link above)

I'm not sure if it'll resolve the issue you're seeing, but the correct link is https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari

dharhas commented 3 months ago

conda-forge will install the CPU version of PyTorch unless you set the environment variable listed above. This happens because conda-store builds the environment on a non-GPU worker, and conda-forge detects that there is no GPU present.
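
A quick way to confirm which build the solver actually produced is to inspect the compiled CUDA version from inside the environment (a minimal sketch, nothing Nebari-specific assumed):

import torch

# A CPU-only solve ships a PyTorch build with no CUDA runtime compiled in,
# so torch.version.cuda is None even on a node that has a GPU.
print("PyTorch:", torch.__version__)
print("Compiled against CUDA:", torch.version.cuda)   # None => CPU-only build
print("CUDA available at runtime:", torch.cuda.is_available())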

viniciusdc commented 3 months ago

@rsignell the new link Adam shared has instructions for the different possible versions; can you check whether that solves the problem you are encountering?

Adam-D-Lewis commented 3 months ago

BTW, I opened an issue for the broken link - https://github.com/nebari-dev/nebari-docs/issues/426

rsignell commented 3 months ago

@viniciusdc, yes, I used the Nebari-recommended PyTorch tool to get the correct package versions, and then I tried to create the simplest possible conda environment for PyTorch on a Nebari GPU:

channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda=11.8
  - numpy
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"

It builds without errors but, alas, it still doesn't recognize CUDA (screenshot omitted).

rsignell commented 3 months ago

Also, this page https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari seems to provide conflicting information: on the one hand it suggests using pytorch-gpu, but this seems at odds with the suggestion to follow https://www.nebari.dev/docs/how-tos/pytorch-best-practices/, which says to use the PyTorch installation matrix, which produces:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

dharhas commented 3 months ago

Those both need updating; I recently spent a while getting working environments on another project. Let me do a few tests and then update here.

dharhas commented 3 months ago

Troubleshooting.

First, make sure you actually have a GPU instance running with NVIDIA drivers. You can do this by running nvidia-smi from a terminal:

(screenshot of nvidia-smi output omitted)
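
If you prefer to check from a notebook rather than a terminal, something along these lines works too (a minimal sketch that just wraps the same nvidia-smi call):

import shutil
import subprocess

# If nvidia-smi is not on the PATH, the NVIDIA drivers were never mounted into
# the container, and no environment spec will make the GPU visible to PyTorch.
if shutil.which("nvidia-smi") is None:
    print("nvidia-smi not found -- no GPU drivers in this container")
else:
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)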

Once you have confirmed this, you can install a PyTorch environment in one of three ways. The key issue here is that the pytorch channel and the conda-forge channel use different naming conventions.

  1. Use the pytorch, nvidia, and defaults channels. Do not use conda-forge.
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.11
  - pytorch
  - pytorch-cuda
  - ipykernel
variables: {}
  2. Use the pytorch, nvidia, and conda-forge channels, and pin both pytorch and pytorch-cuda to come from the pytorch channel; otherwise the environment accidentally gets the CPU-only pytorch from conda-forge.
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - ipykernel
variables: {}
  3. Use only the conda-forge channel, but set CONDA_OVERRIDE_CUDA: "12.0" to force the GPU version.
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pytorch
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"
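
Whichever of the three specs you pick, a quick smoke test in a notebook cell confirms the GPU is actually usable (a minimal sketch; the tensor sizes are arbitrary):

import torch

# Fails fast if the build is CPU-only or the drivers are missing.
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
x = torch.rand(1024, 1024, device="cuda")
y = x @ x                                    # matrix multiply on the GPU
print(torch.cuda.get_device_name(0), float(y.sum()))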

I've tested all 3 of these on an AWS deployment with v2024.3.3 and conda-store v2024.3.1

It is possible @rsignell-usgs's environment failed because it specified both pytorch::pytorch-cuda=11.8 and CONDA_OVERRIDE_CUDA: "12.0".

I'm testing that now.

dharhas commented 3 months ago

Actually, @rsignell-usgs's environment worked as well. Rich, can you run nvidia-smi and post the output?

channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda=11.8
  - numpy
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"

Here is a redacted AWS config YAML for the deployment I was on.

amazon_web_services:
  region: us-gov-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
    gpu-tesla-g4-4x:
      instance: g4dn.12xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
    gpu-tesla-g3-2x:
      instance: g3.8xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
  - display_name: Micro Instance
    access: yaml
    groups:
    - developer
    - admin
    description: Stable environment with 0.5-1 cpu / 0.5-1 GB ram
    kubespawner_override:
      cpu_limit: 1
      cpu_guarantee: 0.5
      mem_limit: 1G
      mem_guarantee: 0.5G
      node_selector:
        "dedicated": "user"
  - display_name: Small Instance
    description: Stable environment with 1.5-2 cpu / 6-8 GB ram
    default: true
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1.5
      mem_limit: 8G
      mem_guarantee: 6G
      node_selector:
        "dedicated": "user"
  - display_name: Medium Instance
    description: Stable environment with 3-4 cpu / 12-16 GB ram
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 12G
      node_selector:
        "dedicated": "user"
  - display_name: G4 GPU Instance 1x
    access: yaml
    groups:
    - gpu-access
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 1
      node_selector:
        beta.kubernetes.io/instance-type: "g4dn.xlarge"
  - display_name: G4 GPU Instance 4x
    access: yaml
    groups:
    - gpu-access
    description: 48 cpu / 192GB RAM / 4 Nvidia T4 GPU (64 GB GPU RAM)
    kubespawner_override:
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
      cpu_limit: 48
      cpu_guarantee: 40
      mem_limit: 192G
      mem_guarantee: 150G
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 4
      node_selector:
        beta.kubernetes.io/instance-type: "g4dn.12xlarge"
  - display_name: G3 GPU Instance 2x
    access: yaml
    groups:
    - gpu-access
    description: 32 cpu / 244GB RAM / 2 Nvidia Tesla M60 GPU (16 GB GPU RAM)
    kubespawner_override:
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
      cpu_limit: 32
      cpu_guarantee: 30
      mem_limit: 244G
      mem_guarantee: 200G
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      extra_resource_limits:
        nvidia.com/gpu: 2
      node_selector:
        beta.kubernetes.io/instance-type: "g3.8xlarge"

dharhas commented 3 months ago

cc: @pavithraes re: conflicting GPU best practices pages in docs.

rsignell commented 3 months ago

@dharhas thanks for this info! Indeed, when I open a terminal, activate our ML environment, and type nvidia-smi, I get "command not found". When I google that, it says that if nvidia-smi is not found, I need to install the nvidia-utils package on the system. Is it expected to be available in the base GPU container?

My config section for the GPU instance looks like:

  - display_name: G4 GPU Instance 1x
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      node_selector:
        "dedicated": "gpu-1x-t4"

I notice we don't have these lines in our config:

      extra_resource_limits:
        nvidia.com/gpu: 1

because we took those out while trying to get the GPU instance to launch, right @pt247 ?

viniciusdc commented 3 months ago

Hi @rsignell, I will review this today as part of the above issue and follow up once I test those steps.

rsignell commented 3 months ago

Thanks @viniciusdc ! I'm hoping to use this on Thursday for my short course!

viniciusdc commented 3 months ago

@rsignell just as a follow-up, I think I found the issue with the config above. I will now test the config that should work, and paste it here for you to test as well.

viniciusdc commented 3 months ago

@rsignell I just tested the following on a fresh install in AWS. Let me know if this fixes your problem:

  1. Your node selectors need to match the node group name, not the instance type:
      node_selector:
        "dedicated": "gpu-tesla-g4" # based on what your node groups look like from the comments above

    (For example, here's a config file that worked for me:)

    amazon_web_services:
      kubernetes_version: "1.29"
      region: us-east-1
      node_groups:
        ...
        gpu-tesla-g4:
          instance: g4dn.xlarge
          min_nodes: 0
          max_nodes: 5
          single_subnet: false
          gpu: true
    profiles:
      jupyterlab:
        ...
        - display_name: G4 GPU Instance 1x
          description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
          kubespawner_override:
            image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
            cpu_limit: 4
            cpu_guarantee: 3
            mem_limit: 16G
            mem_guarantee: 10G
            extra_pod_config:
              volumes:
                - name: "dshm"
                  emptyDir:
                    medium: "Memory"
                    sizeLimit: "2Gi"
            extra_container_config:
              volumeMounts:
                - name: "dshm"
                  mountPath: "/dev/shm"
            extra_resource_limits:
              nvidia.com/gpu: 1
            node_selector:
              "dedicated": "gpu-tesla-g4"
  2. As Dharhas mentioned above, this is an environment that builds and works:
    channels:
    - pytorch
    - nvidia
    - conda-forge
    dependencies:
    - python=3.11
    - pytorch::pytorch
    - pytorch::pytorch-cuda
    - ipykernel
    variables: {}

I added a recording with a basic PyTorch execution example as well:

https://github.com/nebari-dev/nebari/assets/51954708/babe9200-3a6b-4a4b-ab8e-82e284827457
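
For anyone who can't play the recording, the kind of basic execution it refers to boils down to something like this (a minimal sketch; the model and data here are arbitrary placeholders, not necessarily what the recording shows):

import torch
import torch.nn as nn

# One forward/backward pass on the GPU as an end-to-end check.
device = torch.device("cuda")
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
print("loss computed on", device, "=", loss.item())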

viniciusdc commented 3 months ago

Following up on our previous discussions, I met with Rich yesterday to talk about the GPU spawning issue. We ended up fixing the problem when I shared my config with him. Later, by comparing the diff between the two configs, I identified the main issues as follows:

  1. One of the profiles attempted to reference a nonexistent node_group due to a typo.
  2. The GPU node group was missing a gpu: true key, preventing the driver installation daemon from installing necessary dependencies for GPU connectivity, which is why nvidia-smi was not functioning.
  3. Interestingly, it appears possible to request a GPU even without the gpu: true flag; however, the request fails due to missing drivers. Additionally, specifying nvidia.com/gpu in that case causes the profile to time out.

What can we do moving forward? We must improve our schema to ensure it correctly links and validates such relationships. Once the schema is reworked, the last issue should be resolved, as it likely stems from unsupported or unexpected configuration scenarios.
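
For illustration only, here is a hypothetical sketch of the kind of cross-field check such a schema rework could add; the function name and config layout are assumptions, not Nebari's actual validation code:

def validate_gpu_profiles(config: dict) -> list[str]:
    """Hypothetical check: a profile's "dedicated" node_selector must name an
    existing node group, and GPU requests must target a group with gpu: true."""
    errors = []
    node_groups = config.get("amazon_web_services", {}).get("node_groups", {})
    for profile in config.get("profiles", {}).get("jupyterlab", []):
        name = profile.get("display_name", "<unnamed profile>")
        override = profile.get("kubespawner_override", {})
        target = override.get("node_selector", {}).get("dedicated")
        wants_gpu = "nvidia.com/gpu" in override.get("extra_resource_limits", {})
        if target is None:
            continue  # profile selects nodes some other way; out of scope here
        if target not in node_groups:
            errors.append(f"{name}: node_selector points at unknown node group {target!r}")
        elif wants_gpu and not node_groups[target].get("gpu", False):
            errors.append(f"{name}: requests a GPU but node group {target!r} lacks gpu: true")
    return errors

Run against the parsed nebari-config.yaml, a check like this would have flagged both the node-group typo and the missing gpu: true described above.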

viniciusdc commented 3 months ago

@rsignell I will keep this issue open until https://github.com/nebari-dev/nebari-docs/issues/427 is addressed