nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev

[MAINT] - Test GPU configurations for AWS and update relevant documentation #2400

Open viniciusdc opened 3 months ago

viniciusdc commented 3 months ago

A recent deployment of Nebari 2024.03.03 on AWS with a g4dn.xlarge GPU profile has led to an issue where, despite the CUDA-related packages appearing correctly configured, torch.cuda.is_available() still returns False. This indicates that the GPU's CUDA drivers are not being recognized. Additionally, the nvidia-smi command is not found, which suggests a problem with the NVIDIA driver installation or integration on the node (something that should be handled automatically when gpu: true is set in the configuration settings).
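
For triage, it helps to separate a CPU-only PyTorch build from a missing host driver by checking both torch and nvidia-smi from inside the spawned pod. A minimal sketch, assuming only that PyTorch is installed in the active environment:

```python
# Run from a terminal or notebook inside the GPU profile's pod.
import shutil
import subprocess

import torch

print("torch version:        ", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)        # None => CPU-only build
print("cuda available:       ", torch.cuda.is_available()) # False is the failure reported here

# nvidia-smi comes from the host driver; if it is missing inside the container,
# the NVIDIA driver / device plugin on the node is the likely culprit.
nvidia_smi = shutil.which("nvidia-smi")
print("nvidia-smi path:      ", nvidia_smi)
if nvidia_smi:
    print(subprocess.run([nvidia_smi], capture_output=True, text=True).stdout)
```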

Steps to resolve this issue:

Current configuration profile:

  - display_name: G4 GPU Instance 1x
    description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
    kubespawner_override:
      image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
      extra_pod_config:
        volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
      extra_container_config:
        volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
      node_selector:
        "dedicated": "gpu-1x-t4"

Additional details

Relevant issue #2392

viniciusdc commented 3 months ago

More details are in the original thread, but the main problem was that our fix for the scale-to-zero issue introduced a new tagging mechanism using the dedicated attribute in each profile. This was not documented anywhere that I could find, and our GPU docs not only were never migrated (or were removed) but also did not follow the new schema.