rsignell opened 3 months ago
Not sure if this is the answer, but have you tried setting

```yaml
variables:
  CONDA_OVERRIDE_CUDA: "12.0"
```

in your environment spec?
> We built an environment following these excellent instructions: nebari.dev/docs/how-tos/pytorch-best-practices (although this page contains the broken link above)

I'm not sure if it'll resolve the issue you're seeing, but the correct link is https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari
conda-forge will install the CPU version of PyTorch unless you use that env flag listed above. This happens because conda-store builds the env on a non-gpu worker and conda-forge detects that there is no GPU present.
@rsignell the new link Adam shared has those instructions for the different possible versions, can you check if that solves the problem you are encountering?
BTW, I opened an issue for the broken link - https://github.com/nebari-dev/nebari-docs/issues/426
@viniciusdc , yes, I used the Nebari-recommended PyTorch tool to pick the correct package versions, and then I tried to create the simplest possible conda environment for PyTorch on a Nebari GPU:
```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda=11.8
  - numpy
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"
```
It builds without errors, but alas, doesn't recognize cuda:
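One quick way to confirm which flavor conda-store actually solved for is to look at the build string in the `conda list` output. Here is a heuristic sketch (the helper name and example build strings are illustrative, not part of any Nebari tooling): conda-forge CPU builds carry `cpu` in the build string, CUDA builds carry `cuda`.

```python
def pytorch_flavor(build_string: str) -> str:
    """Classify a pytorch build string from `conda list` as cpu/cuda.

    Heuristic: conda-forge CPU builds look like 'cpu_py311h...',
    while CUDA builds contain 'cuda' (e.g. 'cuda118py311h...').
    """
    s = build_string.lower()
    if "cuda" in s:
        return "cuda"
    if "cpu" in s:
        return "cpu"
    return "unknown"


# The two conda-forge naming patterns:
print(pytorch_flavor("cpu_py311h410fd25_0"))       # cpu
print(pytorch_flavor("cuda118py311h2d97e66_200"))  # cuda
```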
Also this page https://www.nebari.dev/docs/faq/#why-doesnt-my-code-recognize-the-gpus-on-nebari seems to provide conflicting information: on the one hand it suggests using `pytorch-gpu`, but this seems at odds with the suggestion to follow https://www.nebari.dev/docs/how-tos/pytorch-best-practices/, which says to use the pytorch installation matrix, which produces:
```shell
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```
Those both need updating; I spent a while getting working environments on another project recently. Let me do a few tests and then update here.
**Troubleshooting**
First, make sure you actually have a GPU instance running with NVIDIA drivers. You can do this by running `nvidia-smi` from a terminal.
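For scripted checks (e.g. from a notebook rather than a terminal), the same test can be sketched in Python; `gpu_visible` is an illustrative helper name, not an existing Nebari utility:

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits cleanly.

    nvidia-smi being absent or failing usually means the pod was not
    scheduled on a GPU node, or the image lacks the NVIDIA utilities.
    """
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    result = subprocess.run([smi], capture_output=True)
    return result.returncode == 0
```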
Once you have done this you can install a pytorch environment in one of 3 ways. The key issue here is that the pytorch channel and the conda-forge channel use different naming conventions.
- `pytorch` channel: the cpu parts are called `pytorch` and the gpu version is `pytorch-cuda`, and both need to be installed.
- `conda-forge` channel: the cpu version is called `pytorch-cpu` and the gpu version is called `pytorch-gpu`, but if you build the environment on a machine that doesn't have a gpu (like conda-store does), conda-forge tries to be clever and installs the non-gpu version even if you specify `pytorch-gpu`. `pytorch` is a metapackage that installs whichever is needed.

**Option 1:** use the `pytorch`, `nvidia`, and `defaults` channels. Do not use `conda-forge`.

```yaml
channels:
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.11
  - pytorch
  - pytorch-cuda
  - ipykernel
variables: {}
```
**Option 2:** use the `pytorch`, `nvidia`, and `conda-forge` channels, and pin both `pytorch` and `pytorch-cuda` to come from the pytorch channel, otherwise the environment accidentally gets the cpu-only `pytorch` from conda-forge.

```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - ipykernel
variables: {}
```
**Option 3:** use only the `conda-forge` channel, but set `CONDA_OVERRIDE_CUDA: "12.0"` to force the GPU version.

```yaml
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pytorch
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"
```
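Whichever of the three specs you pick, the smoke test once the kernel starts is the same. A minimal sketch (the helper name is illustrative; it assumes only the environment's own `torch`, and degrades gracefully if torch is missing):

```python
import importlib.util


def cuda_status() -> str:
    """Summarize whether this environment's torch build can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch
    if not torch.cuda.is_available():
        # Either a CPU-only build was solved, or no GPU is visible to the pod.
        return f"no CUDA (torch {torch.__version__}, built for CUDA {torch.version.cuda})"
    return f"CUDA ok: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```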
I've tested all 3 of these on an AWS deployment with v2024.3.3 and conda-store v2024.3.1.

It is possible @rsignell-usgs's environment failed because it specified both `pytorch::pytorch-cuda=11.8` and `CONDA_OVERRIDE_CUDA: "12.0"`. I'm testing that now.
Actually, @rsignell-usgs's environment worked as well. Rich, can you run `nvidia-smi` and post the output?
```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda=11.8
  - numpy
  - ipykernel
variables:
  CONDA_OVERRIDE_CUDA: "12.0"
```
Here is a redacted aws yaml for the deployment I was on.
```yaml
amazon_web_services:
  region: us-gov-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
    gpu-tesla-g4-4x:
      instance: g4dn.12xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
    gpu-tesla-g3-2x:
      instance: g3.8xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    - display_name: Micro Instance
      access: yaml
      groups:
        - developer
        - admin
      description: Stable environment with 0.5-1 cpu / 0.5-1 GB ram
      kubespawner_override:
        cpu_limit: 1
        cpu_guarantee: 0.5
        mem_limit: 1G
        mem_guarantee: 0.5G
        node_selector:
          "dedicated": "user"
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "user"
    - display_name: Medium Instance
      description: Stable environment with 3-4 cpu / 12-16 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 12G
        node_selector:
          "dedicated": "user"
    - display_name: G4 GPU Instance 1x
      access: yaml
      groups:
        - gpu-access
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          beta.kubernetes.io/instance-type: "g4dn.xlarge"
    - display_name: G4 GPU Instance 4x
      access: yaml
      groups:
        - gpu-access
      description: 48 cpu / 192GB RAM / 4 Nvidia T4 GPU (64 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 48
        cpu_guarantee: 40
        mem_limit: 192G
        mem_guarantee: 150G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 4
        node_selector:
          beta.kubernetes.io/instance-type: "g4dn.12xlarge"
    - display_name: G3 GPU Instance 2x
      access: yaml
      groups:
        - gpu-access
      description: 32 cpu / 244GB RAM / 2 Nvidia Tesla M60 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 32
        cpu_guarantee: 30
        mem_limit: 244G
        mem_guarantee: 200G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 2
        node_selector:
          beta.kubernetes.io/instance-type: "g3.8xlarge"
```
cc: @pavithraes re: conflicting GPU best practices pages in docs.
@dharhas thanks for this info! Indeed, when I open a terminal, activate our ML environment, and type `nvidia-smi`, I get "command not found". When I google that error, it says I need to install the `nvidia-utils` package on the system. Is that expected to be found in the base GPU container?
My config section for the GPU instance looks like:
```yaml
- display_name: G4 GPU Instance 1x
  description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
  kubespawner_override:
    image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
    cpu_limit: 4
    cpu_guarantee: 3
    mem_limit: 16G
    mem_guarantee: 10G
    extra_pod_config:
      volumes:
        - name: "dshm"
          emptyDir:
            medium: "Memory"
            sizeLimit: "2Gi"
    extra_container_config:
      volumeMounts:
        - name: "dshm"
          mountPath: "/dev/shm"
    node_selector:
      "dedicated": "gpu-1x-t4"
```
I notice we don't have these lines in our config:

```yaml
extra_resource_limits:
  nvidia.com/gpu: 1
```

because we took those out while trying to get the GPU instance to launch, right @pt247 ?
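If so, that removal is a likely culprit: without `extra_resource_limits: nvidia.com/gpu`, Kubernetes never allocates a GPU device to the pod, even when it lands on a GPU node. A small sketch of the check (the function is illustrative, operating on the `kubespawner_override` mapping from a profile):

```python
def gpu_requested(kubespawner_override: dict) -> bool:
    """True if the profile asks Kubernetes for at least one NVIDIA GPU."""
    limits = kubespawner_override.get("extra_resource_limits") or {}
    return int(limits.get("nvidia.com/gpu", 0)) >= 1


# A profile with the limits removed requests no GPU:
print(gpu_requested({"mem_limit": "16G"}))                              # False
print(gpu_requested({"extra_resource_limits": {"nvidia.com/gpu": 1}}))  # True
```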
Hi @rsignell, I will review this today as part of the above issue and follow up once I test those steps.
Thanks @viniciusdc ! I'm hoping to use this on Thursday for my short course!
@rsignell just as a follow-up, I think I found the issue with the config above. I will test the config that should work now, and paste it here for you to test as well.
@rsignell I just tested the following on a fresh install in AWS. Let me know if this fixes your problem:
```yaml
node_selector:
  "dedicated": "gpu-tesla-g4"  # based on what your node groups look like from the comments above
```
(Here's a config file that worked for me for example)
```yaml
amazon_web_services:
  kubernetes_version: "1.29"
  region: us-east-1
  node_groups:
    ...
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    ...
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.3.3
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-tesla-g4"
```
```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.11
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - ipykernel
variables: {}
```
I added a recording with a basic PyTorch execution example as well:
https://github.com/nebari-dev/nebari/assets/51954708/babe9200-3a6b-4a4b-ab8e-82e284827457
Following up on our previous discussions, I met with Rich yesterday to talk about the GPU spawning issue. We ended up fixing the problem when I shared my config with him. Later, by comparing the diff between them, I identified the main issues as follows:
What can we do moving forward? We must improve our schema to ensure it correctly links and validates such relationships. Once the schema is reworked, the last issue should be resolved, as it likely stems from unsupported or unexpected configuration scenarios.
@rsignell I will keep this issue open until https://github.com/nebari-dev/nebari-docs/issues/427 is addressed
Describe the bug
I purposely named this issue the name of this missing page: https://www.nebari.dev/docs/how-tos/faq#why-doesnt-my-code-recognize-the-gpus-on-nebari 😄
We deployed Nebari 2024.03.03 on AWS and we fired up a GPU server successfully (g4dn.xlarge).
We built an environment following these excellent instructions: https://www.nebari.dev/docs/how-tos/pytorch-best-practices/ (although this page contains the broken link above)
When we `conda list` the environment, it looks good:

but when we run:

it returns `False`.

Is it clear what we did wrong? Or what we should do to debug?
Expected behavior
See above
OS and architecture in which you are running Nebari
Linux
How to Reproduce the problem?
See above
Command output
No response
Versions and dependencies used.
conda 23.3.1, kubernetes 1.29, nebari 2024.03.03
Compute environment
AWS
Integrations
No response
Anything else?
No response