Please advise @kiukchung, @d4l3k.

Looks like this issue can be closed after the fix was merged, @andywag. I still wonder if we can remove `device_request` completely from `local_docker` to let it default to `compute,utility`.
🐛 Bug
nvidia Docker images require adding libraries like `libnvidia-ml` that are part of the `utility` capability. TorchX currently only adds `compute` here.

There are two solutions I verified for this issue, not sure which one is better:

1. Add `utility` next to `compute`. Similar fixes here and here.
2. Remove `device_request` from `docker_scheduler` and rely on default values, OR let the user customize via `nvidia-container-toolkit`, e.g. we don't set this for `aws_batch_scheduler` and it works fine here. Not sure if it's backward compatible with old versions of nvidia-container-runtime though.

NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
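For reference, a minimal sketch of what option 1 amounts to at the Docker Engine API level. The function name and dict shape below are illustrative (mirroring docker-py's `DeviceRequest` semantics), not TorchX's actual code:

```python
# Hypothetical sketch: building a Docker Engine API device-request payload so
# NVIDIA containers get both "compute" and "utility" capabilities -- "utility"
# is what mounts libnvidia-ml and related driver libraries into the container.

def gpu_device_request(capabilities=("compute", "utility")):
    """Return a device request dict in Docker Engine API form."""
    return {
        "Driver": "nvidia",
        "Count": -1,  # -1 means "all available GPUs"
        # The API takes an OR-list of AND-lists of capabilities.
        "Capabilities": [list(capabilities)],
    }

print(gpu_device_request())
```

With only `("compute",)` passed in (the current TorchX behavior), the runtime would skip mounting the `utility` libraries, which is what triggers the missing `libnvidia-ml` failure described above.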
Module (check all that applies):

- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
To Reproduce

Steps to reproduce the behavior:

1.
2.
3.

This results in a crash with:
Expected behavior

`libnvidia-ml` and other libraries should be added to the container.

Environment
- How you installed TorchX (`conda`, `pip`, source, `docker`):

Additional context