Background
When users specify file_mounts that use cloud buckets in MOUNT mode, we use a device plugin (smarter-device-manager) to give the SkyPilot pod access to /dev/fuse without running the container in privileged mode.
The device plugin creates and advertises an extended resource, smarter-devices/fuse, on each node. Any pod that needs FUSE mounting requests smarter-devices/fuse: 1, and the kubelet, acting on the device plugin's allocation response, exposes /dev/fuse directly inside the container.
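As an illustration of the request described above, here is a minimal sketch of a pod manifest that asks for one FUSE device. The resource name comes from the text; the helper name make_fuse_pod, the container name, and the image are hypothetical, and this is not the manifest SkyPilot actually generates.

```python
import json

# Extended resource name advertised by the device plugin (taken from the text).
FUSE_RESOURCE = "smarter-devices/fuse"


def make_fuse_pod(name: str, image: str) -> dict:
    """Build a minimal pod manifest that requests one FUSE device.

    Hypothetical helper for illustration only.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "main",  # assumed container name
                    "image": image,
                    # Extended resources are requested under limits; when the
                    # request is omitted, Kubernetes sets it equal to the limit.
                    "resources": {"limits": {FUSE_RESOURCE: 1}},
                    # No privileged securityContext is set on the container.
                }
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_fuse_pod("fuse-demo", "ubuntu:22.04"), indent=2))
```

The point of the sketch is that the only FUSE-specific part of the spec is the resource limit; the container does not need privileged mode.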
Problems
Interaction with autoscalers: because the pod requests smarter-devices/fuse: 1, cluster autoscalers (e.g., on GKE) treat it as a resource that a newly provisioned node must offer.
They fail to provision new nodes because none of their configured nodes offers the smarter-devices/fuse resource.
The resource is "virtual": it is created by our smarter-device-manager daemonset, and GKE has no way to configure the autoscaler to ignore this resource request when making autoscaling decisions.
Reliability: users have reported cases where the smarter-devices pods fail silently, after which mounting stops working.
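The autoscaler problem above can be illustrated with a deliberately simplified model. This is not GKE's actual autoscaler logic; the node templates, resource figures, and function names are all hypothetical. It only shows why a request for a resource that no node template advertises leaves the autoscaler with nothing to scale up.

```python
from typing import Dict, Optional

Resources = Dict[str, int]


def fits(pod_requests: Resources, node_allocatable: Resources) -> bool:
    """A pod fits only if the node offers every requested resource in full."""
    return all(
        node_allocatable.get(name, 0) >= amount
        for name, amount in pod_requests.items()
    )


def pick_node_template(
    pod_requests: Resources, templates: Dict[str, Resources]
) -> Optional[str]:
    """Return the first template a new node could be created from, or None."""
    for name, allocatable in templates.items():
        if fits(pod_requests, allocatable):
            return name
    return None


# Hypothetical templates: what the autoscaler believes a fresh node would offer.
# The FUSE resource is absent because it only appears once the daemonset runs.
templates = {
    "cpu-pool": {"cpu": 8, "memory_gib": 32},
    "gpu-pool": {"cpu": 16, "memory_gib": 64, "nvidia.com/gpu": 1},
}

plain_pod = {"cpu": 4, "memory_gib": 16}
fuse_pod = {**plain_pod, "smarter-devices/fuse": 1}

print(pick_node_template(plain_pod, templates))  # cpu-pool
print(pick_node_template(fuse_pod, templates))   # None
```

In this toy model the second pod is unschedulable on every template, so no scale-up is triggered, even though the resource would exist on a real node shortly after it joined the cluster.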
Current recommendation
Continue using the smarter-devices device plugin. For users with autoscaling issues on GKE, recommend the GCSFuse CSI Driver, configured through pod_config. Eventually we should build our own FUSE proxy or find a better solution.
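For the GKE workaround, a rough sketch of what such a pod_config override might look like is given below, built as a Python dict so it can be inspected or dumped. Several things here are assumptions to verify against the current GKE and SkyPilot documentation: that pod_config sits under the kubernetes key of the SkyPilot config, that the GCSFuse CSI driver is enabled on the cluster, and that the annotation, driver name, and volumeAttributes key match the driver's documented interface. The helper name, volume name, bucket, mount path, and service account are hypothetical.

```python
import json


def gcsfuse_pod_config(bucket: str, mount_path: str, service_account: str) -> dict:
    """Sketch of a pod_config override that mounts a bucket via the GCSFuse CSI driver.

    Hypothetical helper; field names are assumptions based on the GKE
    Cloud Storage FUSE CSI driver's documented pod spec.
    """
    volume_name = "gcs-fuse-vol"  # hypothetical volume name
    return {
        "kubernetes": {
            "pod_config": {
                "metadata": {
                    # Assumed annotation asking GKE to inject the GCSFuse sidecar.
                    "annotations": {"gke-gcsfuse/volumes": "true"},
                },
                "spec": {
                    # Service account assumed to have access to the bucket.
                    "serviceAccountName": service_account,
                    "containers": [
                        {
                            "volumeMounts": [
                                {"name": volume_name, "mountPath": mount_path}
                            ]
                        }
                    ],
                    "volumes": [
                        {
                            "name": volume_name,
                            "csi": {
                                "driver": "gcsfuse.csi.storage.gke.io",
                                "volumeAttributes": {"bucketName": bucket},
                            },
                        }
                    ],
                },
            }
        }
    }


if __name__ == "__main__":
    config = gcsfuse_pod_config("my-bucket", "/data", "my-ksa")
    print(json.dumps(config, indent=2))
```

Because the bucket is mounted by the CSI driver's sidecar, a pod configured this way does not need to request smarter-devices/fuse, which is what sidesteps the autoscaler issue.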