GPUs - Githubissues

jacobtomlinson commented 6 years ago

We should add the ability to use GPUs.

jacobtomlinson commented 6 years ago

I've created a new autoscaling group that uses p3.2xlarge instances, currently the smallest GPU instance type available in London (eu-west-2). I've also added a taint so that non-GPU work does not get scheduled on them.

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-11-14T14:02:09Z
  labels:
    kops.k8s.io/cluster: cluster.k8s.informaticslab.co.uk
  name: nodes-GPU-eu-west-2a-p3-2xlarge
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/cluster.k8s.informaticslab.co.uk: owned
  hooks:
  - execContainer:
      image: kopeio/nvidia-bootstrap:1.6
  - manifest: "Type=oneshot                    \nExecStart=/usr/bin/docker run --net
      host quay.io/sergioballesteros/check-aws-tags\nExecStartPost=/bin/systemctl
      restart kubelet.service\n"
    name: ensure-aws-tags.service
    requires:
    - docker.service
    roles:
    - Node
  image: kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08
  kubelet:
    featureGates:
      Accelerators: "true"
  machineType: p3.2xlarge
  maxSize: 1
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-GPU-eu-west-2a-p3-2xlarge
  role: Node
  rootVolumeSize: 120
  rootVolumeType: gp2
  subnets:
  - eu-west-2a
  taints:
  - informaticslab.co.uk/dedicated=gpu:NoSchedule
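
For reference, the taint string above follows the `key=value:effect` form. A minimal Python sketch of how it breaks down, and of the toleration a pod would need to carry to be scheduled on these nodes, might look like this. The helper names `parse_taint` and `toleration_for` are hypothetical and are not part of kops or Kubernetes.

```python
def parse_taint(spec: str) -> dict:
    """Split a 'key=value:effect' taint string into its three parts."""
    # The effect is whatever follows the last colon.
    key_value, _, effect = spec.rpartition(":")
    # The value is optional; partition leaves it empty when there is no '='.
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


def toleration_for(taint: dict) -> dict:
    """Build a toleration that matches the given taint exactly."""
    return {
        "key": taint["key"],
        "operator": "Equal",
        "value": taint["value"],
        "effect": taint["effect"],
    }


taint = parse_taint("informaticslab.co.uk/dedicated=gpu:NoSchedule")
print(toleration_for(taint))
```

Any pod that should run on the GPU group needs a toleration equivalent to the one printed here; pods without it are kept off by the `NoSchedule` effect.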

We can add the following entry to the KubeSpawner profile list to offer a GPU notebook.

{
  'display_name': 'Informatics Lab - ML Pangeo Notebook v0.5.10 (expensive)',
  'kubespawner_override': {
    'image': '536099501702.dkr.ecr.eu-west-2.amazonaws.com/pangeo-notebook:0.5.10',
    'cpu_limit': 8,
    'mem_limit': '54G',
    'extra_resource_guarantees': {"nvidia.com/gpu": "1"},
    'tolerations': [
      {
          'key': 'informaticslab.co.uk/dedicated',
          'operator': 'Equal',
          'value': 'gpu',
          'effect': 'NoSchedule'
      },
    ]
  }
}
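
As a rough sketch of how an entry like this could be built and registered in `jupyterhub_config.py`, something along these lines may work. The function name `gpu_profile` and its parameters are hypothetical. The sketch also sets `extra_resource_limits` alongside the guarantee, on the assumption that Kubernetes expects extended resources such as `nvidia.com/gpu` to be given as a limit (the request then defaults to the same value). Whether that is related to the problems in this issue is not established, and whether `kubespawner_override` honours `tolerations` depends on the KubeSpawner version in use.

```python
def gpu_profile(image: str, display_name: str, gpus: int = 1) -> dict:
    """Return a KubeSpawner profile entry that requests GPUs (illustrative only)."""
    # Toleration matching the taint on the GPU instance group.
    toleration = {
        "key": "informaticslab.co.uk/dedicated",
        "operator": "Equal",
        "value": "gpu",
        "effect": "NoSchedule",
    }
    gpu_resource = {"nvidia.com/gpu": str(gpus)}
    return {
        "display_name": display_name,
        "kubespawner_override": {
            "image": image,
            "cpu_limit": 8,
            "mem_limit": "54G",
            # Assumption: limit and guarantee are kept equal for the GPU resource.
            "extra_resource_limits": dict(gpu_resource),
            "extra_resource_guarantees": dict(gpu_resource),
            "tolerations": [toleration],
        },
    }


profile = gpu_profile(
    image="536099501702.dkr.ecr.eu-west-2.amazonaws.com/pangeo-notebook:0.5.10",
    display_name="Informatics Lab - ML Pangeo Notebook v0.5.10 (expensive)",
)

# In jupyterhub_config.py the entry would sit next to the existing profiles:
# c.KubeSpawner.profile_list = [existing_profile, profile]
```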

This profile requests one GPU and is sized to fill a p3.2xlarge instance. However, a few things stop it from working:

The tolerations are not being set by KubeSpawner for some reason.
When the scheduler considers a p3.2xlarge node, the node doesn't seem to advertise "nvidia.com/gpu": "1" as available.
Some plugin DaemonSet pods, such as the FUSE volume plugin, do not get scheduled on the GPU nodes because of the taint.
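
The first and third points both come down to how tolerations are matched against the node taint. The following is a simplified Python sketch of that matching rule, not the scheduler's actual code; the function names `tolerates` and `schedulable` are hypothetical, and only the `NoSchedule` effect is considered.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint (simplified)."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches any taint effect.
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only meaningful with Exists, where it matches every key.
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod can be placed only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


gpu_taint = {
    "key": "informaticslab.co.uk/dedicated",
    "value": "gpu",
    "effect": "NoSchedule",
}

# A pod whose tolerations were never applied, or a DaemonSet pod without any.
print(schedulable([], [gpu_taint]))  # False

# The notebook toleration from the profile above.
notebook = {"key": "informaticslab.co.uk/dedicated", "operator": "Equal",
            "value": "gpu", "effect": "NoSchedule"}
print(schedulable([notebook], [gpu_taint]))  # True

# A blanket toleration of the kind a DaemonSet could carry.
print(schedulable([{"operator": "Exists"}], [gpu_taint]))  # True
```

Under this rule, a notebook pod that reaches the scheduler without its tolerations behaves like the first case, and a DaemonSet such as the FUSE plugin would presumably need either the specific toleration or a blanket `Exists` toleration in its pod template to run on the GPU nodes.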

jacobtomlinson commented 6 years ago

https://github.com/kubernetes/kops/blob/master/docs/gpu.md

jacobtomlinson commented 6 years ago

https://github.com/dcwangmit01/kops/tree/gpu-device-plugins-3/hooks/nvidia-device-plugin

jacobtomlinson commented 6 years ago

https://github.com/kubernetes/kops/pull/5502

informatics-lab / our-pangeo

GPUs #26