Closed: raz-bn closed this issue 4 years ago
If you want to use this feature, please take a look at https://github.com/tkestack/gpu-manager
@mYmNeo, what feature are you talking about? I already have GPU Manager and GPU admission running in my cluster. But I don't know how to set up the vcuda-controller; the gpu-manager repo doesn't provide this info.
I managed to solve the nvidia-smi issue, per this issue in the gpu-manager repo, by adding this:
securityContext:
  privileged: true
to my YAML file. I'm still looking for a better way to grant a non-root user permission to /etc/vcuda.
The problem with the GPU stress test is still present; I hope you can help me with that @mYmNeo
@raz-bn I don't think you need to build your docker image with vcuda-controller. Just use gpu-manager and gpu-admission. It will work fine.
@joyme123 Now I am a bit confused, so what do I do with the vcuda-controller?
@raz-bn vcuda-controller is only useful if you want to build the gpu-manager image with vcuda support yourself. Otherwise, you don't need to do anything with it.
@raz-bn vcuda-controller is for this feature
GPU manager also supports the payload with fraction resource of GPU device such as 0.1 card or 100MiB gpu device memory. If you want this kind feature, please refer to vcuda-controller project.
the default gpu-manager image is built with vcuda-controller, so you don't need to do anything if you use the default gpu-manager image.
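To actually request a fraction of a card, the pod asks for the vcuda extended resources. Below is a minimal sketch based on my reading of the gpu-manager README; the resource names and units (tencent.com/vcuda-core in hundredths of a card, tencent.com/vcuda-memory in 256MiB units) and the image name are assumptions you should verify against your version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vcuda-test
spec:
  containers:
    - name: cuda
      image: <your-cuda-image>          # placeholder, use your own image
      resources:
        requests:
          tencent.com/vcuda-core: 50    # 0.5 of a card, if 1 unit = 1/100 card
          tencent.com/vcuda-memory: 30  # 30 * 256MiB, if 1 unit = 256MiB
        limits:
          tencent.com/vcuda-core: 50
          tencent.com/vcuda-memory: 30
```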
@joyme123, I will verify it and post my results, hopefully to help others in the future. But if what you say is true, I think the gpu-manager readme file needs to be changed; it says:
GPU manager also supports the payload with fraction resource of GPU device such as 0.1 card or 100MiB gpu device memory. If you want this kind feature, please refer to vcuda-controller project.
which sounds like you need to install the vcuda-controller by yourself.
Thank you very very much for your reply!! I will post results soon
@raz-bn Hi, bro. Did you figure it out? I'm also confused by the README file and cannot use fractional GPUs.
First, how is this project actually supposed to be used?
Second, should the vcuda-controller project be used if I would like to use 0.5 GPU?
Could you give more details or share your experience with GPU Manager, GPU admission, and vcuda-controller?
Any help would be appreciated!!!
@Servon-Lee hey! First of all, this project (GaiaGPU) and all of its components (GPU Manager, GPU admission, and vcuda-controller) are great! Really innovative, but also poorly documented, in my opinion (I wasted tons of time trying to fit it to my use case; more docs would have saved me plenty of time).
They do provide a paper, which is useful to understand the underlying concepts of this project, but it is not nearly enough to understand the implementation.
vcuda-controller: This component is a wrapper around the CUDA libraries. The wrapper intercepts memory allocation calls and limits them according to the pod's limits (the how and why of it working is the fun part). Still, you don't need to do anything with the vcuda-controller, since it has already been compiled for you when you deploy the GPU-manager project.
GPU Manager: This component has a few parts (I'm not going to explain them all), but it is the one that makes sure your pods have the vcuda-controller in them.
GPU admission: This component is a scheduler extender, and its primary role is to make sure there are no GPU fragmentation issues.
If you want to deploy this project on your cluster, you only need to deploy GPU-manager and GPU-admission, unless you have only one GPU card (testbed env), in which case you only need GPU Manager, since GPU fragmentation is not an issue.
@raz-bn Thanks for your detailed explanation; it's very useful to me. But when I configure GPU admission, I have to change the scheduler's policy in step 2.2, so I use this command: kube-scheduler --policy-config-file=scheduler-policy-config.json --use-legacy-policy-config=true. However, this error popped up and confused me:
By the way, port 10251 is already listened on by the default kube-scheduler.
My intuition tells me that killing the process and then running the command is not a good idea (I actually did this, but it didn't help). I have no idea how to correctly run kube-scheduler --policy-config-file=scheduler-policy-config.json --use-legacy-policy-config=true. Did you encounter a similar situation? Looking forward to your reply. Thanks a ton!
@Servon-Lee I'm not sure about the command (I did it manually on my master node), but I don't believe it is the problem. In order to deploy it correctly, you first need to deploy GPU admission as a Deployment and set up a service account and a Service with a cluster IP. Then, in the scheduler config file, you need to put the Service's cluster IP, since the scheduler will communicate with the extender via this address. localhost will only work if you deploy gpu-admission on the same node as the kube-scheduler (and if you do that, make sure you replicate it on all the masters).
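For reference, a minimal sketch of the scheduler policy file with the extender section filled in. The /scheduler path, port 3456, and the verb/flag values are assumptions based on the gpu-admission defaults mentioned in this thread; verify them against your version, and replace the cluster IP placeholder:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://<gpu-admission-cluster-ip>:3456/scheduler",
      "filterVerb": "predicates",
      "enableHttps": false,
      "nodeCacheCapable": false
    }
  ]
}
```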
i did it manually on my master node
I wonder how you did that manually? I always get the error failed to create listener: failed to listen on 0.0.0.0:10251: listen tcp 0.0.0.0:10251: bind: address already in use
@Servon-Lee I guess you are getting this error since the port you are trying to bind is already in use.
Yes, it is used by the default kube scheduler, but I don't know what to do next.
@Servon-Lee
*note: If you modify the scheduler policy without setting up the extender first, you will not be able to schedule any pods in your cluster. This is the case with any scheduler extender.
4. Create a Service with a cluster IP (the target port should be the same as the gpu-admission listening port; the default is 3456).
@raz-bn Thank you soooooo much. But I'm still not sure how to create a service. Do you mean systemctl start gpu-admission.service and systemctl enable gpu-admission.service?
Besides, when running kubectl apply -f gpu-admission.yaml, I got CrashLoopBackOff, as in the figure below:
Is there anything wrong?
@Servon-Lee It seems like you are not really familiar with k8s; I suggest you read (or watch some videos) about it and its core concepts. It will really help you deploy this project. I was talking about a k8s Service.
Now, about the error you get: I'm pretty sure your YAML file has this env:
- name: EXTRA_FLAGS
  value: "--incluster-mode=false"
and it's supposed to be --incluster-mode=true, since you deployed it as a pod in your cluster 😃
@raz-bn Thank you bro.🤝 I'm new around here. I think it's time to learn k8s systematically.
@raz-bn Hi bro, sorry to bother you, but I'm really eager to use this feature. Could you please provide all the prerequisites needed to deploy gpu-manager and gpu-admission, such as the service file of gpu-admission? Thanks a lot!
+1
@Servon-Lee @xs233
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-admission
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-admission
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-admission
  template:
    metadata:
      labels:
        app: gpu-admission
    spec:
      containers:
        - name: gpu-admission
          image: <Image>
          env:
            - name: LOG_LEVEL
              value: '5'
            - name: EXTRA_FLAGS
              value: '--incluster-mode=true'
          securityContext:
            privileged: true
          ports:
            - hostPort: 3456
              containerPort: 3456
              protocol: TCP
          volumeMounts:
            - name: kubernetes
              readOnly: true
              mountPath: /etc/kubernetes/
            - name: log
              mountPath: /var/log/gpu-admission
      serviceAccount: gpu-admission
      volumes:
        - name: kubernetes
          configMap:
            name: gpu-admission.config
            defaultMode: 420
        - name: log
          hostPath:
            path: /var/log/gpu-admission
            type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
  name: gpu-admission
spec:
  selector:
    app: gpu-admission
  ports:
    - name: 3456-tcp
      protocol: TCP
      port: 3456
      targetPort: 3456
  type: ClusterIP
This should work, but it is not production-ready by any means; make sure to use your image instead of the placeholder. After creating it, you will need to get the cluster IP assigned to the service and put it in the scheduler policy in the urlPrefix field.
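One way to fetch that cluster IP, assuming the Service above was created as gpu-admission in the default namespace (adjust the name and add -n <namespace> if yours differ):

```
# Print the cluster IP assigned to the gpu-admission Service
kubectl get svc gpu-admission -o jsonpath='{.spec.clusterIP}'
```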
@raz-bn Reaaaaally appreciate!
After creating you will need to get the cluster IP assigned to the service and put it in the scheduler policy in the urlPrefix field.
This is very crucial!
@Servon-Lee @raz-bn Thanks for your discussion! I want to deploy it with the above YAML file. I wonder where the gpu-admission.config is. Can I delete the following configuration?
volumes:
  - name: kubernetes
    configMap:
      name: gpu-admission.config
      defaultMode: 420
Hi! I want to try to use your project; however, it is not clear to me how to use the vcuda-controller. Should the image I get from the script ./build-img.sh be used as a base image for my GPU application? Should it be deployed on my k8s cluster?
I tried to use the vcuda-controller as a base image for a simple CUDA GPU stress test using this Dockerfile:
and I get this error:
Also, when trying to run the vcuda-controller image as-is on my k8s cluster (GPU-manager and GPU-admission are also present) using the example YAML from the GPU-manager repo:
verifying the GPU is attached:
I cannot manage to make nvidia-smi work; it just hangs without any output. I would be happy if you could share more information about how to use the vcuda-controller.