kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Kubelet device plugin to pass additional information about Pods to Allocate call #59109

Closed kad closed 5 years ago

kad commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature /sig node /area hw-accelerators

What happened: At the moment, the device plugin API's Allocate call carries no information about the workload that is requesting a particular HW device.

What you expected to happen:

We would like the Allocate call to receive more information from the Pod spec. At minimum it should include pod/container annotations; ideally it could be the full spec. A rough sketch of what that could look like follows below.
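
For illustration, here is a minimal sketch of such an extended request, written as plain Go structs in the style of the generated device plugin API types. Only `DevicesIDs` reflects the existing API; the annotation fields are hypothetical additions, not part of any released version:

```go
// Hypothetical sketch only: plain Go structs in the style of the
// generated device plugin gRPC types. DevicesIDs mirrors the existing
// request; the annotation fields are the proposed (not real) additions.
package deviceplugin

// AllocateRequest is what the kubelet would send when a container
// needs devices from this plugin.
type AllocateRequest struct {
	// Existing field: IDs of the devices to allocate.
	DevicesIDs []string

	// Proposed addition: annotations from the Pod that requested
	// the devices, so the plugin can tailor its response.
	PodAnnotations map[string]string

	// Further down the line, per-container annotations could be added
	// here too (no such concept exists in the Pod spec today).
	ContainerAnnotations map[string]string
}
```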

Usage scenarios for that:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: cc: @RenaudWasTaken @ns-sundar @sameo

RenaudWasTaken commented 6 years ago

Duplicate of kubernetes/kubernetes#59110. Can you close this one? I'll answer on #59110.

kad commented 6 years ago

@RenaudWasTaken no, it is not a duplicate. This one is about getting pod annotations into the Allocate call; #59110 is about Deallocate :)

RenaudWasTaken commented 6 years ago

Oh sorry you are right! cc @jiayingz @vishh @vikaschoudhary16

RenaudWasTaken commented 6 years ago

I am in favor of passing information through the Allocate call. This is something we are going to need for GPUs down the line too, in order to signal whether the container has a specific need for GPU-to-GPU communication.

In general I wonder if the Allocate call should be reformatted as a single call that would contain:

In general the pod annotations are good, but it would be better to have annotations specific to the container.

kad commented 6 years ago

@RenaudWasTaken currently, how many times is Allocate() called when there are init containers and multiple containers in the pod? If a normal container in the pod asks for a device, does that mean that device will also be exposed to the init container?

RenaudWasTaken commented 6 years ago

> How many times is Allocate() called when there are init containers and multiple containers in the pod?

Once per container requesting the advertised resource. If an init container requests a device and a normal container requests a device, the current algorithm will re-use that device.
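
To make that re-use behavior concrete, here is a simplified, self-contained Go sketch of the allocation flow described above (illustrative only; the real logic lives in the kubelet's devicemanager and differs in detail, and all type and function names here are made up):

```go
package main

import "fmt"

// Illustrative stand-ins for kubelet/pod types; all names are made up.
type Container struct {
	Name          string
	DevicesWanted int
}

type Pod struct {
	InitContainers []Container
	Containers     []Container
}

// allocatePod issues one (stubbed) Allocate call per container that
// requests devices, preferring devices already released by an earlier
// init container.
func allocatePod(pod Pod, free []string) {
	var reusable []string

	grab := func(n int) []string {
		var ids []string
		for len(ids) < n {
			if len(reusable) > 0 {
				// Prefer a device an init container has finished with.
				ids = append(ids, reusable[0])
				reusable = reusable[1:]
			} else {
				ids = append(ids, free[0])
				free = free[1:]
			}
		}
		return ids
	}

	for _, c := range pod.InitContainers {
		ids := grab(c.DevicesWanted)
		fmt.Printf("Allocate(%v) for init container %q\n", ids, c.Name)
		// An init container releases its devices once it exits.
		reusable = append(reusable, ids...)
	}
	for _, c := range pod.Containers {
		ids := grab(c.DevicesWanted)
		fmt.Printf("Allocate(%v) for container %q\n", ids, c.Name)
	}
}

func main() {
	pod := Pod{
		InitContainers: []Container{{Name: "init", DevicesWanted: 1}},
		Containers:     []Container{{Name: "app", DevicesWanted: 1}},
	}
	allocatePod(pod, []string{"dev-0", "dev-1"})
}
```

Running this prints two Allocate lines that both contain dev-0: the call happens once per container, while the device itself is re-used.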

kad commented 6 years ago

That might actually open a can of worms :(

resouer commented 6 years ago

@kad Please keep me in the loop. We are working on #49964.

Including pod annotations, at least in Allocate(), would be a potential solution we are looking at, so we can deal with topology awareness in the DP.

@sameo knows what I am doing :)

RenaudWasTaken commented 6 years ago

> If the init container is using musl as libc, and the workload container is using glibc, the Allocate() response for them might be different (e.g. different mounts for libraries). Simple re-use of the response might not be enough.

The device is re-used but not the Allocate call. An Allocate call is still issued for each init container and each regular container.

RenaudWasTaken commented 6 years ago

This was discussed on a parallel issue; @jiayingz mentioned the following:

> I know there have been multiple asks on passing more container and pod information to plugins. I think other K8s areas have also seen similar asks, but they have avoided doing this for good reasons. We are going to see more plugins in different areas like devices, storage, and networking, and will need a general policy on what information should be passed to plugins. If you believe there are strong reasons to pass pod and container information to plugins, I suggest you discuss this with the sig-node leads and other senior K8s leads. Until we get consensus and green lights from them, we are not going to open the door on device plugins.
>
> For Solarflare device plugins, I think using a checkpoint file is meant to be a temporary solution to unblock their progress. I don't know the specific reasons why they don't want to use CNI to expose network interfaces. It could be that CNI is hard to use. If that is the case, I think there have been discussions on a network plugin model that could help address this problem. When we were discussing a general kubelet plugin registration model, we also discussed how to make it easy to support multiple types of plugins from a single component. In terms of how to make it easy to expose a network interface on K8s, I think sig-network is perhaps a better place for such topics.

kad commented 6 years ago

@RenaudWasTaken OK, so re-using the same device is fine, but then for the Allocate calls for each init container or normal container, there should be a way for the plugin to get that information.

With respect to unifying the ways plugins are called, I think we are comparing apples and oranges. I'll look in more detail at how CSI handles it, but my initial thinking is that by the time SIG-Storage designed the CSI APIs, there was a good understanding of which "objects" the plugins would be operating on and how resource management is done for those objects (storage classes, mount options, sizes, etc.). With device plugins we are not there yet, and we might need several iterations of changes to cover all the use cases and gather the experience needed for a final design of the interfaces.

resouer commented 6 years ago

@kad @RenaudWasTaken @jiayingz I am not sure about this:

> Agree with Vikas' concerns. I think other kubelet plugins, like CSI, have been explicit about avoiding passing Pod and Container information to plugins. Device plugins probably want to follow a similar model. I think this also allows a better separation of concerns, in that device plugins only need to care about devices.

It indeed makes 100% sense to me that device plugins should only need to care about devices, but currently device allocation is done in the kubelet, with a linear model, which looks problematic to me.

Since we mentioned CSI: volumes do have their own scheduling logic in the scheduler, not to mention devices, which inherently have the concept of topology.

My concern with including annotations etc. in dp.Allocate() is that it will trigger more requests to add things like the pod name or the pod spec to the DP API, and that would be a problem.

So the right approach sounds like moving the allocation logic from the kubelet to the scheduler? But that would require at least one extra API object. We may want to discuss it.

kad commented 6 years ago

Reasons to include annotations are not about choosing a device or about scheduling, but about how to prepare the device according to the request. We have no way at the moment, e.g., to ask for an OpenCL accelerator and then choose which versions of libraries to mount inside the container (glibc- vs. musl-based, open-source or closed optimized binaries, etc.). This is all about passing device-specific attributes, which is currently not possible; a rough plugin-side sketch follows below.
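
As a rough illustration of the plugin side (everything here is hypothetical: the annotation key, the host paths, and the very idea that annotations reach Allocate at all):

```go
// Hypothetical plugin-side sketch. It assumes pod annotations are
// delivered with the Allocate call, which the current API does not do;
// the annotation key and host paths are made up for illustration.
package openclplugin

// Mount mirrors the shape of a device plugin mount specification.
type Mount struct {
	ContainerPath string
	HostPath      string
}

// libraryMounts picks which OpenCL libraries to expose based on a
// (made-up) annotation describing the container image's libc flavor.
func libraryMounts(podAnnotations map[string]string) []Mount {
	switch podAnnotations["example.com/opencl-libc"] {
	case "musl":
		return []Mount{{ContainerPath: "/usr/lib/libOpenCL.so", HostPath: "/opt/opencl/musl/libOpenCL.so"}}
	default: // assume a glibc-based image
		return []Mount{{ContainerPath: "/usr/lib/libOpenCL.so", HostPath: "/opt/opencl/glibc/libOpenCL.so"}}
	}
}
```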

To draw an analogy with storage, these are like mount options, e.g. for NFS. In the example at https://github.com/kubernetes/examples/blob/master/staging/volumes/nfs/nfs-pv.yaml, annotations would be used in a similar way to the server and path fields.

Or, to show another example with FlexVolume, I'm talking about options like the ones here: https://github.com/kubernetes/kubernetes/blob/master/examples/volumes/flexvolume/nginx.yaml#L20. It would be good to have a similar approach for requesting devices that are announced by device plugins, but that is not possible without serious changes. That's why I thought annotations would be an easier first step: they unblock usage of devices and let us gather real information from the field on how options for requested devices should be handled in the future.

PS. Having designs that depend on the pod name sounds really strange to me.

resouer commented 6 years ago

@kad Unfortunately, the Pod name is the identifier for the network namespace path, so most network hardware plugins will need it. For example, SolarFlare in this doc. Of course, we could add that name to a Pod annotation somehow, but the essential problem is that there will tend to be more such fields.

Actually, including pod annotations in Allocate() also solves our problem. But we may want to find a general way to avoid frequent changes to the DP API.

One option: add an options structure to the DP API, and let the kubelet's devicemanager amend the Pod annotations and the Pod name into that data structure (a rough sketch below). WDYT?
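
Roughly something like this, as a hedged sketch in which every name is made up, so that future metadata extends one message instead of changing the Allocate signature each time:

```go
// Hypothetical sketch of an extensible "options" envelope for Allocate;
// all names are made up. The kubelet's devicemanager would fill it in,
// and future fields would be added to the struct rather than to the
// RPC signature.
package deviceplugin

type WorkloadOptions struct {
	PodName        string            // e.g. needed to locate the network namespace path
	PodNamespace   string            // disambiguates pods with the same name
	PodAnnotations map[string]string // pass-through of the Pod's annotations
	// Future fields (container name, resource class, ...) would go here.
}

type AllocateRequest struct {
	DevicesIDs []string         // existing field
	Options    *WorkloadOptions // nil when the kubelet has nothing extra to pass
}
```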

kad commented 6 years ago

@resouer in the long term, I'd love to see options passed to the DP and usable in Pod resource requests. In the short term, I feel that annotations are a more flexible mechanism, since they do not require modifying the Pod or resource-request specs. So while we don't have the critical mass to define the proper way, and while Resource Class is still under discussion, this can be a solution that unblocks further development of different use cases. We can later add options or flags similar to what CSI does, and still keep annotations as "use at your own risk; behavior is unpredictable and depends on the individual plugin". I might be wrong in my thoughts here. If there are better ways to solve the fundamental issue, "pass additional information (options and tunables) from the requestor of a device to the device plugin", I'm all for it.

Regarding the name of the pod and the SolarFlare use cases: I haven't followed that topic in detail, so I can't say for sure, but it sounds to me more like the issue of having a proper way to define multiple network interfaces for a pod. That is a big API change that SIG-Network has been discussing for quite some time. Some workarounds via device plugins and some in CNIs might help to overcome the limitations here, but during KubeCon's SIG-Network working sessions, and talking to other people in the breaks, I got the impression that without properly solving it by redesigning the networking model to allow multiple interfaces, nothing is going to satisfy everybody. If, since December, there is already some WG looking at defining multi-networking, I'd be interested to join it.

jiayingz commented 6 years ago

cc @vishh @dchen1107 @yguo0905

Thanks a lot for your inputs, @kad

Generally I think we should avoid passing pod or container information to device plugins, but pod annotations are probably an exception, given that they have been used across various k8s areas to pass special information for different purposes. If we can limit the scope of this FR to passing pod annotations from the kubelet to the device plugin, I think we may consider including it in the v1beta1 device plugin API.

songole commented 6 years ago

@jiayingz Is this being considered for some release?

jiayingz commented 6 years ago

I am thinking of drafting a 1.11 device plugin work planning doc. This could be a feature to consider.

vikaschoudhary16 commented 6 years ago

@jiayingz @songole Would appreciate your thoughts on https://docs.google.com/document/d/1TpYqgmJQOctm1oexQRo0cGNb9oyk0vozhmhr6myY99M/edit#

jiayingz commented 6 years ago

@vikaschoudhary16 If we reach agreement on extending the device plugin API to pass pod annotations from the Kubelet to the device plugin, it sounds like a reasonable short-term solution for passing certain device-plugin-specific information through this channel. However, I think pod annotations are mostly meant to facilitate short-term experiments. We need to be careful not to over-use them as a long-term solution, which might lead to fragmented implementations and unknown maintenance burdens in the future. That is actually my main concern with adding pod annotations to the device plugin API.

I do feel we need a better long-term solution for network plugins, and there are several missing pieces that may require a good amount of design and changes throughout the system. Some of these missing pieces may be covered by device plugins, but I don't think they can cover all of them, and it probably requires a lot of input from the sig-network and CRI folks. If there isn't already an existing issue, can you create one to track long-term network plugin solutions? You may want to add the folks in the resource management group who have been involved in the discussion, as well as @thockin @bowei @freehan and @dchen1107, to get their input.

fabiand commented 6 years ago

I agree on both the short- and long-term aspects, and I also favor a tracker for this. However, for the API we should also consider additional devices in order to find the right DP API.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

bart0sh commented 5 years ago

/reopen

k8s-ci-robot commented 5 years ago

@bart0sh: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/59109#issuecomment-473388626):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/59109#issuecomment-483040086):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.