Open eero-t opened 1 year ago
Hi @eero-t
Thank you for flag this and for your suggestions. We are looking into them.
Hi @eero-t would you be able to share both the output of the configure-scheduler.sh run (the output of the command you pasted above) and resulting kube-scheduler.yaml output? I'm expecting the script changed the file in a way that the scheduler would not start anymore. If that's not available, could you share the current kube-scheduler manifest? I would like to understand what was the problem in the first place and without it it's pretty much impossible.
As for the other suggestions, I have added them to our backlog. Will let you know when they will be picked up
As script did not provide a backup (and I was not the person updating that cluster to k8s 1.26), I do now know what the starting point was.
If I run the command now (with fixed config), it outputs following:
# ./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Manifest file is located at: /etc/kubernetes/manifests/kube-scheduler.yaml
Scheduler config file is located at: ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
Scheduler config destination will be: /etc/kubernetes
Version of the image used in the kube scheduler is: 26
Version of the KubeScheduler API: v1
Will proceed to copy the scheduler configuration to its destination path: /etc/kubernetes.
Determined that the appropriate nodeSelector key would be: control-plan
And does not change kube-scheduler.yaml
any more.
Which looks now like this:
...
hostNetwork: true
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/scheduler-config-tas+gas.yaml
name: schedulerconfig
- hostPath:
path: /etc/certs
name: certdir
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
status: {}
After the initial run of the script, there were additional path:
& type:
lines for the last (kubeconfig
) volume, in addition to path:
& type:
lines you see above.
Hi, I tried to use the same command as you did and this was my output:
./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
- mountPath: /etc/kubernetes/scheduler-config.yaml
name: schedulerconfig
readOnly: true
- mountPath: /host/certs
name: certdir
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/scheduler-config.yaml
name: schedulerconfig
- hostPath:
path: /etc/certs
name: certdir
- hostPath:
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
volumeMounts:
- mountPath: /etc/kubernetes/scheduler-config-tas+gas.yaml
name: schedulerconfig
readOnly: true
- mountPath: /host/certs
name: certdir
- mountPath: /etc/kubernetes/scheduler-config.yaml
- mountPath: /etc/kubernetes/scheduler.conf
name: kubeconfig
readOnly: true
hostNetwork: true
priorityClassName: system-node-critical
securityContext:
seccompProfile:
type: RuntimeDefault
volumes:
- hostPath:
path: /etc/kubernetes/scheduler-config-tas+gas.yaml
name: schedulerconfig
- hostPath:
path: /etc/certs
name: certdir
- hostPath:
path: /etc/kubernetes/scheduler-config.yaml
path: /etc/kubernetes/scheduler.conf
type: FileOrCreate
name: kubeconfig
The problem I found is here :
-- hostPath: path: /etc/kubernetes/scheduler-config.yaml path: /etc/kubernetes/scheduler.conf type: FileOrCreate name: kubeconfig
From what I could tell the root cause is this line https://github.com/intel-innersource/libraries.orchestrators.resourcemanagement.platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh#L162
sed -i "/$scheduler_config_file/d" "$MANIFEST_FILE"
I started with the default manifest file (deploy/extender/scheduler-config.yaml), but when I specified a new one -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml, the $MANIFEST_FILE would be scheduler-config-tas+gas.yaml not scheduler-config.yaml so sed can't find and won't remove it.
Not sure if this is exactly the same problem as in my case, the scheduler was still working (the scheduler pod started successfully, and then I was able to schedule the demo pod according to the TAS health-demo policy). I created a follow-up item for us to work on to address this particular problem.
In terms of priorities, I think we will be starting with 2 items first: fixing the bug above and having the file create backups as I think these are more urgent. I'll be picking these up in the next sprint (following week). I will update this issue once they will be released.
IMHO YAML modifications should be done with something that actually understands YAML (e.g. Python script using pyyaml
module), otherwise it's too easy for the script to break things.
Hi,
Apologies for the late reply. I logged an issue in our backlog about this problem (it was also referenced here https://github.com/intel/platform-aware-scheduling/issues/86), but we weren't able to pick-up the work so far. I'm working on a new release for TAS, but unfortunately this change did not make it.
My plan is to look this week/Monday at what changes we can work on in March and then what we plan for the next quarter. I'll be able to give a more specific date then.
Just to keep things clear, I will continue to update this issue regarding the changes required for the configuration script and just update issue#86 with regards to your last request.
Updating Kubernetes scheduler config using the script & (GAS) config in this repo, breaks Kubernetes 1.26 config:
./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml
It messed up the last volume specification in the config so that there were couple of extra lines for the last volume specification.
Besides fixing that in the script, I think script should take a backup of the config file, and show user a diff of the changes it did.
Maybe also ask user whether changes should be accepted (unless something like "--yes" is specified), like e.g. Debian configuration packages script updates do.
You might also consider using kustomize to do the file updates, as that is actually designed for semantic updating of k8s YAML files.
(Which still leaves the issue of upgrade overwriting the changes, as mentioned in https://github.com/intel/platform-aware-scheduling/issues/86.)