kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

The default value of gangSchdName does not match the default name in the latest kube-batch, so gang scheduling fails #391

Open yajunwong opened 3 years ago

yajunwong commented 3 years ago

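Hi, in the latest version of kube-batch the default deployment name is kube-batch, but arena's default value for gangSchdName is kube-batchd. Because the two do not match, gang scheduling cannot be used.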

I suggest aligning the default with the latest version: https://github.com/kubernetes-sigs/kube-batch/commit/70957dc6f0134023d35b3ac6156ef520ffbe6196
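One way to tolerate the rename would be to accept either deployment name when detecting the scheduler. The following is only a minimal sketch of that idea, not arena's actual detection logic: `has_gang_scheduler` is a hypothetical helper that reads the text output of `kubectl get deployments` on stdin and succeeds if the NAME column contains any of the comma-separated candidate names.

```shell
# Hypothetical helper (not part of arena): exit 0 if any candidate
# deployment name appears in the NAME column of the
# `kubectl get deployments` output supplied on stdin, otherwise exit 1.
has_gang_scheduler() {
  awk -v names="$1" '
    BEGIN { n = split(names, want, ",") }    # candidate names, e.g. kube-batch,kube-batchd
    NR > 1 {                                 # skip the header row
      for (i = 1; i <= n; i++)
        if ($1 == want[i]) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

It could be used as `kubectl get deployments | has_gang_scheduler kube-batch,kube-batchd && echo "gang scheduler found"`, adding a `--namespace` flag if the scheduler is deployed outside the current namespace.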

yajunwong commented 3 years ago
-bash-4.2$ grep hasGan  /tmp/values154110043
hasGangScheduler: false
-bash-4.2$ kubectl get deployments
NAME          READY     UP-TO-DATE   AVAILABLE   AGE
kube-batchd   1/1       1            1           7m39s
-bash-4.2$

After renaming the kube-batch deployment, arena still sets hasGangScheduler to false, so gang scheduling still cannot be used.

yajunwong commented 3 years ago
DEBU[0000] Use specified kubeconfig file /etc/kubernetes/admin.conf
DEBU[0000] init arena config
DEBU[0000] illegal arena config file:  due to stat : no such file or directory
DEBU[0000] illegal arena config file: /home/ltops/.arena/config due to stat /home/ltops/.arena/config: no such file or directory
DEBU[0000] auto detect namespace default
DEBU[0000] Get K8S used ports, [0 0 9000 8000 9000 8000 9000 8000 0 0 0 0 0 9100 9100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9100 0 0 0 9100 9000 8000 9100 9100 9100 9100 9100 9100 9100 9100 9100 9000 8000 9000 8000 9100 9100 9100 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9100 9100 9100 9100 9100 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9000 8000 9100 9100 9100 9000 8000 9000 8000 9000 8000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31941 31287 32101]
DEBU[0000] Failed to find kube-batchd due to the server could not find the requested resource
DEBU[0000] Supported cleanTaskPolicy: Running
DEBU[0000] No action for sync Code
DEBU[0000] dataDir: []
DEBU[0000] dataset: []
DEBU[0000] annotations: []
DEBU[0000] Current user: 1016
DEBU[0000] PodSecurityContext {1016 true 1016 [1016]}
DEBU[0000] tolerations: []
DEBU[0000] imagePullSecrets: []
DEBU[0000] node selectors: []
DEBU[0000] psSelectors: []
DEBU[0000] workerSelectors: []
DEBU[0000] evaluatorSelectors: []
DEBU[0000] chiefSelectors: []
DEBU[0000] Init TensorFlow job trainer
DEBU[0000] Check fine2 exist due to error Failed to find the job for fine2
DEBU[0000] Exec /usr/local/bin/arena-kubectl, [get configmap fine2-tfjob --namespace default]
DEBU[0000] Failed to execute kubectl, [get configmap fine2-tfjob --namespace default] with exit status 1
DEBU[0000] No resources found.
Error from server (NotFound): configmaps "fine2-tfjob" not found

DEBU[0000] Save the values file /tmp/values156024540
DEBU[0000] values: &{TFNodeSelectors:map[Evaluator:map[] Chief:map[] PS:map[] Worker:map[]] Port:0 WorkerImage:quay.io/wangyajun/minist:v0.2 WorkerPort:20000 PSPort:20002 PSCount:20 PSImage:quay.io/wangyajun/minist:v0.2 WorkerCpu:20 WorkerMemory:20Gi PSCpu:20 PSGpu:0 PSMemory:20Gi CleanPodPolicy:Running UseChief:true ChiefCount:1 UseEvaluator:true ChiefPort:20001 ChiefCpu:20 ChiefMemory:20Gi EvaluatorCpu:20 EvaluatorMemory:20Gi EvaluatorCount:1 HasGangScheduler:false submitArgs:{NodeSelectors:map[] ConfigFiles:map[] Tolerations:[] Image:quay.io/wangyajun/minist:v0.2 GPUCount:0 Envs:map[CODE_PACK:s3://abc-aiapp/individual/yajun/model/fine2/shein-ctr.tar.gz ENV_PACK:s3://abc-aiapp/individual/zhangzhen/parallel/env/SheinCTR.tar.gz ENTRYPOINT:entrypoint.sh workers:200 gpus:0] WorkingDir:/opt/tfjob Command:bash run.sh Mode:tfjob WorkerCount:200 Retry:0 DataSet:map[] DataDirs:[] EnableRDMA:false UseENI:false Annotations:map[] IsNonRoot:true PodSecurityContext:{RunAsUser:1016 RunAsNonRoot:true RunAsGroup:1016 SupplementalGroups:[1016]} PriorityClassName: Conscheduling:false PodGroupName: PodGroupMinAvailable: ImagePullSecrets:[]} submitTensorboardArgs:{UseTensorboard:false TensorboardImage:registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel TrainingLogdir:/training_logs HostLogPath: IsLocalLogging:false} submitSyncCodeArgs:{SyncMode: SyncSource: SyncImage: SyncGitProjectName:} tfRuntime:<nil>}
DEBU[0000] Exec bash -c [/usr/local/bin/arena-helm template -f /tmp/values156024540 --namespace default --name fine2 /charts/tfjob > /tmp/fine2.yaml973860491]
DEBU[0000] Generating template  [/usr/local/bin/arena-helm template -f /tmp/values156024540 --namespace default --name fine2 /charts/tfjob > /tmp/fine2.yaml973860491]
DEBU[0000] Exec /usr/local/bin/arena-kubectl, [create --dry-run --namespace default -f /tmp/fine2.yaml973860491]
DEBU[0000] Save the config file /tmp/config209429614
DEBU[0000] dry run result: [tfjob.kubeflow.org/fine2 created (dry run) ]
DEBU[0000] cols: [tfjob.kubeflow.org/fine2 created (dry run)], 4
DEBU[0000] cols: [], 0
DEBU[0000] Exec bash -c [/usr/local/bin/arena-helm inspect chart /charts/tfjob | grep version:]
DEBU[0000] Exec /usr/local/bin/arena-kubectl, [create configmap fine2-tfjob --namespace default --from-file=values=/tmp/values156024540 --from-file=app=/tmp/config209429614 --from-literal=tfjob=0.30.0]
configmap/fine2-tfjob created
DEBU[0000] Exec /usr/local/bin/arena-kubectl, [label configmap fine2-tfjob --namespace default createdBy=arena]
configmap/fine2-tfjob labeled
DEBU[0000] Exec bash -c [cat /tmp/config209429614 | xargs /usr/local/bin/arena-kubectl delete --namespace default]
DEBU[0001]
DEBU[0001] Failed to execute bash -c, [cat /tmp/config209429614 | xargs /usr/local/bin/arena-kubectl delete --namespace default] with exit status 123
DEBU[0001] Failed to UninstallAppsWithAppInfoFile due to exit status 123
DEBU[0001] Exec /usr/local/bin/arena-kubectl, [apply --namespace default -f /tmp/fine2.yaml973860491]
DEBU[0001] tfjob.kubeflow.org/fine2 created

tfjob.kubeflow.org/fine2 created
INFO[0001] The Job fine2 has been submitted successfully
INFO[0001] You can run `arena get fine2 --type tfjob` to check the job status

yajunwong commented 3 years ago

Kubernetes version:

[root@cpu01 arena]# kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.12+k3s1", GitCommit:"56cd36302dc3188f21f9877d1309df7d80cd8b7d", GitTreeState:"clean", BuildDate:"2020-11-13T06:12:38Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

api-versions

admissionregistration.k8s.io/v1
admissionregistration.k8s.io/v1beta1
apiextensions.k8s.io/v1
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1
apiregistration.k8s.io/v1beta1
apps/v1
arbitrator.incubator.k8s.io/v1
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
autoscaling/v2beta2
batch/v1
batch/v1beta1
certificates.k8s.io/v1beta1
coordination.k8s.io/v1
coordination.k8s.io/v1beta1
discovery.k8s.io/v1beta1
events.k8s.io/v1beta1
extensions/v1beta1
helm.cattle.io/v1
k3s.cattle.io/v1
kai.alibabacloud.com/v1alpha1
kubedl.io/v1alpha1
kubeflow.org/v1
kubeflow.org/v1alpha1
kubeflow.org/v1alpha2
kubeflow.org/v1beta1
metrics.k8s.io/v1beta1
monitoring.coreos.com/v1
monitoring.coreos.com/v1alpha1
networking.k8s.io/v1
networking.k8s.io/v1beta1
node.k8s.io/v1beta1
operators.coreos.com/v1
operators.coreos.com/v1alpha1
operators.coreos.com/v1alpha2
packages.operators.coreos.com/v1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
scheduling.incubator.k8s.io/v1alpha1
scheduling.k8s.io/v1
scheduling.k8s.io/v1beta1
scheduling.sigs.dev/v1alpha2
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1
xdl.kubedl.io/v1alpha1
xgboostjob.kubeflow.org/v1alpha1

happy2048 commented 3 years ago

This bug has been fixed; please try again.