kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

Model information does not display correctly when getting a training job #1067

Closed. ChenYi015 closed this issue 3 months ago

ChenYi015 commented 3 months ago

Register a model version when submitting a training job:

$ arena submit pytorchjob \
    --name=bloom-sft-2 \
    --gpus=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/acs/deepspeed:v0.9.0-chat \
    --label=xxx=yyy \
    --data=training-data:/model \
    --model-name=my-model \
    --model-source=pvc://default/training-data/bloom-560m-sft \
    "cd /model/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning && bash training_scripts/other_language/run_chinese.sh /model/bloom-560m-sft"

pytorchjob.kubeflow.org/bloom-sft-2 created
INFO[0001] The Job bloom-sft-2 has been submitted successfully 
INFO[0001] You can run `arena get bloom-sft-2 --type pytorchjob -n default` to check the job status 
INFO[0001] registered model "my-model" created          
INFO[0001] model version 1 for "my-model" created 

The log shows that version 1 of the registered model my-model was created. However, when getting the job, the ModelName field shows bloom-sft-2 (the job name) rather than my-model:

$ arena get bloom-sft-2
Name:          bloom-sft-2
Status:        PENDING
Namespace:     default
Priority:      N/A
Trainer:       PYTORCHJOB
Duration:      3m
CreateTime:    2024-04-09 20:29:46
EndTime:       
ModelName:     bloom-sft-2
ModelVersion:  1
ModelSource:   pvc://default/training-data/bloom-560m-sft

Instances:
  NAME                  STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                  ------   ---  --------  --------------  ----
  bloom-sft-2-master-0  Pending  3m   true      1               N/A
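
The mismatch above can also be checked mechanically, for example when verifying a fix. The following is a minimal sketch, assuming the plain-text layout of the arena get output shown above. The helper names model_name_from_get and check_model_name are hypothetical and not part of arena.

```shell
#!/bin/sh
# Hypothetical helper (not part of arena): reads the text output of
# `arena get` on stdin and prints the value of the ModelName field.
model_name_from_get() {
    awk '$1 == "ModelName:" { print $2; exit }'
}

# Hypothetical helper: compares the reported ModelName with the name
# that was passed to --model-name at submit time.
check_model_name() {
    expected="$1"
    actual="$(model_name_from_get)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: ModelName is $actual"
    else
        echo "MISMATCH: expected $expected, got $actual"
        return 1
    fi
}
```

Usage, with the job from this report: arena get bloom-sft-2 | check_model_name my-model. With the output shown above, this reports a mismatch, since ModelName is bloom-sft-2.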