Closed samos123 closed 1 year ago
Testing using following resource:
apiVersion: substratus.ai/v1 kind: Model metadata: name: falcon-7b-instruct-k8s spec: image: name: substratusai/model-trainer-huggingface:pr-17 baseModel: name: falcon-7b-instruct trainingDataset: name: k8s-instructions params: num_train_epochs: 1 save_steps: 5 resources: gpu: count: 4 type: nvidia-l4
test case:
Note the huggingface default for save_steps is 500 which means it only stores checkpoints once every 500 steps: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.save_steps
Testing using following resource:
test case:
Note the huggingface default for save_steps is 500 which means it only stores checkpoints once every 500 steps: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.save_steps