substratusai / images

Official Substratus Container Images

model-trainer-huggingface: resume from checkpoint #17

Closed samos123 closed 1 year ago

samos123 commented 1 year ago

Tested using the following resource:

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: falcon-7b-instruct-k8s
spec:
  image:
    name: substratusai/model-trainer-huggingface:pr-17
  baseModel:
    name: falcon-7b-instruct
  trainingDataset:
    name: k8s-instructions
  params:
    num_train_epochs: 1
    save_steps: 5
  resources:
    gpu:
      count: 4
      type: nvidia-l4

test case:

Note that the Hugging Face default for save_steps is 500, which means a checkpoint is only stored once every 500 steps. The resource above sets save_steps: 5 so that checkpoints are written often enough to test resuming: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.save_steps
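
For reference, a minimal sketch of the kind of lookup that resuming relies on. This is not the code from this PR. It assumes checkpoints are saved as `checkpoint-<step>` directories under the trainer's output directory (the Hugging Face Trainer naming convention); `find_last_checkpoint` and the directory argument are hypothetical names used only for illustration.

```python
import os
import re
from typing import Optional

# Hugging Face Trainer saves checkpoints as "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-step checkpoint directory, or None.

    Hypothetical helper: steps are compared numerically so that
    checkpoint-100 is treated as newer than checkpoint-15.
    """
    if not os.path.isdir(output_dir):
        return None

    best_step = -1
    best_path = None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Skip anything that is not a checkpoint-<step> directory.
        if match is None or not os.path.isdir(path):
            continue
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

With a `transformers.Trainer`, the result could be passed as `trainer.train(resume_from_checkpoint=find_last_checkpoint(output_dir))`; a value of `None` starts training from scratch. The library also provides `transformers.trainer_utils.get_last_checkpoint` for a similar lookup.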