OOM while finetuning Starcoder

I really appreciate you releasing this work. I have been trying to do something similar with the original Starcoder finetuning code but have had a variety of issues. Unfortunately, when I run this script on my own dataset (it's only around 6800 MOO verbs) I get a pretty rapid OOM on a machine with 8x A100 80gb cards. At first I thought it was because I was trying to increase max_seq_size, (I was hoping for 1024 tokens) but dropping it back to 512 gave me the same issue. I then tried reducing batch size to 1, but that also did not work and errored out with insufficient memory again. The only other thing I changed is the prompt, although I made very minor changes to that, mostly just changing the language to my own and picking different columns out of my dataset.

Here is my run.sh:

#! /usr/bin/env bash
set -e # stop on first error
set -u # stop if any variable is unbound
set -o pipefail # stop if any command in a pipe fails

LOG_FILE="output.log"
TRANSFORMERS_VERBOSITY=info

get_gpu_count() {
  local gpu_count
  gpu_count=$(nvidia-smi -L | wc -l)
  echo "$gpu_count"
}

gpu_count=$(get_gpu_count)
echo "Number of GPUs: $gpu_count"

train() {
    local script="$1"
    shift 1
    local script_args="$@"

    if [ -z "$script" ] || [ -z "$script_args" ]; then
        echo "Error: Missing arguments. Please provide the script and script_args."
        return 1
    fi

    { torchrun --nproc_per_node="$gpu_count" "$script" $script_args 2>&1; } | tee -a "$LOG_FILE"
    }

train train.py \
    --model_name_or_path "bigcode/starcoder" \
    --data_path ./verbs_augmented/verbs_augmented.jsonl \
    --bf16 True \
    --output_dir moocoder \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard" \
    --fsdp_transformer_layer_cls_to_wrap 'GPTBigCodeBlock' \
    --tf32 True

Any idea what might be going wrong here/can I give you any more info to help me figure this out?

minosvasilias / godot-dodo

OOM while finetuning Starcoder #10