loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0

ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration. #23

cmosguy commented 1 year ago

Hey @loubnabnl,

Thanks for this repo - I've learned a lot from what you implemented here.

I am encountering a strange error when I attempt to use the command:

python santacoder-finetuning/train.py \
        --model_path="bigcode/santacoder" \
        --dataset_name="json" \
        --subset="./mydataset/" \
        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 1000 \
        --batch_size 2 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-5 \
        --num_warmup_steps 100 \
        --eval_freq 100 \
        --save_freq 100 \
        --log_freq 1 \
        --no_fp16 \
        --fim_rate 0.5 \
        --fim_spm_rate 0.5
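
For context, my understanding is that the dataset flags above resolve to roughly the following datasets call; the exact loading code in train.py may differ:

from datasets import load_dataset

# The "json" builder pointed at my local directory of JSON files;
# mapping --subset to data_dir is my assumption about how train.py wires it.
dataset = load_dataset("json", data_dir="./mydataset/", split="train")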

If I run this, I end up getting an error that says:

ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration.

I hit this error once training passes the 0.1 epoch mark:

{'loss': 0.2896, 'learning_rate': 4.9e-05, 'epoch': 0.1}
{'loss': 0.2095, 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.1}
{'loss': 0.291, 'learning_rate': 5e-05, 'epoch': 0.1}
Traceback (most recent call last):
  File "../santacoder-finetuning/train.py", line 289, in <module>

  File "../santacoder-finetuning/train.py", line 279, in main
    run_training(args, train_dataset, eval_dataset)
  File "../santacoder-finetuning/train.py", line 268, in run_training
    trainer.train()
  File " /lib/python3.10/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File " /lib/python3.10/site-packages/transformers/trainer.py", line 1930, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File " /lib/python3.10/site-packages/transformers/trainer.py", line 2257, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File " /lib/python3.10/site-packages/transformers/trainer.py", line 2982, in evaluate
    output = eval_loop(
  File " /lib/python3.10/site-packages/transformers/trainer.py", line 3161, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File " /lib/python3.10/site-packages/accelerate/data_loader.py", line 582, in __iter__
    raise ValueError(
ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration.

My train and test datasets are small and look like this:

Size of the train set: 295. Size of the validation set: 2
 74%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                 | 295/400 [00:00<00:00, 452.00it/s]
The character to token ratio of the dataset is: 3.90
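
One sanity check I can think of: with only 2 validation examples and seq_length 2048, the validation split may not contain enough tokens to pack even a single example. A minimal sketch of that check, assuming my JSON files carry the text in a "content" field:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
data = load_dataset("json", data_dir="./mydataset/", split="train")

# Stand-in for the tiny 2-example validation split reported above.
valid = data.select(range(2))
total_tokens = sum(len(tokenizer(ex["content"]).input_ids) for ex in valid)

# At ~3.9 chars/token, one 2048-token example needs roughly 8000 characters.
print(f"validation tokens: {total_tokens} (need >= 2048 to pack one example)")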

Do you have any thoughts on what I need to adjust in the training loop? Is it because my train set is too small?

Thanks! Adam

cmosguy commented 1 year ago

Thanks @muellerzr for your quick response. I tried the following modifications:

        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 100 \
        --batch_size 2 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-5 \
        --num_warmup_steps 10 \
        --eval_freq 10 \
        --save_freq 10 \
        --log_freq 1 \
        --bf16 \
        --fim_rate 0.5 \
        --fim_spm_rate 0.5

I kept batch_size=2, but when I decreased max_steps to 100 it still failed with the same error. Is that what you meant: lowering max_steps below the total number of train samples?
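
For reference, here is the back-of-the-envelope math I am using to reason about step counts (a sketch of my understanding, not the repo's exact accounting):

# Each optimizer step consumes a fixed token budget with packed sequences.
batch_size = 2
gradient_accumulation_steps = 4
seq_length = 2048
tokens_per_step = batch_size * gradient_accumulation_steps * seq_length
print(tokens_per_step)  # 16384 tokens per optimizer step

# The traceback above shows the failure inside evaluate(), not the training
# loop itself, so changing max_steps alone may not avoid the empty eval batch.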

cmosguy commented 1 year ago

Well... I just ran the command below and it went fine for this dataset. It is very strange; I don't understand why it works for this dataset and not mine.

        --model_path="bigcode/santacoder" \
        --dataset_name="bigcode/the-stack-dedup" \
        --subset="data/python" \
        --output_dir "./checkpoints/santacoder-the-stack-dedup-python-debug-298-samples" \
        --data_column "content" \
        --split="train[0:298]" \
        --seq_length 2048 \
        --max_steps 1000 \
        --batch_size 2 \
        --gradient_accumulation_steps 8 \
        --learning_rate 5e-5 \
        --num_warmup_steps 10 \
        --eval_freq 100 \
        --save_freq 100 \
        --log_freq 1 \
        --bf16 \
        --fim_rate 0.5 \
        --fim_spm_rate 0.5

Do you have any words of wisdom on this? Is there any way to unit test this piece with my data so I can understand the root cause? I do not understand why the batch is `None` when it gets to your updated code adjustment.
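
If it helps, one way I could try to isolate it is to build the evaluation dataset on its own and count what it actually yields; zero would explain the empty batch. A rough sketch, assuming the packing class in train.py is named ConstantLengthDataset and guessing at its constructor arguments:

from datasets import load_dataset
from transformers import AutoTokenizer
from train import ConstantLengthDataset  # assuming it is importable from train.py

tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
data = load_dataset("json", data_dir="./mydataset/", split="train")
valid = data.select(range(2))  # mimic my tiny validation split

# infinite=False so iteration stops once the buffer cannot fill a sequence;
# the keyword names below are my guesses and may not match train.py exactly.
eval_dataset = ConstantLengthDataset(
    tokenizer,
    valid,
    infinite=False,
    seq_length=2048,
)
n_packed = sum(1 for _ in eval_dataset)
print(f"eval dataset yielded {n_packed} packed examples")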