cmosguy opened this issue 1 year ago
Thanks @muellerzr for your quick response... I tried the following modifications:
```shell
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 100 \
--batch_size 2 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--num_warmup_steps 10 \
--eval_freq 10 \
--save_freq 10 \
--log_freq 1 \
--bf16 \
--fim_rate 0.5 \
--fim_spm_rate 0.5
```
I kept `batch_size=2`, but when I decreased `max_steps` to 100 it still failed with the same error. Is that what you meant: lower `max_steps` to a value below the total number of training samples?
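As a side note, a quick back-of-the-envelope check may help here. Assuming the script follows the usual convention that each optimizer step consumes `batch_size × gradient_accumulation_steps` packed examples (an assumption about this script, not something confirmed in the thread), the helper below (hypothetical, for illustration only) shows how many examples the run above would need:

```python
# Hypothetical helper: packed examples consumed over a run, assuming the
# usual convention of batch_size * gradient_accumulation_steps examples
# per optimizer step. Numbers below match the command in this comment.
def samples_needed(max_steps, batch_size, grad_accum):
    return max_steps * batch_size * grad_accum

# max_steps=100, batch_size=2, gradient_accumulation_steps=4
print(samples_needed(100, 2, 4))  # -> 800 packed examples for 100 steps
```

If the dataset cannot produce that many `seq_length`-sized examples, the loader may run dry before `max_steps` is reached.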
Well... I just ran it, and it went fine for this dataset. Very strange; I don't understand why it works for this dataset and not mine.
```shell
--model_path="bigcode/santacoder" \
--dataset_name="bigcode/the-stack-dedup" \
--subset="data/python" \
--output_dir "./checkpoints/santacoder-the-stack-dedup-python-debug-298-samples" \
--data_column "content" \
--split="train[0:298]" \
--seq_length 2048 \
--max_steps 1000 \
--batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-5 \
--num_warmup_steps 10 \
--eval_freq 100 \
--save_freq 100 \
--log_freq 1 \
--bf16 \
--fim_rate 0.5 \
--fim_spm_rate 0.5
```
Do you have any words of wisdom on this? Is there any way to unit test this piece with my data so I can understand the root cause? I do not understand why the batch is `None` when it gets to your updated code adjustment.
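One plausible explanation, worth testing in isolation: constant-length packing only emits an example once its token buffer reaches `seq_length`, so a dataset that tokenizes to fewer than `seq_length` total tokens yields nothing, which can surface downstream as `batch = None`. The sketch below is a toy reproduction of that behavior with a stand-in whitespace tokenizer, not the actual script code:

```python
# Toy reproduction of constant-length packing: an example is emitted only
# when the token buffer reaches seq_length. A tiny dataset yields zero
# examples, which downstream code may see as an empty loader / None batch.
def packed_examples(texts, seq_length, tokenize=lambda t: t.split()):
    buffer = []
    for text in texts:
        buffer.extend(tokenize(text))
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]

tiny = ["short sample one", "short sample two"]   # ~6 tokens total
print(list(packed_examples(tiny, seq_length=2048)))       # -> [] (nothing to train on)

big = ["tok " * 5000]                             # ~5000 tokens total
print(len(list(packed_examples(big, seq_length=2048))))   # -> 2 full sequences
```

Running the same kind of check against your own split (with the real tokenizer) would tell you quickly whether the data path, rather than the training loop, is the problem.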
Hey @loubnabnl,
Thanks for this repo - I've learned a lot from what you implemented here.
I am encountering a strange error when I attempt to use the command:
If I run this, I end up getting an error that says:
I hit this error when I pass through the 0.1 mark of the epoch:
My train and test datasets are small and look like this:
Do you have any thoughts on what I need to do to adjust the training loop? Is it because my train set is too small?
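For what it's worth, one way to estimate whether a split is too small is to convert its character count into packed examples. The `chars_per_token` ratio of ~3.6 below is only a rough heuristic for code; both the helper and its numbers are assumptions for illustration, not measurements from this run:

```python
# Rough sanity check: how many seq_length-sized packed examples a small
# dataset can produce. chars_per_token ~3.6 is a heuristic for code text;
# all numbers here are illustrative assumptions.
def packed_example_count(num_samples, avg_chars_per_sample,
                         seq_length=2048, chars_per_token=3.6):
    total_tokens = num_samples * avg_chars_per_sample / chars_per_token
    return int(total_tokens // seq_length)

# e.g. 50 samples of ~1000 characters each:
print(packed_example_count(50, 1000))  # -> 6 packed examples
```

If this estimate for your split is near zero, the loader would have almost nothing to iterate over regardless of `max_steps`.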
Thanks! Adam