Closed candygocandy closed 2 months ago
I think the make-supervised-dataset step has a tokenization error; it looks like it is indexing a list out of range?
@tongyx361 , if you clear your cached tokenized .pt file and re-run the script, you will see what I am talking about
@tongyx361 , do you have the bandwidth to fix it? Sorry for being super pushy... I am currently working on something that builds on top of this and the deadline is approaching....
Thanks for your understanding; I am indeed quite busy these days. But to help you catch your deadline, I will try my best to fix this ASAP, though I actually haven't hit similar errors across the several setups I've run for my experiments.
Happy to report that I was able to reproduce the error and fix it, at least in my environment.
This error seems to arise from an incompatibility with transformers>=4.43, not from a tokenization issue.
You might fix it by running:
pip install "transformers<4.43"
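If you want to double-check that the downgrade took effect, here is a quick pure-Python sketch of the version comparison (`version_lt` is a hypothetical helper; real version strings can carry suffixes like `.dev0` that this naive split does not handle, in which case `packaging.version.Version` is the robust option):

```python
# Naive version comparison: split on dots and compare numeric tuples.
# Only a sketch; it does not handle pre-release suffixes like "4.43.0.dev0".
def version_lt(a: str, b: str) -> bool:
    return tuple(int(x) for x in a.split(".")) < tuple(int(x) for x in b.split("."))

print(version_lt("4.42.4", "4.43"))  # True: satisfies the "<4.43" pin
print(version_lt("4.43.0", "4.43"))  # False: too new
```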
btw, you might want to test with a more lightweight command before the main experiments:
bash scripts/train-single-node.sh \
--data_path "hendrycks/competition_math" --query_field problem --resp_field solution \
--model_path "TinyLlama/TinyLlama-1.1B-intermediate-step-240k-503b" \
--lr "5e-5" --bs 16 --n_grad_acc_steps 1 --n_epochs 1 \
--gpu_ids "0" \
--output_dir "models/tinyllama-1.1B-math"
thx bro
I did exactly as you said, installed it, and ran the following command.
bash scripts/train-single-node.sh \
--data_path "hkust-nlp/dart-math-hard" \
--model_path "meta-llama/Meta-Llama-3-8B" \
--lr "5e-5" --bs 64 --n_grad_acc_steps 1 --n_epochs 1 \
--gpu_ids "0,1,2,3,4,5,6,7" \
--output_dir "models/dart-math-llama3-8b-prop2diff"
It gave me this error:
..pipeline/train.py, line 265
    train()
...
"../transformers/trainer.py", line 2279
    tr_loss_step = self.training_step(model, inputs)
...
"../transformers/models/llama/modeling_llama.py", line 996, in _update_causal_mask
    if attention_mask is not None and 0.0 in attention_mask:
".../torch/tensor.py", line 1116, in __contains__
    return (element == self).any().item()
RuntimeError: CUDA error: device-side assert triggered
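For context on the last frames of that traceback: `0.0 in attention_mask` is a membership test on a tensor, and torch's `Tensor.__contains__` is the `(element == self).any().item()` you see there; the `.item()` forces a host-device sync, which is likely why a device-side assert from an earlier kernel surfaces at this line rather than where it was triggered. A minimal torch-free sketch of those semantics (`FakeTensor` is a hypothetical class, just to illustrate the elementwise-compare-then-any behavior):

```python
# Sketch of Tensor.__contains__ semantics without torch: elementwise
# equality followed by any(), mirroring (element == self).any().item().
class FakeTensor:
    def __init__(self, data):
        self.data = list(data)

    def __contains__(self, element):
        # torch compares `element` against every entry, then reduces with any();
        # on CUDA, pulling that reduction back to the host is a sync point.
        return any(x == element for x in self.data)

mask = FakeTensor([1.0, 1.0, 0.0])
print(0.0 in mask)  # True -> modeling_llama would take the masked branch
```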