hkust-nlp / dart-math

[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
https://hkust-nlp.github.io/dart-math/
MIT License

unable to train #2

Closed candygocandy closed 2 months ago

candygocandy commented 2 months ago

I followed the setup exactly as described, installed everything, and ran the following command.

bash scripts/train-single-node.sh \
    --data_path "hkust-nlp/dart-math-hard" \
    --model_path "meta-llama/Meta-Llama-3-8B" \
    --lr "5e-5" --bs 64 --n_grad_acc_steps 1 --n_epochs 1 \
    --gpu_ids "0,1,2,3,4,5,6,7" \
    --output_dir "models/dart-math-llama3-8b-prop2diff"

It gave me the error:

    .../pipeline/train.py, line 265: train()
    ...
    .../transformers/trainer.py, line 2279: tr_loss_step = self.training_step(model, inputs)
    ...
    .../transformers/models/llama/modeling_llama.py, line 996, in _update_causal_mask:
        if attention_mask is not None and 0.0 in attention_mask:
    .../torch/tensor.py, line 1116, in __contains__:
        return (element == self).any().item()
    RuntimeError: CUDA error: device-side assert triggered
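
For reference, the last frame in that trace is just PyTorch's tensor containment check; a minimal sketch of what it does (plain PyTorch, not this repo's code):

    # `0.0 in attention_mask` in modeling_llama goes through Tensor.__contains__,
    # which evaluates (element == self).any().item(). The .item() call syncs with
    # the GPU, so an earlier asynchronous CUDA failure (e.g. an out-of-range index)
    # gets reported at this line even if the line itself is fine.
    import torch

    if torch.cuda.is_available():
        attention_mask = torch.ones(2, 8, device="cuda")
        contains_zero = (0.0 == attention_mask).any().item()  # same as `0.0 in attention_mask`
        print(contains_zero)  # False for an all-ones mask
    # Rerunning with CUDA_LAUNCH_BLOCKING=1 set in the environment usually points
    # at the kernel that actually asserted.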

candygocandy commented 2 months ago

I think the problem is in how the supervised dataset is tokenized; it looks like something is indexing a list out of range?

candygocandy commented 2 months ago

@tongyx361 , if you clear your cached tokenized .pt file and rerun the script, you will see what I am talking about.
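
(Concretely, something like the following to find and delete the cached tensors before rerunning; I am assuming the cache is stored as .pt files somewhere under the repo, the exact path may differ.)

    from pathlib import Path

    # Hypothetical cache-clearing helper: list every cached .pt file under the
    # working directory, then delete it once you have confirmed the path.
    for pt_file in Path(".").rglob("*.pt"):
        print(pt_file)      # inspect first
        # pt_file.unlink()  # uncomment to actually remove the cached file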

candygocandy commented 2 months ago

@tongyx361 , do you have the bandwidth to fix this? Sorry for being so pushy... I am currently working on something that builds on top of this, and the deadline is approaching.

tongyx361 commented 2 months ago

> @tongyx361 , do you have the bandwidth to fix this? Sorry for being so pushy... I am currently working on something that builds on top of this, and the deadline is approaching.

Thanks for understanding; I am indeed quite busy these days. To help you meet the deadline, though, I will try my best to fix this ASAP, even though I have not run into a similar error across several setups of my own experiments.

tongyx361 commented 2 months ago

Happy to report that I was able to reproduce the error and fix it, at least in my environment. It seems to arise from an incompatibility with transformers>=4.43, not from tokenization. You might fix it by running:

pip install "transformers<4.43"
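
After reinstalling, a quick sanity check that the pin took effect in the same Python environment (not part of the repo, just a check):

    from packaging.version import Version
    import transformers

    # The fix above assumes the training run actually picks up the older version.
    assert Version(transformers.__version__) < Version("4.43"), transformers.__version__
    print("transformers", transformers.__version__)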

Btw, you might test with a more lightweight command before running the main experiments:

bash scripts/train-single-node.sh \
    --data_path "hendrycks/competition_math" --query_field problem --resp_field solution \
    --model_path "TinyLlama/TinyLlama-1.1B-intermediate-step-240k-503b" \
    --lr "5e-5" --bs 16 --n_grad_acc_steps 1 --n_epochs 1 \
    --gpu_ids "0" \
    --output_dir "models/tinyllama-1.1B-math"
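
If that small run completes, a minimal smoke test of the saved checkpoint could look like this (assuming a standard Hugging Face checkpoint is written to --output_dir; adjust the prompt to match the prompt format used in training):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the checkpoint produced by the lightweight run above.
    model_dir = "models/tinyllama-1.1B-math"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)

    # Generate a short completion just to confirm the weights load and run.
    inputs = tokenizer("What is 2 + 2?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
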
candygocandy commented 2 months ago

thx bro