Open · lyf-00 opened this issue 8 months ago
Same problem for me! Did you solve it? I also cannot reproduce the results with either Mistral or LLaMA.
I encountered the same issue when trying to train Mistral-7B on MetaMathQA. My environment is:
transformers==4.34.0
torch==2.0.1
sentencepiece==0.1.99
tokenizers==0.14.1
accelerate==0.21.0
I only got 69% accuracy on GSM8K and 24% on MATH after 3 epochs with LR 5e-6 and a global batch size of 128. Due to my limited computational resources, I added gradient checkpointing and flash attention to the original code and changed per_device_batch_size to 1 (so gradients accumulate for 16 steps on 8 GPUs), but I don't think these modifications should significantly affect performance.
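As a sanity check on the batch-size change above, the effective global batch size should still match the 128 used in the paper. A minimal sketch (the values below are the ones from my run, not anything taken from the repo's script):

```python
# Sanity check: per-device batch * accumulation steps * GPU count
# should reproduce the paper's global batch size of 128.
# These values are from my setup, not from run_mistral.sh itself.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 8

global_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch)  # 128
```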
My result with llama-factory and the hyperparameters reported in the paper is 72.2% on GSM8K. I do not understand why there are so many failures when trying to reproduce the result.
Hello, I attempted to replicate the experiment by fine-tuning Mistral-7B on the MetaMathQA dataset, but the results I obtained do not match the ones shared in the repository.
Reproduction steps
I used the parameters in run_mistral.sh and got:
gsm8k acc==== 0.6618650492797574
math acc==== 0.2274
which is different from the reported 77.7 and 28.2.
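For reference, GSM8K is usually scored by exact match on the final numeric answer, and the ground-truth solutions mark it after "#### ". Below is a minimal sketch of that scoring, assuming the extraction rule for model outputs (last number in the generation) is hypothetical and may differ from what the repo's eval script actually does:

```python
import re

def extract_gold(solution: str) -> str:
    # GSM8K ground-truth solutions end with "#### <answer>"
    return solution.split("####")[-1].strip().replace(",", "")

def extract_pred(output: str) -> str:
    # Hypothetical extraction rule: take the last number in the generation
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return nums[-1] if nums else ""

def accuracy(preds, golds):
    correct = sum(extract_pred(p) == extract_gold(g) for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy example with two problems, one answered correctly
golds = ["... so the total is 18.\n#### 18", "#### 5"]
preds = ["The answer is 18", "I think it's 6"]
print(accuracy(preds, golds))  # 0.5
```

If several people see the same gap, it may be worth checking that everyone is using the same answer-extraction logic, since a mismatch there can shift accuracy by a few points.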
Environment details
Here are the details of my Python environment:
I would appreciate any guidance or suggestions you could provide to help resolve this discrepancy. Thank you in advance for your time and assistance.
Best regards, lyf-00