Closed MachineGunLin closed 5 months ago
If I set atol to 1e-4, the result becomes 224 / 739 (224 out of 739 weights fail the check) — I don't know if this is within the normal margin of error.
You mentioned that one of Unsloth's key features is "no approximation methods - all exact", so the weights should be identical. I don't understand what's wrong with my code.
Oh this is entirely normal - due to different internal upcasting and downcasting in the Triton kernels, you will see some small fractional differences. Sometimes you might see a slightly lower loss, sometimes a slightly higher loss - it all depends.
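A minimal pure-Python illustration (not Unsloth code) of why a different accumulation or casting order inside a kernel can shift the low-order bits of a result — floating-point addition is not associative:

```python
# Floating-point addition is not associative: the order in which partial
# sums are accumulated changes the low-order bits of the result.
vals = [1e16, 1.0, -1e16]

left_to_right = (vals[0] + vals[1]) + vals[2]  # the 1.0 is absorbed by 1e16 first
reordered     = (vals[0] + vals[2]) + vals[1]  # the big terms cancel first

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

The same effect, scaled down to the last one or two bits of each element, is what shows up when two implementations accumulate gradients or matmuls in a different order or at a different intermediate precision.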
It is indeed "no approximation and all exact" (i.e., we do not approximate attention etc.). You will still see some precision differences, but that's entirely normal :)
I normally compare the training losses and check that they mostly match - that's the most important part.
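That loss comparison can be sketched as follows (a hypothetical helper, not part of Unsloth or TRL — step-by-step losses within a loose relative tolerance rather than bit equality):

```python
import math

def losses_match(losses_a, losses_b, rel_tol=1e-2):
    """Check that two training-loss curves agree step by step
    within a relative tolerance (a loose sanity check, not bit equality)."""
    return all(math.isclose(a, b, rel_tol=rel_tol)
               for a, b in zip(losses_a, losses_b))

# Example: losses from two runs that differ only in low-order rounding.
unsloth_losses = [2.131, 1.874, 1.502]
hf_losses      = [2.130, 1.876, 1.499]
print(losses_match(unsloth_losses, hf_losses))  # True
```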
Hi,
Have you found that when running the same training procedure with Unsloth (for example, fine-tuning Llama-3-8B-Instruct with Unsloth), the training loss differs and the performance on the test dataset is also inconsistent? It seems very strange.
@x6p2n9q8a4 What do you mean by "different" training losses? Can you provide the example you ran with Unsloth and the one with normal HF? I can take a look to see if there are issues.
I use this code to fine-tune Llama-3-8B-Instruct (with unsloth):
For comparison, I used the following code to fine-tune Llama 3-8B-Instruct using trl:
Then, I compared whether the weights of the two methods are consistent after two steps:
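One way that comparison could be done is to count, per named weight, how many tensors differ by more than `atol` (a hypothetical sketch in plain Python with flattened weights as lists; with real PyTorch state dicts you would use `torch.allclose` per tensor instead):

```python
import math

def count_mismatches(weights_a, weights_b, atol=1e-4):
    """Count how many named weights differ by more than `atol` in any
    element, returning (num_mismatched, num_total) - e.g. (224, 739)."""
    mismatched = 0
    for name, tensor_a in weights_a.items():
        tensor_b = weights_b[name]
        if any(not math.isclose(x, y, rel_tol=0.0, abs_tol=atol)
               for x, y in zip(tensor_a, tensor_b)):
            mismatched += 1
    return mismatched, len(weights_a)

# Toy example: one of two "tensors" drifts by 5e-4, beyond atol=1e-4.
a = {"w0": [0.10, 0.20], "w1": [0.30, 0.40]}
b = {"w0": [0.10, 0.20], "w1": [0.30, 0.40 + 5e-4]}
print(count_mismatches(a, b))  # (1, 2)
```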
The results of Meta-Llama 3-8B-Instruct are as follows:
After only two steps, the difference between the Unsloth and TRL weights exceeds 1e-3 (for 5 out of 739 weights). May I ask what might be the cause?