nivibilla opened 1 month ago
Actually I'm stupid. I figured it out while I was typing the issue: I should be looking at the vocab size, not the tokenizer length.
Is it worth adding a check in the GKD trainer for this param so that the error is more readable for others? (Rough idea sketched below.)
Llama 3.1 70B and Llama 3.2 1B seem to have the same vocab size, so I will test with that. It will probably work.
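Something along these lines in the GKD trainer would have made the failure obvious. Just a rough sketch of the idea, not the actual TRL internals; the function and variable names here are made up:

```python
# Rough sketch of the kind of check I mean (not the actual TRL code).
# GKD computes a divergence between student and teacher logits, so the
# logit dimension (config.vocab_size) has to match, not len(tokenizer).
def check_vocab_compatibility(student_model, teacher_model):
    student_vocab = student_model.config.vocab_size
    teacher_vocab = teacher_model.config.vocab_size
    if student_vocab != teacher_vocab:
        raise ValueError(
            "GKD requires the student and teacher to have the same logit "
            f"dimension, but got student vocab_size={student_vocab} and "
            f"teacher vocab_size={teacher_vocab}. Note this is the model's "
            "vocab_size (logit dimension), not len(tokenizer)."
        )
```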
System Info
Latest TRL from source. I can't run trl env right now as the cluster is shut down, but I'm installing everything from source.
If required, I will restart the cluster and run it.
Information
Tasks
An officially supported task in the examples folder
Reproduction
For further details: the teacher is Qwen 2.5 72B Instruct and the student is Qwen 2.5 3B Instruct.
Training Config:
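I can't paste the exact config while the cluster is down, so here is a rough sketch of the setup from memory (assuming the standard GKDConfig / GKDTrainer API from TRL; the dataset path and hyperparameter values are placeholders, not my real ones):

```python
# Sketch of the setup, not the exact config I ran. Dataset path and
# hyperparameter values are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import GKDConfig, GKDTrainer

teacher_id = "Qwen/Qwen2.5-72B-Instruct"
student_id = "Qwen/Qwen2.5-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_id)

# My own conversational dataset (placeholder path); each row has a
# "messages" list in the usual chat format.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

training_args = GKDConfig(
    output_dir="qwen2.5-3b-gkd",
    per_device_train_batch_size=1,
    temperature=0.9,  # softmax temperature for the distillation loss
    lmbda=0.5,        # fraction of on-policy (student-generated) batches
    beta=0.5,         # interpolation coefficient of the generalized JSD
)

trainer = GKDTrainer(
    model=student_id,
    teacher_model=teacher_id,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```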
Error Trace:
I originally thought the difference might be due to the max seq len, which is 1382 based on my dataset's maximum, but the reported difference in dimensions is 128.
Expected behavior
The tokenizers for both Qwen 72B and 3B have a max length of 131072, so I'm not sure where the 151k numbers are coming from.
Since it's the same tokenizer, I assume it should be possible to distill them, right?
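For reference, this is the check I should have done first: the 131072 is the tokenizer's model_max_length, while the mismatch in the error is about the model's vocab_size (the logit dimension). A quick sanity-check sketch to compare the two checkpoints:

```python
# Compare tokenizer length / max length against the model's vocab_size.
# The tokenizer can be identical while vocab_size differs per checkpoint
# (e.g. due to embedding padding), which is what GKD actually cares about.
from transformers import AutoConfig, AutoTokenizer

for model_id in ("Qwen/Qwen2.5-72B-Instruct", "Qwen/Qwen2.5-3B-Instruct"):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = AutoConfig.from_pretrained(model_id)
    print(
        model_id,
        "len(tokenizer) =", len(tokenizer),
        "model_max_length =", tokenizer.model_max_length,
        "config.vocab_size =", config.vocab_size,
    )
```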