sanderland opened 2 months ago
I am waiting for the #1513 breaking changes to land before starting continual pretraining of LLaMA-3 with an extended vocab and all.
Not sure when this merge will happen (v0.19.2, I guess), as it is critical for LLaMA-3 on non-English corpora.
Cheers, Steve
@thusinh1969 What are you finding wrong with 0.19.1?
The decoder was buggy for added tokens when extending the vocab for non-English. It is being fixed, I think.
https://github.com/meta-llama/llama3/issues/67
Steve
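For context, a round-trip check like the following would surface this kind of added-token decode bug. This is a minimal sketch, not a confirmed reproduction: the tokenizer path and sample text are hypothetical, and it assumes the `tokenizers` Python bindings.

```python
# Sketch of an added-token round-trip check, assuming a Llama-3
# tokenizer.json extended with non-English vocab (hypothetical path).
from tokenizers import Tokenizer

tok = Tokenizer.from_file("llama3-extended/tokenizer.json")

text = "xin chào"  # sample text covered by the added tokens (hypothetical)
ids = tok.encode(text, add_special_tokens=False).ids
roundtrip = tok.decode(ids)

# On the affected releases, decoding added tokens did not reproduce the input.
print(ids, repr(roundtrip))
assert roundtrip == text, f"decode mismatch: {roundtrip!r} != {text!r}"
```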
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Yep, the breaking change will be reverted, but we will still ship the new handling of added tokens for BPE. Just gimme a week!
With v0.19.0, id 112328 decodes to ' Arthropoda', which re-encodes to [(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')].
With v0.19.1, id 112328 decodes to ' Arthropoda', which re-encodes to [(112328, ' Arthropoda')].
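This comparison can be reproduced directly. The sketch below uses the `tokenizers` Python bindings; the tokenizer path is a placeholder for the Llama-3 tokenizer, and the commented outputs are the ones reported above.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path to the Llama-3 tokenizer

piece = tok.decode([112328])  # ' Arthropoda'
enc = tok.encode(piece, add_special_tokens=False)
print(list(zip(enc.ids, enc.tokens)))
# v0.19.0: [(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]
# v0.19.1: [(112328, ' Arthropoda')]
```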
I have good evidence that the new behaviour matches how the model was trained, but the announcement of the patch release should perhaps be a little louder in advising users to, e.g., retokenize all training data for the affected model families.
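As a sketch of what that advice could look like in practice (the helper and paths below are hypothetical, not part of any release): re-encode the raw text with the current version and compare against the ids produced by the previous one.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # the v0.19.1 tokenizer (placeholder path)

def needs_retokenization(raw_text: str, old_ids: list[int]) -> bool:
    """True if the current tokenizer encodes the text differently than before."""
    new_ids = tok.encode(raw_text, add_special_tokens=False).ids
    return new_ids != old_ids

# ' Arthropoda' was three ids under v0.19.0 but a single id under v0.19.1.
print(needs_retokenization(" Arthropoda", [1676, 98643, 14320]))  # True
```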