pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

error fast when tokenizer=null and dataset.packed=True #2055

Open felipemello1 opened 6 days ago

felipemello1 commented 6 days ago

When we set dataset.packed=True, we expect the tokenizer to have max_seq_len set. If it does not, we raise an error. However, this error is currently only raised after the model has already been loaded into memory. We should error much faster, possibly in the init of the recipe (see the sketch after the traceback below).

File "/data/users/felipemello/torchtune/torchtune/datasets/_alpaca.py", line 91, in alpaca_dataset
[rank5]:     raise ValueError(
[rank5]: ValueError: PackedDataset requires a max_seq_len to be set on the tokenizer.
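A minimal sketch of what "erroring in the init of the recipe" could look like (the recipe class and field access are hypothetical, assuming the config behaves like a nested dict), so the mismatch surfaces before any checkpoint is loaded:

```python
# Hypothetical sketch, not torchtune's actual recipe code: fail fast in __init__
# on the packed-dataset / max_seq_len mismatch, before the model is loaded.
class FullFinetuneRecipe:
    def __init__(self, cfg: dict) -> None:
        packed = cfg.get("dataset", {}).get("packed", False)
        max_seq_len = cfg.get("tokenizer", {}).get("max_seq_len")
        if packed and max_seq_len is None:
            raise ValueError(
                "dataset.packed=True requires tokenizer.max_seq_len to be set; "
                "set it in the config or disable packing."
            )
        # ... the rest of the existing init (model/checkpoint loading) follows ...
```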
RdoubleA commented 6 days ago

We can add this to the underlying function for `tune validate` and call config validation at the beginning of the recipe. In fact, we should move all of these similar validation checks out of the recipes into a single validation function that can be shared across all recipes.
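A rough sketch of that shared helper (the name `validate_config` and its placement are assumptions, not an existing torchtune API); both `tune validate` and each recipe's entry point could call it before anything heavy is instantiated:

```python
# Sketch only: validate_config is a hypothetical shared helper, not an
# existing torchtune function. Assumes an OmegaConf DictConfig-style config.
from omegaconf import DictConfig

def validate_config(cfg: DictConfig) -> None:
    """Raise on config combinations that would otherwise only fail at runtime."""
    packed = cfg.get("dataset", {}).get("packed", False)
    max_seq_len = cfg.get("tokenizer", {}).get("max_seq_len")
    if packed and max_seq_len is None:
        raise ValueError(
            "PackedDataset requires tokenizer.max_seq_len to be set when "
            "dataset.packed=True."
        )
    # Other cross-field checks currently scattered across recipes could be
    # collected here so every recipe shares the same early validation.
```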