Closed: halixness closed this issue 1 month ago
Hey! One thing to check is whether you correctly set tokenizer.unk_token (from the gist it does not seem like it); without it, the tokenizer can produce None inputs when it encounters something outside your vocabulary.
Since the rest of the training seems to work as expected, I really think it's a tokenization issue here!
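A minimal sketch of what that fix could look like, assuming the gist loads a tokenizer trained with the tokenizers library (the file name and special-token strings below are hypothetical):

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical file produced by the tokenizer-training step in the gist.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="smiles_tokenizer.json")

# Register the special tokens explicitly. Without unk_token, characters
# outside the learned vocabulary may come back as None instead of a valid id.
tokenizer.add_special_tokens({
    "unk_token": "<unk>",
    "pad_token": "<pad>",
})
```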
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

transformers version: 4.42.4

Who can help?

@muellerzr @SunMarc @ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
https://gist.github.com/halixness/eadd6d1d89ae48597f70cb09f2b44139
Expected behavior
Hello, I have written a simple training script to train a gpt2-like model from scratch on a large dataset of strings (molecules in SMILES format). After around ~2k steps (batch_size=128, #samples = ~1.5M), I encounter the following error:

I have already tried using default_data_collator instead, and manually grouping samples as in the official example (a sketch of that grouping step is below). I'm not sure what could cause this error. Any suggestion is much appreciated!
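For reference, a sketch of that grouping step, mirroring the group_texts function from the official run_clm example (block_size and tokenized_dataset here are assumed names, not from the gist):

```python
from itertools import chain

block_size = 128  # hypothetical context length; match your model's n_positions

def group_texts(examples):
    # Concatenate all tokenized sequences, then cut them into fixed-size
    # blocks, as done in the official run_clm example.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```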