Closed: halixness closed this issue 1 month ago
Hey! One thing to check is whether you correctly set tokenizer.unk_token (from the gist it does not seem like it); without it, the tokenizer can produce None inputs when it encounters something outside your vocabulary.
Since the rest of the training seems to work as expected, I really think it's a tokenization issue here!
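A minimal sketch of what that fix could look like, assuming the gist loads a tokenizer trained with the tokenizers library (the file name and special-token strings below are hypothetical):

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical file produced by the tokenizer-training step in the gist.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="smiles_tokenizer.json")

# Register the special tokens explicitly. Without unk_token, characters
# outside the learned vocabulary may come back as None instead of a valid id.
tokenizer.add_special_tokens({
    "unk_token": "<unk>",
    "pad_token": "<pad>",
})
```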
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

transformers version: 4.42.4

Who can help?

@muellerzr @SunMarc @ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
https://gist.github.com/halixness/eadd6d1d89ae48597f70cb09f2b44139
Expected behavior
Hello, I have written a simple training script to train a gpt2-like model from scratch on a large dataset of strings (molecules in SMILES format). After around ~2k steps (batch_size=128, #samples = ~1.5M), I encounter the following error:

I have already tried using default_data_collator instead, and manually grouping samples as in the official example (a sketch of that grouping step is below). I'm not sure what could cause this error. Any suggestion is much appreciated!
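For reference, a sketch of that grouping step, mirroring the group_texts function from the official run_clm example (block_size and tokenized_dataset here are assumed names, not from the gist):

```python
from itertools import chain

block_size = 128  # hypothetical context length; match your model's n_positions

def group_texts(examples):
    # Concatenate all tokenized sequences, then cut them into fixed-size
    # blocks, as done in the official run_clm example.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```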