Training/Tokenizing Sequence Length error.

swcrazyfan commented 3 years ago

I'm trying to train a model based on GPT Neo 125M, and I keep getting this error. It continues to train and even create text, but I'm pretty sure this will affect my final model. Is there a way I should prepare the data or a setting I should change?

Currently, I'm using text that was exported from a PDF. I did some basic preprocessing, but I'm not sure if it was enough.

04/30/2021 07:53:28 — INFO — aitextgen — Loading text from tbt.txt with generation length of 2048.
100% 46/46 [00:00<00:00, 141.07it/s]
04/30/2021 07:53:28 — INFO — aitextgen.TokenDataset — Encoding 46 sets of tokens from tbt.txt.
Token indices sequence length is longer than the specified maximum sequence length for this model (7310 > 2048). Running this sequence through the model will result in indexing errors
04/30/2021 07:53:28 — INFO — pytorch_lightning.utilities.distributed — GPU available: True, used: True
04/30/2021 07:53:28 — INFO — pytorch_lightning.utilities.distributed — TPU available: False, using: 0 TPU cores
04/30/2021 07:53:28 — INFO — pytorch_lightning.trainer.connectors.accelerator_connector — Using native 16bit precision.
04/30/2021 07:53:28 — INFO — pytorch_lightning.accelerators.gpu — LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
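For readers wondering what the "7310 > 2048" warning in the log above means: one encoded piece of text came out at 7310 tokens, while the model accepts at most 2048 at a time. The sketch below shows the general idea of cutting a long token sequence into windows that fit. It is only an illustration in plain Python; `split_into_windows` is a hypothetical helper, the token ids are placeholders, and nothing here is claimed to be how aitextgen handles this internally.

```python
def split_into_windows(token_ids, max_length=2048):
    """Split a list of token ids into consecutive windows of at most max_length.

    Hypothetical helper for illustration; not part of aitextgen.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Placeholder ids standing in for the 7310-token sequence from the warning.
fake_ids = list(range(7310))
windows = split_into_windows(fake_ids)
window_lengths = [len(w) for w in windows]
print(len(windows), window_lengths)
```

With these numbers the sequence splits into three full windows and one shorter final window, so no single window exceeds the model's limit.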

minimaxir commented 3 years ago

That's a weird notification. There may be a bug, although it shouldn't affect the final training.

How is your dataset structured?

swcrazyfan commented 3 years ago

To be honest, I'm pretty new to ML, so I'm not sure if I structured the data correctly or even how to tell you the way it's structured.

It's the text of a book. Right now, it's basically just pure text without empty lines. Roughly each paragraph or chapter title is its own line.

Do you know of a good place to learn the basics of data preprocessing? Most things I've found seem to assume more knowledge than I currently have, but I'm trying to learn fast haha.

mesotron commented 3 years ago

I'm getting this notification as well. I have extremely long stretches of text between newlines in my dataset, so maybe that's it. In any case, it doesn't seem to be having trouble as far as I can tell. (edit: that is, didn't seem to be having trouble, as of last week; training of GPT-Neo in Colab currently seems to be broken as of 06-May-2021)
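If long stretches of text between newlines are the suspected trigger, one quick way to check a dataset is to list the lines that exceed some size. The sketch below is a rough, assumption-laden check: `find_long_lines` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for tokens. A real subword tokenizer will usually produce more tokens than words, so treat the result as a lower bound.

```python
def find_long_lines(lines, limit=2048, count_units=None):
    """Return (line_number, unit_count) for each line whose count exceeds limit.

    count_units defaults to a whitespace word count, which is only a rough
    proxy for the model's real token count.
    """
    if count_units is None:
        count_units = lambda text: len(text.split())
    flagged = []
    for number, line in enumerate(lines, start=1):
        units = count_units(line)
        if units > limit:
            flagged.append((number, units))
    return flagged


# Made-up sample data: a title, one very long paragraph, one short paragraph.
sample = ["Chapter 1", "word " * 3000, "A short paragraph."]
long_lines = find_long_lines(sample)
print(long_lines)
```

To run this on a real file, the lines could be read with `open(path, encoding="utf-8").read().splitlines()`, and a real tokenizer's length function could be passed as `count_units` for an exact count.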

swcrazyfan commented 3 years ago

> That's a weird notification. There may be a bug, although it shouldn't affect the final training.
>
> How is your dataset structured?

As you said, it doesn't seem to affect the results. Thank you!

minimaxir / aitextgen

Training/Tokenizing Sequence Length error. #124