Hi, thanks. Are you sure this is an error and not just a warning? This should be a warning you can safely ignore (it happens because we extract all the possible token ids of each domain by tokenizing all the words in each domain).
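To illustrate (a minimal sketch, not the repo's actual code, assuming the t5-base tokenizer): tokenizing one string that concatenates every word of a domain easily exceeds the tokenizer's 512-token limit and prints this warning, but the result is only used to collect token ids, so nothing is ever run through the model:

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# hypothetical stand-in for "all the words in a domain"
domain_words = ["flight", "boarding", "crew", "delayed", "luggage"] * 5000

# this call prints the "Token indices sequence length is longer than ..."
# warning because the sequence far exceeds 512, but no truncation is needed
# and no model forward pass happens here
ids = tokenizer(" ".join(domain_words), add_special_tokens=False)["input_ids"]

# only the set of token ids is kept
domain_token_ids = set(ids)
print(len(ids), len(domain_token_ids))
```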
Yeah, you are right about it being a warning, but the program terminates immediately and doesn't train.
Hi @Akshay0799, we added a lite, simple single script, docogen_lite.py -- I believe running it will be much easier. The lite version does not use PyTorch Lightning (I think your errors are related to the lightning package). The new script is simpler and uses only the transformers and datasets packages (plus nltk / sklearn...).
Hope it will work for you!
Thank you so much!
I've been trying to train the model on the entire dataset, but I kept facing the "CUDA out of memory" error, so I thought of reducing the batch size. This is my config file:
"concept_to_control": "domain", "values_to_control": ["airline", "dvd", "electronics", "kitchen"], "splits_for_training": ["unlabeled"], "splits_for_augmentations": ["train", "validation"], "t5_model_name": "t5-base", "max_seq_len": 96, "seed": 42, "batch_size": 16, "min_occurrences": 10, "smoothing": [1, 5, 7], "n_orientations": 4, "top_occurrences_threshold": 100, "n_grams": 3, "masked_output": false, "threshold": 0.08, "top_n": 0.05, "noise": 0.05, "unknown_orientation_p": 0.05, "generator_epochs": 5, "generator_classifier_epochs": 3, "generator_classifier_batch_size": 16, "fast_dev": false
Even now, it throws the following error when I try to train the T5 generator model: "Token indices sequence length is longer than the specified maximum sequence length for this model (67807 > 512). Running this sequence through the model will result in indexing errors".
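(For context, a hedged note rather than the repo's documented behavior: 67,807 tokens is far too long to be a single review, so this is most likely the same domain-vocabulary tokenization pass discussed above, while actual training batches are truncated to max_seq_len and never reach the 512-token limit, e.g.:)

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# hypothetical training example; max_seq_len=96 comes from the config above
batch = tokenizer(
    ["this airline review could be arbitrarily long ..."],
    max_length=96,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 96]) -- well under the 512 limit
```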