Hi, thanks. Are you sure this is an error and not just a warning? This should be a warning you can safely ignore (it happens because we extract all the possible token ids of each domain by tokenizing all the words in each domain).
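To illustrate (a minimal sketch, not the repo's actual code, assuming the t5-base tokenizer): tokenizing one string that concatenates every word of a domain easily exceeds the tokenizer's 512-token limit and prints this warning, but the result is only used to collect token ids, so nothing is ever run through the model:

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# hypothetical stand-in for "all the words in a domain"
domain_words = ["flight", "boarding", "crew", "delayed", "luggage"] * 5000

# this call prints the "Token indices sequence length is longer than ..."
# warning because the sequence far exceeds 512, but no truncation is needed
# and no model forward pass happens here
ids = tokenizer(" ".join(domain_words), add_special_tokens=False)["input_ids"]

# only the set of token ids is kept
domain_token_ids = set(ids)
print(len(ids), len(domain_token_ids))
```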
Yeah, you are right about it being a warning, but the program terminates immediately and doesn't train.
Hi @Akshay0799, we added a lite, simple single script, docogen_lite.py -- I believe running it will be much easier. The lite version does not use PyTorch Lightning (I think your errors are related to the lightning package). The new script is simpler and uses only the transformers and datasets packages (plus nltk / sklearn...).
Hope it will work for you!
Thank you so much!
I've been trying to train the model on the entire dataset, but I kept facing the "CUDA out of memory" error, so I thought of reducing the batch size. This is my config file:
"concept_to_control": "domain", "values_to_control": ["airline", "dvd", "electronics", "kitchen"], "splits_for_training": ["unlabeled"], "splits_for_augmentations": ["train", "validation"], "t5_model_name": "t5-base", "max_seq_len": 96, "seed": 42, "batch_size": 16, "min_occurrences": 10, "smoothing": [1, 5, 7], "n_orientations": 4, "top_occurrences_threshold": 100, "n_grams": 3, "masked_output": false, "threshold": 0.08, "top_n": 0.05, "noise": 0.05, "unknown_orientation_p": 0.05, "generator_epochs": 5, "generator_classifier_epochs": 3, "generator_classifier_batch_size": 16, "fast_dev": false
Even now, it throws the following error when I try to train the T5 generator model: "Token indices sequence length is longer than the specified maximum sequence length for this model (67807 > 512). Running this sequence through the model will result in indexing errors".
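(For context, a hedged note rather than the repo's documented behavior: 67,807 tokens is far too long to be a single review, so this is most likely the same domain-vocabulary tokenization pass discussed above, while actual training batches are truncated to max_seq_len and never reach the 512-token limit, e.g.:)

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# hypothetical training example; max_seq_len=96 comes from the config above
batch = tokenizer(
    ["this airline review could be arbitrarily long ..."],
    max_length=96,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 96]) -- well under the 512 limit
```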