openai / gpt-2-output-dataset

Dataset of GPT-2 outputs for research in detection, biases, and more
MIT License

Training code fails on zero-length inputs (which appear in several of the datasets included by the author/used in the report) #51

Open veenapaddy opened 1 year ago

veenapaddy commented 1 year ago

Some of the training data (specifically, the GPT-2-generated datasets) contains texts of length 0. This causes training to error out, and would cause the same failure at inference. Is this expected? Please see the error message below:

Loading data/webtext.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 49837.49it/s]
Loading data/webtext.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48536.65it/s]
Loading data/webtext.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48406.80it/s]
Loading data/xl-1542M.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 46902.35it/s]
Loading data/xl-1542M.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45678.24it/s]
Loading data/xl-1542M.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45654.67it/s]
Epoch 1:  10%|█         | 2098/20834 [22:20<3:19:33,  1.56it/s, acc=0.856, loss=0.297]
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/local/home/openai_code/detector/train.py", line 324, in <module>
    run(**vars(args))
  File "/local/home/openai_code/detector/train.py", line 255, in run
    train_metrics = train(model, optimizer, device, train_loader, f'Epoch {epoch}')
  File "/local/home/openai_code/detector/train.py", line 108, in train
    for texts, masks, labels in loop:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/detector/dataset.py", line 60, in __getitem__
    tokens = self.tokenizer.encode(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1427, in encode
    **kwargs,
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1569, in encode_plus
    first_ids = get_input_ids(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1541, in get_input_ids
    tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1265, in tokenize
    text = self.prepare_for_tokenization(text, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_roberta.py", line 239, in prepare_for_tokenization
    if add_prefix_space and not text[0].isspace():
IndexError: string index out of range
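
The failure reproduces outside the training loop with just the tokenizer. A minimal sketch, assuming a roberta-base tokenizer (the RoBERTa family the detector uses, per tokenization_roberta.py in the traceback) and the transformers version shown above:

from transformers import RobertaTokenizer

# Same call path as detector/dataset.py line 60 in the traceback.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

print(tokenizer.encode("a non-empty text"))  # works as expected

try:
    tokenizer.encode("")  # empty string, as found in the affected .jsonl files
except IndexError as err:
    # With the transformers version in the traceback, prepare_for_tokenization
    # indexes text[0] before prepending the prefix space, so an empty string
    # raises here.
    print("IndexError:", err)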

The following dataset files contain entries whose text has length 0:

./data/large-762M.train.jsonl
./data/large-762M.valid.jsonl
./data/medium-345M.train.jsonl
./data/small-117M100.valid.jsonl
./data/small-117M.test.jsonl
./data/small-117M.train.jsonl
./data/small-117M.valid.jsonl
./data/xl-1542M.train.jsonl
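
For reference, a scan along these lines turns them up; a quick sketch, assuming the ./data layout above and the "text" field used in these .jsonl files:

import glob
import json

# Count records whose "text" field is empty in every downloaded .jsonl file.
for path in sorted(glob.glob("./data/*.jsonl")):
    with open(path) as f:
        empty = sum(1 for line in f if len(json.loads(line).get("text", "")) == 0)
    if empty:
        print(f"{path}: {empty} zero-length text(s)")

Skipping or filtering these records before they reach tokenizer.encode would presumably avoid the crash, but it is unclear whether that matches the setup used for the report.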