huggingface / transfer-learning-conv-ai

šŸ¦„ State-of-the-Art Conversational AI with Transfer Learning

dataset tokenization returns None values #99

Closed. dnns92 closed this issue 3 years ago.

dnns92 commented 3 years ago

My System:

Windows, conda environment convAI, CPU only (no CUDA runtime found; see the scrollback below), running interact.py with model gpt2.

What happened?

dataset["train"][0] = {'personality': [[11, None, 14594, None, 571, None, 30678, None, 5279, 14, None, 1], [11, None, 14594, None, 571, None, 581, None, 1108, 4407, None, 1], [11, None, 14594, None, 571, None, 3895, 3 ..

which is presumably not supposed to happen: the tokenized personality lists are interleaved with None values. Is anyone else seeing the same error?
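
To reproduce the check on the cache itself, here is a minimal sketch. It assumes the cache is the torch-serialized dataset that utils.get_dataset writes to the path shown in the scrollback below; adjust the path if yours differs:

import torch

# Scan the cached tokenized dataset recursively for None token ids.
# Path taken from the log line "Load tokenized dataset from cache at ...".
dataset = torch.load("./dataset_cache_GPT2Tokenizer")

def has_none(obj):
    # Walk nested dicts/lists of token ids, flagging any None entry.
    if obj is None:
        return True
    if isinstance(obj, dict):
        return any(has_none(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return any(has_none(v) for v in obj)
    return False

print("cache contains None ids:", has_none(dataset))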

What I tried so far:

I tried playing around with the script arguments, but had no luck so far.

Full Error:

  File "C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py", line 157, in <module>
    run()
  File "C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py", line 139, in run
    logger.info("Selected personality: %s", tokenizer.decode(chain(*personality)))
  File "C:\Users\nano\.conda\envs\convAI\lib\site-packages\transformers\tokenization_utils.py", line 1528, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "C:\Users\nano\.conda\envs\convAI\lib\site-packages\transformers\tokenization_utils.py", line 1498, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
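
The crash itself is straightforward: convert_ids_to_tokens casts every id with int(), and int(None) raises the TypeError above. As a hypothetical stopgap, the failing line in interact.py could drop the None entries before decoding, though this only hides the symptom of a corrupt cache:

from itertools import chain

# Stopgap only: filter out None ids before decoding. The real fix is to
# rebuild the corrupt dataset cache.
ids = [i for i in chain(*personality) if i is not None]
logger.info("Selected personality: %s", tokenizer.decode(ids))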

Full Scrollback:

C:\Users\nano\.conda\envs\convAI\python.exe C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py --model gpt2 --model_checkpoint gpt2
2021-01-17 19:24:14.047383: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-01-17 19:24:14.047530: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py:Namespace(dataset_cache='./dataset_cache', dataset_path='', device='cpu', max_history=2, max_length=20, min_length=1, model='gpt2', model_checkpoint='gpt2', no_sample=False, seed=0, temperature=0.7, top_k=0, top_p=0.9)
INFO:C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py:Get pretrained model and tokenizer
INFO:filelock:Lock 2441686875720 acquired on C:\Users\nano\.cache\torch\transformers\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json not found in cache or force_download set to True, downloading to C:\Users\nano\.cache\torch\transformers\tmp07zony6g
Downloading: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 1.04M/1.04M [00:01<00:00, 975kB/s]
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json in cache at C:\Users\nano\.cache\torch\transformers\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
INFO:transformers.file_utils:creating metadata file for C:\Users\nano\.cache\torch\transformers\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
INFO:filelock:Lock 2441686875720 released on C:\Users\nano\.cache\torch\transformers\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock
INFO:filelock:Lock 2441686874320 acquired on C:\Users\nano\.cache\torch\transformers\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt not found in cache or force_download set to True, downloading to C:\Users\nano\.cache\torch\transformers\tmp_mcocmdo
Downloading: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 456k/456k [00:00<00:00, 653kB/s]
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt in cache at C:\Users\nano\.cache\torch\transformers\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
INFO:transformers.file_utils:creating metadata file for C:\Users\nano\.cache\torch\transformers\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
INFO:filelock:Lock 2441686874320 released on C:\Users\nano\.cache\torch\transformers\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at C:\Users\nano\.cache\torch\transformers\f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at C:\Users\nano\.cache\torch\transformers\d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
INFO:filelock:Lock 2441686874992 acquired on C:\Users\nano\.cache\torch\transformers\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.db13c9bc9c7bdd738ec89e069621d88e05dc670366092d809a9cbcac6798e24e.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json not found in cache or force_download set to True, downloading to C:\Users\nano\.cache\torch\transformers\tmp0c9u5l10
Downloading: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 665/665 [00:00<00:00, 626kB/s]
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json in cache at C:\Users\nano\.cache\torch\transformers\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.db13c9bc9c7bdd738ec89e069621d88e05dc670366092d809a9cbcac6798e24e
INFO:transformers.file_utils:creating metadata file for C:\Users\nano\.cache\torch\transformers\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.db13c9bc9c7bdd738ec89e069621d88e05dc670366092d809a9cbcac6798e24e
INFO:filelock:Lock 2441686874992 released on C:\Users\nano\.cache\torch\transformers\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.db13c9bc9c7bdd738ec89e069621d88e05dc670366092d809a9cbcac6798e24e.lock
INFO:transformers.configuration_utils:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at C:\Users\nano\.cache\torch\transformers\4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.db13c9bc9c7bdd738ec89e069621d88e05dc670366092d809a9cbcac6798e24e
INFO:transformers.configuration_utils:Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": null,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 50257
}

INFO:filelock:Lock 2440366262592 acquired on C:\Users\nano\.cache\torch\transformers\4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1.lock
INFO:transformers.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin not found in cache or force_download set to True, downloading to C:\Users\nano\.cache\torch\transformers\tmpf1zbtwlp
Downloading: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 548M/548M [00:30<00:00, 17.8MB/s]
INFO:transformers.file_utils:storing https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin in cache at C:\Users\nano\.cache\torch\transformers\4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
INFO:transformers.file_utils:creating metadata file for C:\Users\nano\.cache\torch\transformers\4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
INFO:filelock:Lock 2440366262592 released on C:\Users\nano\.cache\torch\transformers\4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1.lock
INFO:transformers.modeling_utils:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin from cache at C:\Users\nano\.cache\torch\transformers\4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
INFO:transformers.tokenization_utils:Adding <bos> to the vocabulary
INFO:transformers.tokenization_utils:Assigning <bos> to the bos_token key of the tokenizer
INFO:transformers.tokenization_utils:Adding <eos> to the vocabulary
INFO:transformers.tokenization_utils:Assigning <eos> to the eos_token key of the tokenizer
INFO:transformers.tokenization_utils:Adding <pad> to the vocabulary
INFO:transformers.tokenization_utils:Assigning <pad> to the pad_token key of the tokenizer
INFO:transformers.tokenization_utils:Adding <speaker1> to the vocabulary
INFO:transformers.tokenization_utils:Adding <speaker2> to the vocabulary
INFO:transformers.tokenization_utils:Assigning ['<speaker1>', '<speaker2>'] to the additional_special_tokens key of the tokenizer
INFO:C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py:Sample a personality
INFO:C:\Users\nano\Documents\repos\transfer-learning-conv-ai\utils.py:Load tokenized dataset from cache at ./dataset_cache_GPT2Tokenizer
Traceback (most recent call last):
  File "C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py", line 157, in <module>
    run()
  File "C:/Users/nano/Documents/repos/transfer-learning-conv-ai/interact.py", line 139, in run
    logger.info("Selected personality: %s", tokenizer.decode(chain(*personality)))
  File "C:\Users\nano\.conda\envs\convAI\lib\site-packages\transformers\tokenization_utils.py", line 1528, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "C:\Users\nano\.conda\envs\convAI\lib\site-packages\transformers\tokenization_utils.py", line 1498, in convert_ids_to_tokens
    index = int(index)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
dnns92 commented 3 years ago

So for anyone hitting this problem: completely removing the whole directory and pulling it again solved it for me. It appears my dataset download failed halfway through, leaving behind a file that was intact enough to load but full of garbage (hence the None values).
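
A lighter-weight fix should also work (untested sketch): the scrollback shows the dataset being loaded from ./dataset_cache_GPT2Tokenizer, so deleting just that cache file ought to force the next run of interact.py to re-download and re-tokenize the dataset, without re-cloning the repository:

import os

# Remove only the (presumably corrupt) tokenized dataset cache; the next
# run should then re-download and re-tokenize the dataset from scratch.
cache_path = "./dataset_cache_GPT2Tokenizer"
if os.path.exists(cache_path):
    os.remove(cache_path)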