imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data
https://openchat.team
Apache License 2.0
5.26k stars 398 forks

Issues tokenizing dataset #109

Open tokenatlas opened 11 months ago

tokenatlas commented 11 months ago

Thanks for open-sourcing this!

I am trying to follow the instructions for tokenizing the data, but it fails with the stack trace below. I'm just using two lines of dummy data. Any ideas where this issue is coming from? Thanks!

python -m ochat.data.generate_dataset --model-type "openchat_v3.2_mistral" --model-path "imone/Mistral_7B_with_EOT_token" --in-files data.jsonl --out-prefix pretok.tok
...
...
...
(convert_conversation_batch pid=13365) Chunk finish
(convert_conversation_batch pid=13205) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [repeated 28x across cluster]
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/ptca/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 167, in <module>
    generate_dataset(**vars(args))
  File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 149, in generate_dataset
    generate_split(model_type, model_path, train_conversations, "train", out_prefix, per_sequence_loss)
  File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 131, in generate_split
    parquet.write_table(pyarrow.concat_tables([ray.get(handle) for handle in handles]), f"{out_prefix}.{split_name}.parquet")
  File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 131, in <listcomp>
    parquet.write_table(pyarrow.concat_tables([ray.get(handle) for handle in handles]), f"{out_prefix}.{split_name}.parquet")
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::convert_conversation_batch() (pid=13368, ip=10.4.66.23)
  File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 78, in convert_conversation_batch
    tokens_list, weights_list = conv_template.tokenize_conversations(batch, inference=False, seq_level_weight=per_sequence_loss)
  File "/tmp/och/openchat/ochat/config/conversation_template.py", line 61, in tokenize_conversations
    sys_mappings = dict(zip(sys_mappings, self._tokenize(sys_mappings)))
  File "/tmp/och/openchat/ochat/config/conversation_template.py", line 42, in _tokenize
    return self.tokenizer(strings, split_special_tokens=ignore_special, return_attention_mask=False, add_special_tokens=False).input_ids
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2798, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2884, in _call_one
    return self.batch_encode_plus(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3075, in batch_encode_plus
    return self._batch_encode_plus(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 807, in _batch_encode_plus
    batch_outputs = self._batch_prepare_for_model(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 879, in _batch_prepare_for_model
    batch_outputs = self.pad(
  File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3214, in pad
    raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []
suinkim28 commented 11 months ago

It might be that the provided dataset contains too few samples for the default number of splits, leaving some batches empty. Setting the number of splits to 1 solved the issue.

https://github.com/imoneoi/openchat/blob/master/ochat/data/generate_dataset.py#L128
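To illustrate why too many splits can trigger the `ValueError` above, here is a minimal, self-contained sketch (the helper name `split_batches` and the round-robin strategy are illustrative, not taken from the OpenChat source): when a two-sample dataset is divided into more splits than there are samples, some batches come out empty, and passing an empty list to the tokenizer is what raises the error in the traceback.

```python
# Illustrative sketch only: shows how splitting a tiny dataset into
# many batches leaves some batches empty. Names are hypothetical,
# not copied from ochat/data/generate_dataset.py.

def split_batches(items, num_splits):
    """Round-robin distribution of items into num_splits batches."""
    batches = [[] for _ in range(num_splits)]
    for i, item in enumerate(items):
        batches[i % num_splits].append(item)
    return batches

# Two dummy conversations, as in the original report.
conversations = ["conv_a", "conv_b"]

batches = split_batches(conversations, num_splits=8)

# Six of the eight batches are empty; a worker that receives an empty
# batch would call the tokenizer with [], producing the
# "you provided []" ValueError seen above.
empty_count = sum(1 for b in batches if not b)
print(empty_count)  # 6
```

With `num_splits=1` every sample lands in the single batch, so no worker sees an empty input, which matches the workaround suggested above.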

SyedSherjeel commented 10 months ago

@tokenatlas did you resolve this issue or find any workaround? I am facing the same issue.