microsoft / vscode-ai-toolkit


[FEAT REQ] pad_to_max_len -> Union[bool | PadToken] #35

Closed. kfsone closed this issue 5 months ago

kfsone commented 6 months ago

Trying to enable padding, you get a wall of errors. While I absolutely recognize this is an Olive issue, it's far less likely to get traction from an end user going directly to Olive than from a bulk consumer like ai-studio.

Basically, this would simplify working around errors like the following:

line 45, in pre_process
    return self.config.pre_process(dataset, **self.config.pre_process_params)
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/olive/data/component/pre_process_data.py", line 180, in text_generation_huggingface_pre_process
    return text_gen_corpus_pre_process(dataset, tokenizer, all_kwargs)
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/olive/data/component/text_generation.py", line 342, in text_gen_corpus_pre_process
    batched_input_ids = batch_tokenize_text(text_list, tokenizer, args)
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/olive/data/component/text_generation.py", line 522, in batch_tokenize_text
    batched_encodings = tokenizer(
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2790, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2876, in _call_one
    return self.batch_encode_plus(
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3058, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/home/oliver/miniconda3/envs/mistral-7b-env/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2695, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
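For reference, the manual workaround the ValueError itself points at looks like this in plain transformers (a minimal sketch; the Mistral checkpoint name is illustrative):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "mistralai/Mistral-7B-v0.1"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Option 1: reuse the EOS token as the pad token; no new embedding row needed.
    tokenizer.pad_token = tokenizer.eos_token

    # Option 2: register a dedicated pad token, then grow the embedding table
    # so the new token id has a row to look up.
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))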

The request is to allow a configuration like this, where a string value supplies the pad token directly:

                "component_kwargs": {
                    "pre_process_data": {
                        "dataset_type": "corpus",
                        "text_cols": ["q","a"],
                        "text_template": "### Question:\n{q}\n### Answer:\n{a}",
                        "corpus_strategy": "line-by-line",
                        "source_max_len": 2048,
                        "pad_to_max_len": '?PAD!',
                        "use_attention_mask": true
                    }
                }
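A hypothetical sketch of how the pre-processing step could resolve the proposed Union[bool, str] value; the function name and the EOS fallback behavior are assumptions, not Olive's actual API:

    from typing import Union

    def resolve_pad_to_max_len(tokenizer, pad_to_max_len: Union[bool, str]) -> bool:
        """Hypothetical: map a bool-or-string setting onto the tokenizer."""
        if isinstance(pad_to_max_len, str):
            # A string names the pad token to register, e.g. "?PAD!" above.
            tokenizer.add_special_tokens({"pad_token": pad_to_max_len})
            return True
        if pad_to_max_len and tokenizer.pad_token is None:
            # Assumed fallback for plain True: reuse EOS instead of raising.
            tokenizer.pad_token = tokenizer.eos_token
        return bool(pad_to_max_len)

With that in place, True would keep today's behavior (padding enabled, falling back to EOS when no pad token exists), while a string would both enable padding and supply the token in one setting.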
acube3 commented 5 months ago

Did you try raising the issue here: https://github.com/microsoft/Olive/issues