huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TypeError: TextInputSequence must be str #13115

Closed justwangqian closed 3 years ago

justwangqian commented 3 years ago

Describe the bug

I use dataset.map() to encode the data, but get this error.

I use the following code to export the data to local CSV files. As I use Colab, local files are more convenient.

```python
import pandas as pd
from datasets import load_dataset

# Export the GLUE MNLI splits to local CSV files.
dataset = load_dataset(path='glue', name='mnli')
keys = ['train', 'validation_matched', 'validation_mismatched']
for k in keys:
    result = []
    for record in dataset[k]:
        c1, c2, c3 = record['premise'], record['hypothesis'], record['label']
        # keep only rows with non-empty texts and a valid label
        if c1 and c2 and c3 in {0, 1, 2}:
            result.append((c1, c2, c3))
    result = pd.DataFrame(result, columns=['premise', 'hypothesis', 'label'])
    result.to_csv('mnli' + k + '.csv', index=False)
```

Then I process the data like this, and get the error.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def encode(batch):
    return tokenizer(
        batch['premise'],
        batch['hypothesis'],
        max_length=MAXLEN,
        padding='max_length',
        truncation=True,
    )

train_dict = load_dataset('csv', data_files=train_data_path)
train_dataset = train_dict['train']
train_dataset = train_dataset.map(encode, batched=True)
```

Expected results

encode the data successfully.

Actual results

```
TypeError                                 Traceback (most recent call last)
in <module>()
      5 val_dataset = val_dict['train']
      6 
----> 7 train_dataset = train_dataset.map(encode, batched=True)
      8 val_dataset = val_dataset.map(encode, batched=True)
      9 

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in map(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   1680                 new_fingerprint=new_fingerprint,
   1681                 disable_tqdm=disable_tqdm,
-> 1682                 desc=desc,
   1683             )
   1684         else:

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    183         }
    184         # apply actual function
--> 185         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    186         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    187         # re-apply format to the output

/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    395         # Call actual function
    396 
--> 397         out = func(self, *args, **kwargs)
    398 
    399         # Update fingerprint of in-place transforms + update in-place history of transforms

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc)
   2018                         indices,
   2019                         check_same_num_examples=len(input_dataset.list_indexes()) > 0,
-> 2020                         offset=offset,
   2021                     )
   2022                 except NumExamplesMismatch:

/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples, offset)
   1904             effective_indices = [i + offset for i in indices] if isinstance(indices, list) else indices + offset
   1905             processed_inputs = (
-> 1906                 function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
   1907             )
   1908             if update_data is None:

in encode(batch)
      6         max_length=MAXLEN,
      7         padding='max_length',
----> 8         truncation=True
      9     )

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2383                 return_length=return_length,
   2384                 verbose=verbose,
-> 2385                 **kwargs,
   2386             )
   2387         else:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2568             return_length=return_length,
   2569             verbose=verbose,
-> 2570             **kwargs,
   2571         )
   2572 

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    406             batch_text_or_text_pairs,
    407             add_special_tokens=add_special_tokens,
--> 408             is_pretokenized=is_split_into_words,
    409         )
    410 

TypeError: TextInputSequence must be str
```

Environment info

- `datasets` version: 1.11.0
- Platform: Colab
- Python version: 3.7
- PyArrow version:

@lhoestq
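For context, the fast tokenizer raises `TextInputSequence must be str` whenever one of the inputs it receives is not a plain string, and a frequent trigger is `None` values produced when empty CSV cells are loaded back by `load_dataset('csv', ...)`. A minimal sanity check along those lines, assuming the `train_dataset` variable from the snippet above, might look like:

```python
# Sketch of a sanity check (not from the original report): count rows whose
# text columns came back as None after the CSV round trip.
bad_rows = train_dataset.filter(
    lambda ex: ex['premise'] is None or ex['hypothesis'] is None
)
print(f'rows with missing text: {len(bad_rows)}')
```

If that count is non-zero, dropping or cleaning those rows before calling `map(encode, batched=True)` avoids the error.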
justwangqian commented 3 years ago

By the way, the same code works when I process the XNLI dataset.

albertvillanova commented 3 years ago

Hi @justwangqian,

I think your issue is with the transformers library. I guess you should update it, but I prefer transferring your issue to them, so that they can keep the record.

Feel free to reopen an issue in datasets if there is finally a bug here. :)
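For reference, one quick way to confirm which versions are installed in a Colab runtime before upgrading is, for example:

```python
import datasets
import transformers

# Print the installed versions; upgrading (e.g. `pip install -U transformers datasets`)
# is often the first thing to try for version-specific bugs.
print('transformers:', transformers.__version__)
print('datasets:', datasets.__version__)
```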

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

devinraina commented 7 months ago

Having the same issue.

justwangqian commented 7 months ago

I have received your email.

devinraina commented 7 months ago

Update: It was due to the data I was passing into the vector database.
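In cases like this, the usual fix is to make sure every text field is a plain Python string before it reaches the tokenizer. A minimal sketch of such a guard, reusing the `tokenizer` and `MAXLEN` names from the original report (`clean_text` is a hypothetical helper):

```python
def clean_text(values):
    # Coerce None and other non-string values to strings so the fast
    # tokenizer never receives anything other than str.
    return ['' if v is None else str(v) for v in values]

def encode(batch):
    return tokenizer(
        clean_text(batch['premise']),
        clean_text(batch['hypothesis']),
        max_length=MAXLEN,
        padding='max_length',
        truncation=True,
    )
```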