Closed justwangqian closed 3 years ago
By the way, the same code works when I process the XNLI dataset.
Hi @justwangqian,
I think your issue is with the transformers library. I guess you should update it, but I prefer transferring your issue to them, so that they can keep the record.
Feel free to reopen an issue in datasets if it turns out there is a bug here. :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
having same issue
(Automatic reply) I have received your email.
Update: It was due to the data I was passing in the vector database.
Describe the bug
I use dataset.map() to encode the data, but I get this problem. I use the code below to export the data to local CSV files; since I use Colab, local files are more convenient.
```python
import pandas as pd
from datasets import load_dataset

dataset = load_dataset(path='glue', name='mnli')
keys = ['train', 'validation_matched', 'validation_mismatched']
for k in keys:
    result = []
    for record in dataset[k]:
        c1, c2, c3 = record['premise'], record['hypothesis'], record['label']
        if c1 and c2 and c3 in {0, 1, 2}:
            result.append((c1, c2, c3))
    result = pd.DataFrame(result, columns=['premise', 'hypothesis', 'label'])
    result.to_csv('mnli' + k + '.csv', index=False)
```
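One possible cause worth checking (an assumption on my side, not confirmed from the traceback): when a CSV written this way is read back with `load_dataset('csv')`, empty cells come back as `None`, and passing `None` into a tokenizer raises a `TypeError`. A minimal, self-contained sketch of dropping such rows before mapping (the rows below are made up for illustration):

```python
# Hypothetical rows as they might come back from a CSV loader:
# an empty cell is parsed as None, which a tokenizer cannot handle.
rows = [
    {'premise': 'A man is eating.', 'hypothesis': 'Someone eats.', 'label': 0},
    {'premise': None, 'hypothesis': 'Empty cell.', 'label': 1},
]

# Keep only rows where both text fields are actual strings.
clean = [
    r for r in rows
    if isinstance(r['premise'], str) and isinstance(r['hypothesis'], str)
]
```

The same predicate could be passed to `datasets`' `Dataset.filter` before calling `map`.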
Then I process the data like this, and get the issue.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def encode(batch):
    return tokenizer(batch['premise'], batch['hypothesis'],
                     max_length=MAXLEN, padding='max_length', truncation=True)

train_dict = load_dataset('csv', data_files=train_data_path)
train_dataset = train_dict['train']
train_dataset = train_dataset.map(encode, batched=True)
```
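For context on why `encode` receives `batch['premise']` as a list: with `batched=True`, `map` passes the function a dict of column name to list of values and expects a dict of columns back. A hypothetical mini-implementation of that contract (a sketch for illustration only, not the real `datasets` code):

```python
# Sketch of a batched map over a column-oriented table:
# the mapped function sees {column: [values...]} slices and must
# return a dict whose values are lists of the same batch length.
def batched_map(table, fn, batch_size=2):
    n = len(next(iter(table.values())))
    out = {}
    for start in range(0, n, batch_size):
        batch = {k: v[start:start + batch_size] for k, v in table.items()}
        result = fn(batch)
        for k, v in result.items():
            out.setdefault(k, []).extend(v)
    return out

table = {'premise': ['a', 'b', 'c'], 'hypothesis': ['x', 'y', 'z']}
lengths = batched_map(
    table,
    lambda b: {'len': [len(p) + len(h)
                       for p, h in zip(b['premise'], b['hypothesis'])]},
)
```

This is why a single `None` in a column surfaces as a `TypeError` inside the mapped function: it arrives embedded in the list handed to the tokenizer.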
Expected results
The data is encoded successfully.
Actual results
TypeError Traceback (most recent call last)