dataset = datasets.load_from_disk(args.dataset_path)
print(f"\n{len(dataset)=}\n")
for key in dataset[0]:
    print(key)
But when the data is read through data_collator, it raises an error:
File "/checkpoint/binary/train_package/finetune.py", line 125, in <module>
main()
File "/checkpoint/binary/train_package/finetune.py", line 118, in main
trainer.train()
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/root/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/envs/python3.8.13/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/checkpoint/binary/train_package/finetune.py", line 28, in data_collator
seq_len = feature["seq_len"]
KeyError: 'seq_len'
After generating the training data with tokenize_dataset_rows, I can read input_ids and seq_len directly from the datasets object, but the data_collator fails with the error above when loading the data.
Is this a package version mismatch, or is something else wrong?
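One likely cause, rather than a version mismatch: by default `transformers.Trainer` has `remove_unused_columns=True`, which drops dataset columns that don't match the model's `forward()` signature, so a custom column like `seq_len` can disappear before it ever reaches the collator. Passing `remove_unused_columns=False` in `TrainingArguments` should preserve it. As a workaround, the collator can also recompute the value itself; below is a minimal defensive sketch (hypothetical, not the actual `finetune.py` implementation, and the batch layout is only illustrative):

```python
def data_collator(features):
    """Collate features into a batch, tolerating a missing 'seq_len' column."""
    batch = {"input_ids": [], "seq_len": []}
    for feature in features:
        input_ids = feature["input_ids"]
        # Fall back to len(input_ids) if 'seq_len' was stripped,
        # e.g. by Trainer's remove_unused_columns=True default.
        seq_len = feature.get("seq_len", len(input_ids))
        batch["input_ids"].append(input_ids)
        batch["seq_len"].append(seq_len)
    return batch
```

A quicker check before changing code: print `train_dataset.column_names` right before constructing the Trainer, and again inside the collator, to confirm whether the column is being dropped in between.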