modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
4.23k stars · 373 forks

Error when installing swift from source #845

Closed zhangfan-algo closed 6 months ago

zhangfan-algo commented 6 months ago

(screenshot attached) Hardware setup: 4 machines × 8 A800 GPUs each.

zhangfan-algo commented 6 months ago

I suspect this is caused by a multi-threading conflict.

Jintao-Huang commented 6 months ago

Uninstall it completely and reinstall.

zhangfan-algo commented 6 months ago

For now I'm test-running on a single machine and hit this error:

```text
Traceback (most recent call last):
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/examples/pytorch/llm/llm_sft.py", line 7, in <module>
    output = sft_main()
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, *kwargs)
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/llm/sft.py", line 265, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/trainers/trainers.py", line 54, in train
    res = super().train(args, **kwargs)
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/apps1/zhangfan/anaconda3/envs/swift/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/llm/utils/template.py", line 1050, in data_collator
    res = super().data_collator(batch, padding_to)
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/llm/utils/template.py", line 436, in data_collator
    input_ids = [torch.tensor(b['input_ids']) for b in batch]
  File "/mnt/pfs/zhangfan/homework_correction/swift_0429/swift/llm/utils/template.py", line 436, in <listcomp>
    input_ids = [torch.tensor(b['input_ids']) for b in batch]
KeyError: 'input_ids'
```
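The failing line in `template.py` simply indexes every sample dict by `'input_ids'`; if preprocessing silently dropped a sample's encoding, the key is missing and the collator raises. A minimal sketch of the failure mode (plain lists instead of tensors, function name hypothetical):

```python
def collate_input_ids(batch):
    # Mirrors the failing line in swift's template.py: every sample dict
    # is assumed to carry an 'input_ids' key produced during preprocessing.
    return [b["input_ids"] for b in batch]

ok = collate_input_ids([{"input_ids": [1, 2]}, {"input_ids": [3]}])
print(ok)  # [[1, 2], [3]]

try:
    # A sample whose encoding step failed has no 'input_ids' at all.
    collate_input_ids([{"query": "...", "images": ["a.png"]}])
except KeyError as exc:
    print("KeyError:", exc)  # KeyError: 'input_ids'
```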

zhangfan-algo commented 6 months ago

Dataset format (one JSON object per line; note the `response` is built by string concatenation in Python, so this is pseudo-JSON):

```text
{"query":"这是学生书写的数字和数学公式相关内容。请你准确说出图片中手写体内容是什么.数学公式用latex表达。你的输出格式必须是:图片中手写体内容是:XXX.let us think step by step","response":"图片中手写体内容是:(数学公式用latex公式表达)\n\n"+str(label),"images":[file_path]}
```
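For reference, a small sketch (function name, label, and path are hypothetical placeholders) that serializes one sample in this `query`/`response`/`images` layout as a JSONL line:

```python
import json

def make_sample(query: str, label: str, image_path: str) -> str:
    """Serialize one training sample in the query/response/images layout."""
    record = {
        "query": query,
        "response": "图片中手写体内容是:" + label,  # prefix matches the format above
        "images": [image_path],
    }
    # ensure_ascii=False keeps the Chinese prompt readable in the JSONL file.
    return json.dumps(record, ensure_ascii=False)

line = make_sample("请识别图片中的手写内容", r"\frac{1}{2}", "data/img_0001.png")
print(line)
```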

Jintao-Huang commented 6 months ago

Which model are you using?

zhangfan-algo commented 6 months ago

internvl-chat-v1_5

hjh0119 commented 6 months ago

Could you share the sft command and a data sample?

hjh0119 commented 6 months ago

`--max_length 1024` is too small; the ViT image embeddings alone usually exceed 1024 tokens.

hjh0119 commented 6 months ago

I recommend setting it to at least 2048.
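As a rough sanity check (the numbers below are illustrative ViT arithmetic, not necessarily InternVL-Chat-V1.5's exact configuration): a single 448×448 image split into 14-pixel patches already yields 1024 patch embeddings, so with `--max_length 1024` there is no budget left for any text tokens:

```python
def vit_patch_tokens(image_size: int, patch_size: int) -> int:
    # A square ViT input of side `image_size` with `patch_size`-pixel
    # patches produces (image_size // patch_size) ** 2 patch embeddings.
    return (image_size // patch_size) ** 2

# Illustrative numbers: one 448x448 tile with 14px patches.
print(vit_patch_tokens(448, 14))  # 1024 -- the whole budget, before any text
```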

zhangfan-algo commented 6 months ago

OK, I'll give it a try.

zhangfan-algo commented 6 months ago

It works now, thanks. One more question: do we currently support saving checkpoints per epoch?

hjh0119 commented 6 months ago

> It works now, thanks. One more question: do we currently support saving checkpoints per epoch?

@Jintao-Huang

tastelikefeet commented 6 months ago

Not currently supported; only step-based saving is available. As a workaround, you can set `--save_steps` to match the number of steps in one epoch.
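A sketch of the suggested workaround (parameter names here are generic, not swift flags): compute how many optimizer steps make up one epoch and pass that value as `--save_steps`:

```python
import math

def steps_per_epoch(num_samples: int, per_device_batch: int,
                    grad_accum: int, world_size: int) -> int:
    # One optimizer step consumes an effective global batch of
    # per_device_batch * grad_accum * world_size samples.
    effective_batch = per_device_batch * grad_accum * world_size
    return math.ceil(num_samples / effective_batch)

# e.g. 100k samples, batch 2 per GPU, grad-accum 16, 8 GPUs -> 391 steps
print(steps_per_epoch(100_000, 2, 16, 8))
```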

zhangfan-algo commented 6 months ago

> Uninstall it completely and reinstall.

This virtual environment never had swift installed before. I re-ran it and still hit the error.