jingyaogong / minimind

[LLM] Train a small 26M-parameter GPT completely from scratch in 3 hours; inference and training both run on a personal GPU!
https://jingyaogong.github.io/minimind
Apache License 2.0
2.7k stars 329 forks

Problem with 5-dpo_train.py #53

Closed cqcracked closed 1 month ago

cqcracked commented 1 month ago

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
device = 'cuda:0'

Changing the 0 to 1 in both of these lines causes an error. Yet running torchrun --nproc_per_node 2 3-full_sft.py works without any problem.

Error message:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

jingyaogong commented 1 month ago


os.environ['CUDA_VISIBLE_DEVICES'] = '1'
device = 'cuda:0'

So this is how it should be set.

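The Sun may actually be the 10086th star in the universe, yet it is the 0th one that is VISIBLE to humans, the one they think of as number 0.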
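
The renumbering behind this answer can be sketched in plain Python, with no GPU required. CUDA_VISIBLE_DEVICES hides every GPU not listed and renumbers the remaining ones from 0, so 'cuda:N' always refers to the N-th visible device. The helper name physical_gpu is hypothetical and not part of minimind or PyTorch, and the model is simplified: it assumes the variable holds only comma-separated integer ids that all exist on the machine.

```python
def physical_gpu(visible_devices: str, logical_index: int) -> int:
    """Hypothetical helper: map logical 'cuda:<logical_index>' to a physical GPU id.

    Simplified model: assumes CUDA_VISIBLE_DEVICES is a comma-separated list of
    integer ids that all exist on the machine (UUIDs are not handled).
    """
    physical_ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if not 0 <= logical_index < len(physical_ids):
        # Mirrors the situation in which CUDA reports "invalid device ordinal".
        raise IndexError(
            f"invalid device ordinal: cuda:{logical_index} requested, "
            f"but only {len(physical_ids)} device(s) visible"
        )
    return physical_ids[logical_index]


if __name__ == "__main__":
    print(physical_gpu("1", 0))    # physical GPU 1 is exposed as cuda:0
    print(physical_gpu("0,1", 1))  # with both GPUs visible, cuda:1 is physical GPU 1
    try:
        physical_gpu("1", 1)       # the failing combination from the question
    except IndexError as err:
        print(err)
```

With CUDA_VISIBLE_DEVICES = '1' only one device is visible, so the only valid logical index is 0, which is why device = 'cuda:1' fails. In a real script the environment variable also has to be set before CUDA is initialized in the process, otherwise it has no effect.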