the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors.

openai / consistency_models

Official repo for consistency models.

MIT License


the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors. #63

Open kuailexiaohunzi opened 4 weeks ago

kuailexiaohunzi commented 4 weeks ago

When training in CT mode, the following errors occur. Does anyone know how to solve them?

RICKand-MORTY commented 3 weeks ago

Maybe the PyTorch or CUDA version is incorrect.

kuailexiaohunzi commented 3 weeks ago

> Maybe the PyTorch or CUDA version is incorrect.

PyTorch is 1.13 and CUDA is 11.7, so the versions match.
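A minimal sketch for reporting the versions being compared here. The helper name `describe_torch_env` is hypothetical; it only reads the version attributes that PyTorch exposes and does not prove that the installed driver is compatible.

```python
def describe_torch_env():
    """Return the installed PyTorch version and the CUDA version it was built with."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch": None, "cuda": None, "cuda_available": False}
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_env())
```

Running this in the same environment that launches training makes it easier to confirm that the training process sees the PyTorch build you expect.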

RICKand-MORTY commented 3 weeks ago

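Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to your own GPU count, and the 4 in `mpiexec -n 4` on the command line also has to be replaced with your GPU count.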
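To illustrate why the two numbers need to agree, here is a small sketch that assumes each MPI rank picks its GPU by taking the rank modulo the configured per-node GPU count. The function name `device_index_for_rank` is hypothetical and this is not confirmed to be the repository's exact code.

```python
def device_index_for_rank(rank: int, gpus_per_node: int) -> int:
    """Map an MPI rank to a local GPU index (assumed modulo scheme)."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return rank % gpus_per_node


if __name__ == "__main__":
    # With 8 configured but only 4 GPUs present, rank 5 asks for GPU 5,
    # which does not exist on the machine.
    print(device_index_for_rank(5, 8))  # 5
    # With the count set to the real number of GPUs, rank 5 wraps to GPU 1.
    print(device_index_for_rank(5, 4))  # 1
```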

kuailexiaohunzi commented 3 weeks ago

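> Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to your own GPU count, and the 4 in `mpiexec -n 4` on the command line also has to be replaced with your GPU count.

No, it is a single GPU. I am not even using the `mpiexec -n` command.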

RICKand-MORTY commented 3 weeks ago

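Try adding the environment variable RDMAV_FORK_SAFE and see whether that helps. It may be that forking child processes directly is blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init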
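A minimal sketch of setting the variable from Python. The helper name `enable_fork_safe` and the value `"1"` are assumptions. The call is placed before any MPI or RDMA-backed library is imported, since a variable set after the library has initialized may have no effect; whether that ordering matters in this particular case is not confirmed in this thread.

```python
import os


def enable_fork_safe(env=None):
    """Set RDMAV_FORK_SAFE if it is not already set and return its value."""
    env = os.environ if env is None else env
    # setdefault keeps any value the user already exported in the shell.
    env.setdefault("RDMAV_FORK_SAFE", "1")
    return env["RDMAV_FORK_SAFE"]


if __name__ == "__main__":
    enable_fork_safe()
    # Import MPI / training modules only after this point, for example:
    # from mpi4py import MPI
    print(os.environ["RDMAV_FORK_SAFE"])
```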

kuailexiaohunzi commented 3 weeks ago

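> Try adding the environment variable RDMAV_FORK_SAFE and see whether that helps. It may be that forking child processes directly is blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

OK, I will try it later.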

kuailexiaohunzi commented 2 weeks ago

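> Try adding the environment variable RDMAV_FORK_SAFE and see whether that helps. It may be that forking child processes directly is blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

I added it in the cm.train file, but it still does not work. The same error is reported.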

RICKand-MORTY commented 2 weeks ago

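> > Try adding the environment variable RDMAV_FORK_SAFE and see whether that helps. It may be that forking child processes directly is blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
>
> I added it in the cm.train file, but it still does not work. The same error is reported.

Add it in /etc/profile instead, as a system-wide environment variable.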

kuailexiaohunzi commented 2 weeks ago

Ah, OK.

RICKand-MORTY commented 2 weeks ago

> Add it in /etc/profile instead, as a system-wide environment variable.

Remember to reload it with `source` after saving.
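Because /etc/profile is only read by login shells, an already open terminal will not see the new variable until the file is sourced or a new login session is started. A quick way to check that a launched process actually inherits it is sketched below; the helper name `child_sees_variable` is hypothetical.

```python
import os
import subprocess
import sys


def child_sees_variable(name, env=None):
    """Start a child Python process and return the value it sees for `name`.

    Returns None when the variable is unset (or empty) in the child.
    """
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,  # None means: inherit the current environment
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout or None


if __name__ == "__main__":
    print(child_sees_variable("RDMAV_FORK_SAFE"))
```

If this prints `None` from the shell used to start training, the variable has not reached that shell yet.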

kuailexiaohunzi commented 2 weeks ago

OK, thanks.