oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

device_dir is hard-coded to /dev/dri/by-path/ #121

Open zhouyu5 opened 1 month ago

zhouyu5 commented 1 month ago

Hi developers, I found that the variable device_dir is hard-coded to /dev/dri/by-path/ (see code here). In most cases this is not a problem, but in some environments it does not work.

Take my case as an example. I set up the environment in a Docker container, which I start with the following command: docker run --device=/dev/dri .... After launching training, I hit this error: RuntimeError: oneCCL: ze_fd_manager.cpp:143 init_device_fds: EXCEPTION: opendir failed: could not open device directory. The cause is that device_dir is hard-coded to /dev/dri/by-path/, but the container only maps /dev/dri from the host machine and not the by-path subfolder. Since /dev/dri/by-path/ does not exist inside the container, opendir fails.

I am not sure whether I have explained this clearly. Could you please share your thoughts on the problem?
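
The failure mode described above can be checked in isolation. The following is a minimal sketch, not the oneCCL code: try_open_dir is a hypothetical helper that calls POSIX opendir on a path and returns the system error message if the directory cannot be opened. Running it inside the container against /dev/dri/ and /dev/dri/by-path/ should show which of the two is actually present.

```cpp
#include <dirent.h>

#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper for illustration only (not part of oneCCL).
// Returns an empty string if `path` can be opened as a directory,
// otherwise the strerror() text for the errno set by opendir.
std::string try_open_dir(const std::string& path) {
    DIR* dir = opendir(path.c_str());
    if (dir == nullptr) {
        // Typical causes: ENOENT (path missing) or EACCES (no permission).
        return std::strerror(errno);
    }
    closedir(dir);
    return "";
}
```

For example, try_open_dir("/dev/dri/by-path/") returning a non-empty message while try_open_dir("/dev/dri/") returns an empty string would match the situation reported here.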

nikitaxgusev commented 4 weeks ago

Hello @zhouyu5, thanks. Your case makes sense to us. We would like to add an environment variable through which you can specify the path to device_dir; by default it will remain the standard path, /dev/dri/by-path/. That should address your issue.
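
An override of this kind could look roughly like the sketch below. This is only an illustration under assumptions, not the planned implementation: the environment variable name EXAMPLE_ZE_DEVICE_DIR and the function resolve_device_dir are hypothetical, and the only detail taken from this thread is the default path /dev/dri/by-path/.

```cpp
#include <cstdlib>
#include <string>

// Default path as stated in this issue.
constexpr const char* kDefaultDeviceDir = "/dev/dri/by-path/";
// Hypothetical variable name, chosen for this example only.
constexpr const char* kDeviceDirEnvVar = "EXAMPLE_ZE_DEVICE_DIR";

// Returns the directory to scan for device nodes: the value of the
// environment variable if it is set and non-empty, else the default.
std::string resolve_device_dir() {
    const char* value = std::getenv(kDeviceDirEnvVar);
    if (value == nullptr || *value == '\0') {
        return kDefaultDeviceDir;
    }
    std::string dir(value);
    // Normalize to a trailing slash so file names can be appended directly.
    if (dir.back() != '/') {
        dir.push_back('/');
    }
    return dir;
}
```

With a scheme like this, a container that only maps /dev/dri could set the variable to /dev/dri, while existing setups would keep the current behavior without any change.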

zhouyu5 commented 4 weeks ago

Thanks, sounds great.