intel / torch-ccl

oneCCL Bindings for Pytorch*
BSD 3-Clause "New" or "Revised" License

Trouble using torch-ccl with the mlx provider #67

Open mwheinz opened 5 months ago

mwheinz commented 5 months ago

We've had success using torch-ccl with ResNet and other AI workloads to test libfabric over psm3. But when we try to use libmlx-fi.so, torch-ccl does not seem to see it, even after the provider has been copied into the provider directory.
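One way to narrow this down is to check what is actually in the provider directory and to ask libfabric for verbose logging while it loads providers. The snippet below is only a rough sketch: `find_provider_libs` and `debug_env` are hypothetical helper names, the `lib*-fi.so` pattern is assumed from the file names mentioned here, and the default directory is a placeholder. `FI_PROVIDER_PATH`, `FI_PROVIDER`, and `FI_LOG_LEVEL` are libfabric environment variables.

```python
import os
from pathlib import Path


def find_provider_libs(provider_dir):
    """Return sorted names of files that look like libfabric DSO providers.

    Assumption: provider libraries follow the lib<name>-fi.so naming pattern.
    """
    return sorted(p.name for p in Path(provider_dir).glob("lib*-fi.so"))


def debug_env(provider_dir, provider):
    """Build an environment that points libfabric at one provider with debug logging."""
    env = dict(os.environ)
    env["FI_PROVIDER_PATH"] = str(provider_dir)  # directory searched for DSO providers
    env["FI_PROVIDER"] = provider                # restrict selection to this provider
    env["FI_LOG_LEVEL"] = "debug"                # verbose libfabric logging
    return env


if __name__ == "__main__":
    # Placeholder path: substitute the directory the provider was copied into.
    provider_dir = os.environ.get("FI_PROVIDER_PATH", "/usr/lib64/libfabric")
    print(find_provider_libs(provider_dir))
```

Passing the result of `debug_env(provider_dir, "mlx")` as the `env` argument of `subprocess.run` when launching the training script should make libfabric log which providers it loads or rejects. This only helps confirm whether the file is being picked up; it does not change what the provider supports.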

Is this a known limitation of torch-ccl? Is there a makefile we need to modify?

TIA.

ddkalamk commented 5 months ago

@mwheinz torch-ccl doesn't work with the mlx provider. I think the issue is that oneCCL needs thread-multiple capability to use multiple workers, and the mlx provider doesn't support it, so it fails at the init call itself.