We've had success using torch-ccl with ResNet and other AI workloads to test libfabric over psm3, but when we try to use libmlx-fi.so, torch-ccl does not seem to see it, even after the provider has been copied into the provider directory.
Is this a known limitation of torch-ccl? Is there a makefile we need to modify?
@mwheinz torch-ccl doesn't work with the MLX provider. I think the issue is that oneCCL needs thread-multiple capability to use multiple workers, and the MLX provider doesn't support it, so it fails at the init call itself.
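If it helps to narrow this down, here is a rough diagnostic sketch, not a fix. It builds an environment that pins libfabric to one provider and turns up logging, then asks the `fi_info` utility whether libfabric itself can see that provider. The helper names `build_diag_env` and `provider_visible` are made up for illustration. Setting `CCL_WORKER_COUNT=1` is only an experiment based on the explanation above; I have not verified that it gets past the init failure with MLX. The check also assumes `fi_info` exits nonzero when no matching provider is found.

```python
import os
import shutil
import subprocess


def build_diag_env(provider, provider_path=None, base=None):
    """Return a copy of the environment set up for provider debugging."""
    env = dict(os.environ if base is None else base)
    env["FI_PROVIDER"] = provider               # restrict libfabric to this provider
    if provider_path:
        env["FI_PROVIDER_PATH"] = provider_path  # directory holding the *-fi.so files
    env["FI_LOG_LEVEL"] = "debug"               # verbose libfabric logging
    env["CCL_ATL_TRANSPORT"] = "ofi"            # make oneCCL use the OFI transport
    env["CCL_LOG_LEVEL"] = "debug"              # verbose oneCCL logging
    env["CCL_WORKER_COUNT"] = "1"               # assumption: a single worker, untested with MLX
    return env


def provider_visible(provider, env):
    """Ask fi_info whether libfabric reports the provider.

    Returns None if fi_info is not on PATH, otherwise True or False.
    """
    fi_info = shutil.which("fi_info", path=env.get("PATH"))
    if fi_info is None:
        return None
    result = subprocess.run(
        [fi_info, "-p", provider],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Something like `provider_visible("mlx", build_diag_env("mlx", "/path/to/providers"))` (the path is a placeholder) separates "libfabric cannot load the provider" from "oneCCL rejects it at init". The same environment can then be passed to the training job to capture the debug logs.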