intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

Finetuning on Ray and CPU causes Runtime error #242

Open premdass opened 3 months ago

premdass commented 3 months ago

Ray version: 2.10
llm-on-ray: latest from main branch
Command used to run: llm_on_ray-finetune --config_file llm-on-ray/llm_on_ray/finetune/finetune.yaml

RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed

harborn commented 3 months ago

Maybe you need to set the oneCCL environment variables. Just call:

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
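If it helps, a quick sanity check after sourcing (the variable names below are just the ones setvars.sh typically exports, so treat them as assumptions) is to confirm the variables are visible in the same shell that launches Ray:

env | grep -E 'CCL|FI_' | sort    # should list CCL_* and FI_* entries if setvars.sh took effect
python -c "import oneccl_bindings_for_pytorch; print('torch-ccl import OK')"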
premdass commented 3 months ago

The oneCCL environment has been sourced correctly before Ray starts (I can see it in the worker startup logs).

harborn commented 3 months ago

Which version of oneccl-bind-pt did you install? Here is the version I'm using:

oneccl-bind-pt              2.2.0+cpu
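(For reference, one way to check the installed version inside the worker image is via pip metadata; this is a sketch assuming the wheel is registered with pip under that name:)

pip show oneccl-bind-pt | grep -i version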
KepingYan commented 3 months ago

Hi @premdass, is the Ray cluster started on a single node or on multiple nodes? Also, could you remove these two parameters

"FI_TCP_IFACE": "lo",
"FI_PROVIDER": "tcp",

in llm_on_ray/finetune/finetune.py and try again?
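(For anyone looking for them, the two entries can be located in the file mentioned above with a simple grep:)

grep -n -E '"FI_TCP_IFACE"|"FI_PROVIDER"' llm_on_ray/finetune/finetune.py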

premdass commented 3 months ago

@harborn @KepingYan: Thanks for responding. Please find the environment details:

oneccl-bind-pt = 2.2.0+cpu
Ray = 2.10
K8s = 1.29

I have run finetune.py without the FI_TCP_IFACE and FI_PROVIDER params and am still seeing the same runtime error. All the ports between the worker nodes are open as well.

I enabled CCL debug logging and am seeing the error below on the worker nodes:

2024:06:04-01:54:41:( 3387) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.141.71
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 0
2024:06:04-01:54:41:( 3387) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fc4543bca80
2024:06:04-01:56:55:( 3161) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fc4543bf040

premdass commented 3 months ago

Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

xwu99 commented 3 months ago

> Just to add context, I am running this in a Kubernetes/container environment, so the Ray workers are pods/containers. Do I have to do something specific to enable MPI in a container environment?

It looks like the error is that the oneCCL transport failed to init. The finetuning code works on physical nodes, so maybe the network interfaces are different in K8S and that is what causes the oneCCL failure. Could you try another FI_PROVIDER? Otherwise, this could be an issue specific to oneCCL running on K8S.
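(For example, a sketch, since the available providers depend on what your libfabric build ships: list what libfabric sees inside the container, then pin one before launching.)

fi_info -l                  # list the libfabric providers available in the container (needs the fi_info utility)
export FI_PROVIDER=sockets  # example: pick one of the providers reported above
export FI_LOG_LEVEL=warn    # raise libfabric logging if more detail is needed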

@premdass could you share the full CCL_DEBUG log so that we can check what happened when oneCCL was initialized?

@mshiryaev Hi, is this something known to you? Do you know if torch-ccl needs special configuration on K8S?

xwu99 commented 3 months ago

@premdass I tried this on my local K8S and it doesn't fail the way yours does. Could you set "FI_PROVIDER": "tcp" and remove only "FI_TCP_IFACE": "lo"? Please make sure to rebuild the Docker images so they include the code updates for K8S.

Could you share your full log with CCL_DEBUG enabled so that we can see which interface and provider were selected? How many network interfaces do you have for each container?
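(For reference, the interfaces inside a worker pod can be listed like this; the pod name is a placeholder and the image needs iproute2:)

kubectl exec -it <ray-worker-pod> -- ip -o addr show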

premdass commented 2 months ago

Apologies for the delayed response @xwu99. I have enabled debug logs for CCL and set tcp as FI_PROVIDER. Below are the logs when I grep for ccl on the Ray worker nodes:

2024:06:25-09:23:14:( 5098) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2024:06:25-09:23:14:( 5098) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:06:25-09:23:14:( 5098) |CCL_INFO| process launcher: hydra, local_proc_idx: -1, local_proc_count: -1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| ofi_api_wrapper.cpp:38 ofi_api_init: OFI lib path: libfabric.so.1
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| mpi_api_wrapper.cpp:40 mpi_api_init: MPI lib path: libmpi.so.12
2024:06:25-09:23:14:( 5098) |CCL_INFO| OS info: { Linux ray-cpu-cluster-train-kuberay-worker-workergroup-8j2mb 5.10.218-208.862.amzn2.x86_64 #1 SMP Tue Jun 4 16:52:10 UTC 2024 x86_64 }
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| datatype.cpp:69 ccl_datatype_storage: create datatype_storage
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs_server.hpp:66 put: read/write error: Broken pipe
2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:108 kvs_get_value_by_name_key: client: get_value
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:319 get_local_kvs_id: failed to get local kvs id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| pmi_resizable_simple_internal.cpp:65 pmrt_init: failed to get local id
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:268 init_transport: pmi init failed
2024:06:25-09:25:24:( 5098) |CCL_ERROR| atl_ofi_comm.cpp:79 atl_ofi_comm: condition init_transport(true) == ATL_STATUS_SUCCESS failed
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| communicator_impl.hpp:115 create_communicator: size 2, rank 1
2024:06:25-09:25:24:( 5098) |CCL_DEBUG| atl_ofi_comm.cpp:265 init_transport: init atl, requested ep_count 1
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| buffer_cache.cpp:60 clear: clear buffer cache: size: 0
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| ofi_api_wrapper.cpp:48 ofi_api_fini: close OFI lib: handle: 0x7fab7ee19d20
2024:06:25-09:25:26:( 4890) |CCL_DEBUG| mpi_api_wrapper.cpp:50 mpi_api_fini: close MPI lib: handle: 0x7fab7ee1d2d0

premdass commented 2 months ago

A bit more detail to add: I have 2 interfaces, lo and eth0, and I tried with both names and end up with a similar error. I am trying to run distributed training with 2 Ray worker nodes. Does it need any entries in a hostfile or something to find the other worker node?
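(From the kvs_init lines in the log above, peer discovery appears to go through oneCCL's internal KVS rather than a hostfile, so a hostfile should not normally be needed. As a sketch, assuming eth0 is the routable pod interface, one thing to try is pinning that interface and raising the oneCCL log level on every worker before launching:)

export CCL_LOG_LEVEL=debug   # more oneCCL detail in the worker logs
export FI_PROVIDER=tcp
export FI_TCP_IFACE=eth0     # use the routable pod interface instead of lo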

xwu99 commented 2 months ago

> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| hwloc_wrapper.cpp:69 ccl_hwloc_wrapper: hwloc root object: type: Machine
> 2024:06:25-09:23:14:( 5098) |CCL_DEBUG| internal_kvs.cpp:323 fill_local_host_ip: use ipv4: 100.64.183.160
> 2024:06:25-09:25:24:( 5098) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (130) >= limit (120)

This shows oneCCL doesn't initialize correctly due to a connection timeout. It seems to be a network problem. How do you set up your Ray cluster on K8S? The KubeRay project helps set up a Ray cluster properly on K8S.
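(As a quick sanity check on the KubeRay side, with placeholder pod names, one can confirm the worker pods are running and can reach each other:)

kubectl get rayclusters                                          # the RayCluster created by KubeRay
kubectl get pods -o wide                                         # worker pod IPs and the nodes they landed on
kubectl exec -it <worker-pod-a> -- ping -c 2 <worker-pod-b-ip>   # basic pod-to-pod reachability (ping must exist in the image)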

premdass commented 2 months ago

KubeRay is being used to set up the clusters in this case. I need to dig into why CCL cannot init. Any pointers, please?