lintangsutawika opened this issue 2 years ago
Looks like the coordinator address isn't set properly:
coordinator_address: :29500
process_count: 2
process_index: 0
Could you recheck the ${SLURM_LAUNCH_NODE_IPADDR} value in your env? (It's getting picked up correctly for process_index=1 but not for process_index=0.)
Yeah, looks like SLURM_LAUNCH_NODE_IPADDR isn't the machine's actual address. I tested with "127.0.0.1:29500" as the coordinator_address, and running the following no longer stops at a segmentation fault:
singularity exec --nv --bind /fsx:/fsx /fsx/lintangsutawika/t5x-env.sif \
python ${T5X_DIR}/t5x/train.py \
--gin_search_paths=${PROJECT_DIR} \
--gin_file="config-base.gin" \
--gin.MODEL_DIR=\"${MODEL_DIR}\" \
--gin.USE_CACHED_TASKS=False \
--alsologtostderr \
--multiprocess_gpu \
--coordinator_address="${ADDR}" \
--process_count "${SLURM_NPROCS}" \
--process_index 0 \
& \
singularity exec --nv --bind /fsx:/fsx /fsx/lintangsutawika/t5x-env.sif \
python ${T5X_DIR}/t5x/train.py \
--gin_search_paths=${PROJECT_DIR} \
--gin_file="config-base.gin" \
--gin.MODEL_DIR=\"${MODEL_DIR}\" \
--gin.USE_CACHED_TASKS=False \
--alsologtostderr \
--multiprocess_gpu \
--coordinator_address="${ADDR}" \
--process_count "${SLURM_NPROCS}" \
--process_index 1
However, now I see that the script only detects 1 GPU per process? Also, a new error.
/usr/local/lib/python3.8/site-packages/jax/_src/lib/xla_bridge.py:556: UserWarning: jax.host_count has been renamed to jax.process_count. This alias will eventually be removed; please update your code.
warnings.warn(
I1012 18:00:11.249953 140565072195584 partitioning.py:331] global_mesh axis_names: ('data', 'model')
I1012 18:00:11.250089 140565072195584 partitioning.py:332] global_mesh devices: [[GpuDevice(id=0, process_index=0) GpuDevice(id=1, process_index=1)]]
2022-10-12 18:00:12.262847: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 1 failed: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: unhandled cuda error
2022-10-12 18:00:12.272575: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: unhandled cuda error
Traceback (most recent call last):
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 749, in <module>
gin_utils.run(main)
File "/fsx/lintangsutawika/t5x/t5x/gin_utils.py", line 107, in run
app.run(
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 710, in main
_main(argv)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 745, in _main
train_using_gin()
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 251, in train
train_iter = get_dataset_fn(train_dataset_cfg, ds_shard_id, num_ds_shards,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1371, in get_dataset
return get_dataset_inner(cfg, shard_info, feature_converter_cls, seed,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1385, in get_dataset_inner
multihost_assert_equal(
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 566, in multihost_assert_equal
multihost_utils.assert_equal(input_tree, fail_message)
File "/usr/local/lib/python3.8/site-packages/jax/experimental/multihost_utils.py", line 169, in assert_equal
expected = broadcast_one_to_all(in_tree)
File "/usr/local/lib/python3.8/site-packages/jax/experimental/multihost_utils.py", line 75, in broadcast_one_to_all
in_tree = jax.device_get(_psum(in_tree))
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: unhandled cuda error
In call to configurable 'train' (<function train at 0x7f2f3dd08670>)
Could you add the env variable NCCL_DEBUG=INFO before the python command and run that? That'd let us see some debug info from NCCL.
However, now I see that the script only detects 1 GPU per process?

That's because jax.distributed.initialize was recently modified to use 1 GPU per process as the default setting. To let each process see all the GPUs on its node, you can pass an additional argument local_device_ids=list(range(num_gpus)) to jax.distributed.initialize.
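For reference, a minimal sketch of that call (the values here are placeholders, not taken from this setup):

import jax

# Placeholder values; in t5x these come from the --coordinator_address,
# --process_count and --process_index flags.
jax.distributed.initialize(
    coordinator_address="127.0.0.1:29500",
    num_processes=2,
    process_id=0,
    # Without this, recent JAX defaults to exposing 1 GPU per process;
    # list(range(8)) makes all 8 GPUs on the node visible to this process.
    local_device_ids=list(range(8)),
)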
Looks like it's an out-of-memory issue? I reduced the batch size to 1 but the problem still persists.
gpu-st-p4d-24xlarge-25:15647:15647 [0] external/nccl_archive/src/enqueue.cc:128 NCCL WARN Cuda failure 'out of memory'
gpu-st-p4d-24xlarge-25:15647:15647 [0] NCCL INFO Bootstrap : Using eth0:172.31.224.66<0>
gpu-st-p4d-24xlarge-25:15647:15647 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu-st-p4d-24xlarge-25:15647:15647 [0] NCCL INFO cudaDriverVersion 11040
NCCL version 2.13.4+cudaCUDA_MAJOR.CUDA_MINOR
gpu-st-p4d-24xlarge-25:15647:15647 [0] external/nccl_archive/src/init.cc:1075 NCCL WARN Cuda failure 'out of memory'
gpu-st-p4d-24xlarge-25:15647:15647 [0] NCCL INFO external/nccl_archive/src/init.cc:1106 -> 1
2022-10-13 04:16:09.849660: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: unhandled cuda error
Traceback (most recent call last):
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 754, in <module>
gin_utils.run(main)
File "/fsx/lintangsutawika/t5x/t5x/gin_utils.py", line 107, in run
app.run(
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 710, in main
_main(argv)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 750, in _main
train_using_gin()
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 251, in train
train_iter = get_dataset_fn(train_dataset_cfg, ds_shard_id, num_ds_shards,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1371, in get_dataset
return get_dataset_inner(cfg, shard_info, feature_converter_cls, seed,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1385, in get_dataset_inner
multihost_assert_equal(
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 566, in multihost_assert_equal
multihost_utils.assert_equal(input_tree, fail_message)
File "/usr/local/lib/python3.8/site-packages/jax/experimental/multihost_utils.py", line 175, in assert_equal
expected = broadcast_one_to_all(in_tree)
File "/usr/local/lib/python3.8/site-packages/jax/experimental/multihost_utils.py", line 75, in broadcast_one_to_all
in_tree = jax.device_get(_psum(in_tree))
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: unhandled cuda error
In call to configurable 'train' (<function train at 0x7fd683eab9d0>)
Ack. This looks like a familiar issue. Could you try a few workarounds (WARs) while we figure out a fix for this:
1. Add the following to the t5x/train.py file:
...
import tensorflow as tf
tf.config.experimental.set_visible_devices([], "GPU")
...
2. Run with XLA_PYTHON_CLIENT_MEM_FRACTION=.XX set to a value less than 0.9 (which is the default).
Btw, which GPU are you running T5X on?
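Expanding on those two workarounds, a rough sketch of how they could sit together near the top of a training script (the .8 fraction and the exact placement are illustrative; the idea is that hiding GPUs from TensorFlow and lowering JAX's preallocation leaves GPU memory free for NCCL's buffers):

import os

# Workaround 2: lower JAX/XLA's GPU memory preallocation from the default
# 0.9 so NCCL has room left to allocate its own buffers (.8 is illustrative).
os.environ.setdefault("XLA_PYTHON_CLIENT_MEM_FRACTION", ".8")

# Workaround 1: hide GPUs from TensorFlow so the tf.data input pipeline
# doesn't reserve GPU memory of its own.
import tensorflow as tf
tf.config.experimental.set_visible_devices([], "GPU")

import jax  # imported after the env var is set so the GPU backend sees it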
I'm using A100s (40GB). I added workaround 1, but the error still persists. Step 2 seems to alleviate the previous issue but stops at a new error:
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO cudaDriverVersion 11080
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO Bootstrap : Using eth0:172.31.224.130<0>
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO P2P plugin IBext
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/IB : No device found.
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/IB : No device found.
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO NET/Socket : Using [0]eth0:172.31.224.130<0> [1]eth1:172.31.236.189<0> [2]eth2:172.31.232.164<0> [3]eth3:172.31.236.136<0>
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO Using network Socket
gpu-st-p4d-24xlarge-78:41557:41557 [0] external/nccl_archive/src/init.cc:511 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 101c0
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO external/nccl_archive/src/init.cc:1045 -> 5
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO external/nccl_archive/src/init.cc:1091 -> 5
gpu-st-p4d-24xlarge-78:41557:41557 [0] NCCL INFO external/nccl_archive/src/init.cc:1106 -> 5
2022-10-13 05:20:09.232905: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 1 failed: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: invalid usage
Traceback (most recent call last):
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 755, in <module>
gin_utils.run(main)
File "/fsx/lintangsutawika/t5x/t5x/gin_utils.py", line 107, in run
app.run(
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 711, in main
_main(argv)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 751, in _main
train_using_gin()
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/fsx/home-lintangsutawika/.local/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/fsx/lintangsutawika/architecture-objective/t5x/train.py", line 252, in train
train_iter = get_dataset_fn(train_dataset_cfg, ds_shard_id, num_ds_shards,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1371, in get_dataset
return get_dataset_inner(cfg, shard_info, feature_converter_cls, seed,
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 1385, in get_dataset_inner
multihost_assert_equal(
File "/fsx/lintangsutawika/t5x/t5x/utils.py", line 566, in multihost_assert_equal
multihost_utils.assert_equal(input_tree, fail_message)
File "/usr/local/lib/python3.8/dist-packages/jax/experimental/multihost_utils.py", line 175, in assert_equal
expected = broadcast_one_to_all(in_tree)
File "/usr/local/lib/python3.8/dist-packages/jax/experimental/multihost_utils.py", line 75, in broadcast_one_to_all
in_tree = jax.device_get(_psum(in_tree))
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:266: NCCL operation ncclCommInitRank(comm.get(), nranks, id, rank) failed: invalid usage
To confirm, you're still running 2 processes on 2 nodes with ntasks-per-node=1, and without any local_device_ids argument to jax.distributed.initialize?
Btw, this error is new! Could you share the gin config that you're using? It'd help me repro this on my end.
Next quick check - could you run the code with nodes=1 and ntasks-per-node=8 (let's see multiprocess behaviour on a single node)?
Also, OOC, you seem to be using SLURM and then launching two processes on the same node in the comment above? If you use an srun command, that'd launch the command on two separate nodes (processes). Is singularity exec doing something similar?
The new error seems to be a matter of running the different processes on the same set of GPUs.
Running this works (two processes on the same node, but with different sets of GPUs). So the issue is likely, as you said, on my side in how to properly assign each process to the correct node.
export ADDR="127.0.0.1:29500"
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO \
singularity exec --nv --bind /fsx:/fsx /fsx/lintangsutawika/t5x-env.sif \
python ${T5X_DIR}/t5x/train.py \
--gin_search_paths=${PROJECT_DIR} \
--gin_file="config-base.gin" \
--gin.MODEL_DIR=\"${MODEL_DIR}\" \
--gin.USE_CACHED_TASKS=False \
--alsologtostderr \
--multiprocess_gpu \
--coordinator_address="${ADDR}" \
--process_count 2 \
--process_index 0 \
& \
CUDA_VISIBLE_DEVICES=4,5,6,7 NCCL_DEBUG=INFO \
singularity exec --nv --bind /fsx:/fsx /fsx/lintangsutawika/t5x-env.sif \
python ${T5X_DIR}/t5x/train.py \
--gin_search_paths=${PROJECT_DIR} \
--gin_file="config-base.gin" \
--gin.MODEL_DIR=\"${MODEL_DIR}\" \
--gin.USE_CACHED_TASKS=False \
--alsologtostderr \
--multiprocess_gpu \
--coordinator_address="${ADDR}" \
--process_count 2 \
--process_index 1
Closing this as the jax-related issue seems solved for now.
Thanks sudhakarsingh27
I'd like to point out that although this worked by adding CUDA_VISIBLE_DEVICES, it's not the recommended way anymore, given that jax.distributed.initialize provides an argument, local_device_ids, to specify the GPUs. (It'd be great if you could confirm that your code works with the recommended way as well.)
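As a hypothetical illustration of that recommended path, splitting one 8-GPU node across two processes without CUDA_VISIBLE_DEVICES could look like this:

import jax

process_id = 0        # 0 or 1; would normally come from SLURM or a flag
gpus_per_process = 4  # 8 GPUs on the node split across 2 processes

jax.distributed.initialize(
    coordinator_address="127.0.0.1:29500",
    num_processes=2,
    process_id=process_id,
    # Process 0 claims GPUs 0-3 and process 1 claims GPUs 4-7, so the two
    # processes never land on the same devices.
    local_device_ids=list(range(process_id * gpus_per_process,
                                (process_id + 1) * gpus_per_process)),
)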
I can confirm. I experimented with splitting 1 node of 8 GPUs into 2 processes with 4 GPUs each using the method you mention. The issue was on the SLURM side.
However, I found a related issue where it seems the more nodes I use, the slower the throughput is.
I've tested running on 2, 4, and 8 nodes (each node has 8 A100s). I also varied the number of processes (such that increasing the number of processes reduces the number of GPUs per process). The trend seems to be that, for a fixed batch size, increasing the number of nodes reduces steps per second.
I'm using T5X, but this feels like it might be an issue more relevant to JAX. Could it be a misconfiguration?
Hi. I'm working with Lintang on this issue.
The real problem isn't that it slows down when we increase the number of nodes while keeping the total number of GPUs fixed, since that is likely due to slower gradient aggregation across nodes, which is expected. It's that a significant slowdown occurs when we run with Singularity on two nodes with 16 A100s in total, compared with running without Singularity on one node with 8 A100s in total.
We get 1.6 steps/sec in the former case and 4 steps/sec in the latter case. In both cases the global batch size is 256, and we're using base-sized T5. Lintang verified that, in the former case, the input batch is split into two batches of size 128, so we don't have issues like an unsplit batch.
Hi, a few clarifying questions:
1. Is the throughput (steps/s) consistent across runs?
2. Could you also share seqs_per_second and seqs_per_second_per_core?
3. Are you using model parallelism (num_partitions>1)?
4. Could you also share the global_mesh devices printed in the logs? Something like:
global_mesh axis_names: ('data', 'model')
global_mesh devices: [[StreamExecutorGpuDevice(id=0, process_index=0)]
[StreamExecutorGpuDevice(id=1, process_index=1)]
[StreamExecutorGpuDevice(id=8, process_index=2)]
...
Sure thing.
num_partitions = 1
-- 2 Nodes 2 Process --
I1018 02:49:33.758280 140684459909120 partitioning.py:331] global_mesh axis_names: ('data', 'model')
I1018 02:49:33.758435 140684459909120 partitioning.py:332] global_mesh devices: [[StreamExecutorGpuDevice(id=0, process_index=0)]
[StreamExecutorGpuDevice(id=1, process_index=0)]
[StreamExecutorGpuDevice(id=2, process_index=0)]
[StreamExecutorGpuDevice(id=3, process_index=0)]
[StreamExecutorGpuDevice(id=4, process_index=0)]
[StreamExecutorGpuDevice(id=5, process_index=0)]
[StreamExecutorGpuDevice(id=6, process_index=0)]
[StreamExecutorGpuDevice(id=7, process_index=0)]
[StreamExecutorGpuDevice(id=8, process_index=1)]
[StreamExecutorGpuDevice(id=9, process_index=1)]
[StreamExecutorGpuDevice(id=10, process_index=1)]
[StreamExecutorGpuDevice(id=11, process_index=1)]
[StreamExecutorGpuDevice(id=12, process_index=1)]
[StreamExecutorGpuDevice(id=13, process_index=1)]
[StreamExecutorGpuDevice(id=14, process_index=1)]
[StreamExecutorGpuDevice(id=15, process_index=1)]]
-- 2 Nodes 4 Process --
I1018 04:01:34.842234 140331303997440 partitioning.py:331] global_mesh axis_names: ('data', 'model')
I1018 04:01:34.842025 140325101854720 partitioning.py:332] global_mesh devices: [[StreamExecutorGpuDevice(id=0, process_index=0)]
[StreamExecutorGpuDevice(id=1, process_index=0)]
[StreamExecutorGpuDevice(id=2, process_index=0)]
[StreamExecutorGpuDevice(id=3, process_index=0)]
[StreamExecutorGpuDevice(id=4, process_index=1)]
[StreamExecutorGpuDevice(id=5, process_index=1)]
[StreamExecutorGpuDevice(id=6, process_index=1)]
[StreamExecutorGpuDevice(id=7, process_index=1)]
[StreamExecutorGpuDevice(id=8, process_index=2)]
[StreamExecutorGpuDevice(id=9, process_index=2)]
[StreamExecutorGpuDevice(id=10, process_index=2)]
[StreamExecutorGpuDevice(id=11, process_index=2)]
[StreamExecutorGpuDevice(id=12, process_index=3)]
[StreamExecutorGpuDevice(id=13, process_index=3)]
[StreamExecutorGpuDevice(id=14, process_index=3)]
[StreamExecutorGpuDevice(id=15, process_index=3)]]
-- 2 Nodes 8 Process --
I1018 04:07:24.959209 140086820651008 partitioning.py:331] global_mesh axis_names: ('data', 'model')
I1018 04:07:24.959397 140086820651008 partitioning.py:332] global_mesh devices: [[StreamExecutorGpuDevice(id=0, process_index=0)]
[StreamExecutorGpuDevice(id=1, process_index=0)]
[StreamExecutorGpuDevice(id=2, process_index=1)]
[StreamExecutorGpuDevice(id=3, process_index=1)]
[StreamExecutorGpuDevice(id=4, process_index=2)]
[StreamExecutorGpuDevice(id=5, process_index=2)]
[StreamExecutorGpuDevice(id=6, process_index=3)]
[StreamExecutorGpuDevice(id=7, process_index=3)]
[StreamExecutorGpuDevice(id=8, process_index=4)]
[StreamExecutorGpuDevice(id=9, process_index=4)]
[StreamExecutorGpuDevice(id=10, process_index=5)]
[StreamExecutorGpuDevice(id=11, process_index=5)]
[StreamExecutorGpuDevice(id=12, process_index=6)]
[StreamExecutorGpuDevice(id=13, process_index=6)]
[StreamExecutorGpuDevice(id=14, process_index=7)]
[StreamExecutorGpuDevice(id=15, process_index=7)]]
I'm exploring how to use t5x in a multi-node GPU setting. I'm using SLURM with a Singularity container to execute the training script, but this doesn't seem to work.
Another method I tried is to launch two processes with a hard-coded process_index (I did this in an interactive shell). process_index 1 seems to work as intended, but process_index 0 fails.