modular-ml / wrapyfi-examples_llama

Inference code for facebook LLaMA models with Wrapyfi support
GNU General Public License v3.0

2 PCs with 4 GPUs each #2

Closed sophieyl820 closed 1 year ago

sophieyl820 commented 1 year ago

I have 2 machines with 4 GPUs each. On 192.168.2.14 I ran CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 4 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2 and it reported:

tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
Traceback (most recent call last):
  File "example.py", line 125, in <module>
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
Traceback (most recent call last):
  File "example.py", line 125, in <module>
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "example.py", line 125, in <module>
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "example.py", line 125, in <module>
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 82, in main
    generator = load(
  File "example.py", line 44, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=8 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1426263) of binary: /home/winner/anaconda3/envs/py38_pt1110/bin/python
Traceback (most recent call last):
  File "/home/winner/anaconda3/envs/py38_pt1110/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1426264)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1426265)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1426266)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1426263)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
fabawi commented 1 year ago

Did you try --nproc_per_node 1? Setting CUDA_VISIBLE_DEVICES=0 means the job only sees one GPU, but it tries to run on 4.
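
For reference, a minimal sketch of the same command with only that flag changed (everything else, including paths and IP, is taken verbatim from the command above):

CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2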

fabawi commented 1 year ago

Another thing you'd want to change: to run on 2 machines, each with 4 GPUs, you'd have to pass the arguments:

--wrapyfi_device_idx 7 --wrapyfi_total_devices 8

instead of 1 and 2. This does not follow the fairscale distribution style with ranks and world sizes. Here, each node knows nothing about the others; it only knows how large the world is (in this case --wrapyfi_total_devices) and where to listen. Check out the 4-machine variant in the README and use the same approach for larger models or more machines: change the model weights location, set --wrapyfi_device_idx 7 and --wrapyfi_total_devices 8, and then, for each instance running in its own terminal, run the same command, reducing --wrapyfi_device_idx by 1 until you reach 0.

Remember to set CUDA_VISIBLE_DEVICES according to the GPU you'd like each partial LLaMA instance to run on. So if you run the last instance (idx = 7), you can set CUDA_VISIBLE_DEVICES to 0. On the other machine you'd then run, say, idx = 3 on CUDA device 0, idx = 2 on CUDA device 1, and so on.
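
A sketch of what the full layout could look like, assuming the 65B weights sit at the same path on both machines (adjust --ckpt_dir otherwise) and that both machines point WRAPYFI_ZEROMQ_SOCKET_IP at 192.168.2.14 as in your command, with one command per terminal:

# machine 1: instances 7..4 on CUDA devices 0..3
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 7 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 6 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="2" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 5 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="3" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 4 --wrapyfi_total_devices 8
# machine 2: instances 3..0 on CUDA devices 0..3
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 3 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 2 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="2" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 8
CUDA_VISIBLE_DEVICES="3" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 0 --wrapyfi_total_devices 8

The GPU-to-idx pairing above just mirrors the previous paragraph; pick whichever GPUs suit your setup, as long as each idx from 7 down to 0 is launched exactly once with --wrapyfi_total_devices 8.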

sophieyl820 commented 1 year ago

I ran this on the 192.168.2.14 machine, but still got the same assertion error... If I comment out the assert line, a checkpoint-loading error comes up instead.

CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 1 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 7 --wrapyfi_total_devices 8