Samanthavsilva opened this issue 1 year ago
For the one in the training folder: it looks like the script is being given a path to a ds_config JSON file when it doesn't need one. This is also visible in the command being executed:

```
cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
```

The correct command does not need the `--deepspeed_config ds_config.json` flag.
I see you are not using the latest version of DeepSpeed. Please upgrade to 0.10 and try again, as I don't see this issue when running 0.10: `pip install --upgrade deepspeed`
For the compression one: I see the same issue. Let me dig deeper and report back to you
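For context, the `AssertionError` reported below is DeepSpeed refusing to pick between two configs: one from the `--deepspeed_config` CLI flag and one passed directly to `deepspeed.initialize()`. A minimal sketch of that mutual-exclusion check (a hypothetical standalone helper, not DeepSpeed's actual code) reproduces the behavior:

```python
# Sketch of the config conflict that deepspeed.initialize() guards against:
# a config may arrive via the --deepspeed_config CLI flag (stored on args)
# *or* via the config= keyword, but never both.
from types import SimpleNamespace


def resolve_config(args=None, config=None):
    """Hypothetical helper mirroring DeepSpeed's mutual-exclusion assert."""
    cli_config = getattr(args, "deepspeed_config", None) if args else None
    if cli_config is not None and config is not None:
        raise ValueError(
            "Not sure how to proceed, we were given deepspeed configs in the "
            "deepspeed arguments and deepspeed.initialize() function call"
        )
    # Use whichever source actually supplied a config.
    return config if config is not None else cli_config


# Passing the config both ways reproduces the error from the log below:
args = SimpleNamespace(deepspeed_config="ds_config.json")
try:
    resolve_config(args=args, config={"train_batch_size": 16})
except ValueError:
    print("conflict detected")
```

So the fix is to supply the config exactly once: either keep the flag in `run_ds.sh` and don't pass a config in the script, or drop the flag and keep the in-script config.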
I have just started over in a new environment and upgraded DeepSpeed, but I keep getting this issue:

```
[2023-10-07 01:37:30,894] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-10-07 01:37:30,894] [INFO] [launch.py:163:main] dist_world_size=2
[2023-10-07 01:37:30,894] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
libnuma: Warning: cpu argument 0-19 is out of range
<0-19> is invalid
usage: numactl [--all | -a] [--interleave= | -i
```

Just remove `--deepspeed_config ds_config.json \` in run_ds.sh
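The suggested fix amounts to dropping the duplicate config flag (and its value) from the launcher command line. A small illustrative sketch (hypothetical helper, not part of DeepSpeed) of what that edit does to the argument list:

```python
# Hypothetical helper: remove a "--flag value" pair from a launcher argv.
def drop_flag(argv, flag):
    out, skip_next = [], False
    for arg in argv:
        if skip_next:          # this element is the flag's value; drop it too
            skip_next = False
            continue
        if arg == flag:
            skip_next = True
            continue
        out.append(arg)
    return out


argv = ["cifar10_deepspeed.py", "--deepspeed", "--deepspeed_config", "ds_config.json"]
print(drop_flag(argv, "--deepspeed_config"))
# → ['cifar10_deepspeed.py', '--deepspeed']
```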
I am trying to run the CIFAR example, but for the one in the training folder I get one error, and for the one in compression a different error.

Python=3.9.16, PyTorch=1.13.0, DeepSpeed=0.9.5, CUDA=11.7
Singularity> bash run_ds.sh

```
[2023-08-16 16:30:21,720] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:22,197] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2023-08-16 16:30:22,214] [INFO] [runner.py:555:main] cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2023-08-16 16:30:23,632] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-08-16 16:30:24,125] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-08-16 16:30:24,125] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-08-16 16:30:24,125] [INFO] [launch.py:163:main] dist_world_size=2
[2023-08-16 16:30:24,125] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:627:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
truck dog deer cat
[2023-08-16 16:30:30,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
Traceback (most recent call last):
  File "/ocean/projects/cis230018p/ssilva/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 313, in <module>
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
  File "/opt/miniconda/envs/env2/lib/python3.9/site-packages/deepspeed/__init__.py", line 146, in initialize
    assert config is None, "Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call"
AssertionError: Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call
Files already downloaded and verified
[2023-08-16 16:30:31,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 11518
[2023-08-16 16:30:31,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 11519
```
Singularity> bash run_compress.sh

```
/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects `--local-rank` argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
[2023-08-16 16:52:34,170] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
usage: train.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--local_rank LOCAL_RANK] [--lr LR] [--lr-decay LR_DECAY] [--lr-decay-epoch LR_DECAY_EPOCH [LR_DECAY_EPOCH ...]] [--seed S] [--weight-decay W] [--batch-norm] [--residual] [--cuda] [--saving-folder SAVING_FOLDER] [--compression] [--path-to-model PATH_TO_MODEL] [--deepspeed] [--deepspeed_config DEEPSPEED_CONFIG] [--deepscale] [--deepscale_config DEEPSCALE_CONFIG] [--deepspeed_mpi]
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 16794) of binary: /opt/miniconda/envs/env2/bin/python
Traceback (most recent call last):
  File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
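The `unrecognized arguments: --local-rank=0` failure happens because newer torch.distributed launchers pass the dash-spelled flag, while the `usage:` line shows `train.py` only declares `--local_rank`. One workaround, sketched below with a plain argparse setup (an assumption, not the repo's actual `train.py` code), is to accept both spellings and fall back to the `LOCAL_RANK` environment variable that torchrun sets, as the FutureWarning above suggests:

```python
# Accept both spellings of the local-rank flag and default to the
# LOCAL_RANK env var, which torchrun/torch.distributed.run exports.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--local_rank", "--local-rank",  # old launcher passes the first form, newer ones the second
    dest="local_rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", "0")),
)

# The dashed form is now parsed instead of being rejected as unrecognized:
args = parser.parse_args(["--local-rank=1"])
print(args.local_rank)
# → 1
```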