microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Errors in cifar training and compression #690

Open · Samanthavsilva opened this issue 1 year ago

Samanthavsilva commented 1 year ago

I am trying to run the CIFAR example, but the one in the training folder gives me one error and the one in the compression folder gives a different error.

Python 3.9.16, PyTorch 1.13.0, DeepSpeed 0.9.5, CUDA 11.7

```
Singularity> bash run_ds.sh
[2023-08-16 16:30:21,720] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:22,197] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2023-08-16 16:30:22,214] [INFO] [runner.py:555:main] cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2023-08-16 16:30:23,632] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-08-16 16:30:24,125] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-08-16 16:30:24,125] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-08-16 16:30:24,125] [INFO] [launch.py:163:main] dist_world_size=2
[2023-08-16 16:30:24,125] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:627:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
truck dog deer cat
[2023-08-16 16:30:30,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
Traceback (most recent call last):
  File "/ocean/projects/cis230018p/ssilva/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 313, in <module>
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
  File "/opt/miniconda/envs/env2/lib/python3.9/site-packages/deepspeed/__init__.py", line 146, in initialize
    assert config is None, "Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call"
AssertionError: Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed.initialize() function call
Files already downloaded and verified
[2023-08-16 16:30:31,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 11518
[2023-08-16 16:30:31,147] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 11519
```

```
Singularity> bash run_compress.sh
/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2023-08-16 16:52:34,170] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
usage: train.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--local_rank LOCAL_RANK] [--lr LR] [--lr-decay LR_DECAY]
                [--lr-decay-epoch LR_DECAY_EPOCH [LR_DECAY_EPOCH ...]] [--seed S] [--weight-decay W] [--batch-norm] [--residual] [--cuda]
                [--saving-folder SAVING_FOLDER] [--compression] [--path-to-model PATH_TO_MODEL] [--deepspeed] [--deepspeed_config DEEPSPEED_CONFIG]
                [--deepscale] [--deepscale_config DEEPSCALE_CONFIG] [--deepspeed_mpi]
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 16794) of binary: /opt/miniconda/envs/env2/bin/python
Traceback (most recent call last):
  File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

PareesaMS commented 11 months ago

For the one in the training folder: the script is being handed a path to a DeepSpeed config JSON file that it does not need. You can see this in the command being executed: `cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json`

The correct command should not include the `--deepspeed_config ds_config.json` flag.
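For context, the AssertionError comes from the config being supplied twice: once on the command line and once inside the script. A minimal sketch of the conflict, assuming an illustrative model and config dict rather than the example's actual ones:

```python
import argparse

import deepspeed
import torch.nn as nn

parser = argparse.ArgumentParser()
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
args = parser.parse_args()

ds_config = {"train_batch_size": 16}     # illustrative config dict
net = nn.Linear(32 * 32 * 3, 10)         # placeholder model

# If config= is passed here AND the launcher also supplied
# --deepspeed_config ds_config.json (so args.deepspeed_config is set),
# deepspeed.initialize() cannot tell which config to use and raises the
# assertion seen in the log. Supply the config through exactly one path.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=net.parameters(),
    config=ds_config,
)
```

Dropping the flag from the launch command (or, alternatively, not passing `config=` in the script) removes the ambiguity.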

I also see you are not using the latest version of DeepSpeed. Please upgrade to 0.10.x and try again, as I don't see this issue when running 0.10: `pip install --upgrade deepspeed`
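If it helps, a quick way to confirm which DeepSpeed build the environment actually picks up after the upgrade:

```python
# Prints the DeepSpeed version visible to this Python environment;
# after `pip install --upgrade deepspeed` it should report 0.10.x or newer.
import deepspeed

print(deepspeed.__version__)
```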

For the compression one: I see the same issue. Let me dig deeper and report back to you
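As a side note on the compression error above: the log shows that train.py only defines `--local_rank`, while newer torch.distributed.launch passes `--local-rank` (and torchrun sets the `LOCAL_RANK` environment variable). A hedged sketch of one possible workaround, not necessarily the repo's actual fix:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Accept both spellings and fall back to the LOCAL_RANK env var set by torchrun.
parser.add_argument(
    "--local_rank", "--local-rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", -1)),
    help="local rank passed in by the distributed launcher",
)
args, _ = parser.parse_known_args()
print(f"local rank: {args.local_rank}")
```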

Samanthavsilva commented 11 months ago

> For the one in the training folder: the script is being handed a path to a DeepSpeed config JSON file that it does not need. You can see this in the command being executed: `cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json`
>
> The correct command should not include the `--deepspeed_config ds_config.json` flag.
>
> I also see you are not using the latest version of DeepSpeed. Please upgrade to 0.10.x and try again, as I don't see this issue when running 0.10: `pip install --upgrade deepspeed`
>
> For the compression one: I see the same issue. Let me dig deeper and report back to you

I have just started over in a new environment and upgraded DeepSpeed, but I keep getting this issue:

```
[2023-10-07 01:37:30,894] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-10-07 01:37:30,894] [INFO] [launch.py:163:main] dist_world_size=2
[2023-10-07 01:37:30,894] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
libnuma: Warning: cpu argument 0-19 is out of range
<0-19> is invalid
usage: numactl [--all | -a] [--interleave= | -i] [--preferred= | -p] [--physcpubind= | -C] [--cpunodebind= | -N] [--membind= | -m] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       [... remainder of numactl usage/help output ...]
[2023-10-07 01:37:30,949] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 70145
libnuma: Warning: cpu argument 20-39 is out of range
<20-39> is invalid
[... numactl usage/help output repeated ...]
[2023-10-07 01:37:30,950] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 70148
```
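One hedged diagnostic idea, not a confirmed fix: the libnuma warnings suggest numactl is being asked to bind to CPU ranges (0-19, 20-39) that the Singularity container cannot actually see, so it may help to compare those ranges against the CPUs visible inside the container:

```python
import os

# CPUs this process is allowed to run on inside the container (Linux only).
print(sorted(os.sched_getaffinity(0)))
# Total CPU count the container reports.
print(os.cpu_count())
```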
Blenderama commented 9 months ago

Just remove `--deepspeed_config ds_config.json \` in run_ds.sh