microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] KeyError: 'LOCAL_RANK' #1682

Open ShivamSharma2705 opened 2 years ago

ShivamSharma2705 commented 2 years ago

Hey guys,

When I try to train a new GPT-2 model using pretrain_gpt.sh, I get the following error:


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/people/shar703/anaconda3/envs/mega_ai/lib/python3.8/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/people/shar703/anaconda3/envs/mega_ai/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.9, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0
Git info for Megatron: git_hash=1ac4a44 git_branch=main
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_in_cpu ............................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_train_tokens ........................... 0
consumed_valid_samples .......................... 0
contigious_checkpointing ........................ False
cpu_optimizer ................................... False
cpu_torch_adam .................................. False
curriculum_learning ............................. False
data_impl ....................................... mmap
data_parallel_size .............................. 1
data_path ....................................... ['cord19/chemistry_cord19_abstract_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
deepscale ....................................... False
deepscale_config ................................ None
deepspeed ....................................... False
deepspeed_activation_checkpointing .............. False
deepspeed_config ................................ None
deepspeed_mpi ................................... False
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embedding_path .................................. None
encoder_seq_length .............................. 1024
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ checkpoints/cord19_gpt2_345m
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_decay_tokens ................................. None
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
memory_centric_tiled_linear ..................... False
merge_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
partition_activations ........................... False
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
profile_backward ................................ False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
remote_device ................................... none
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ checkpoints/cord19_gpt2_345m
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
scattered_embeddings ............................ False
seed ............................................ 1234
seq_length ...................................... 1024
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 949,50,1
split_transformers .............................. False
synchronize_each_layer .......................... False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
tile_factor ..................................... 1
titles_data_path ................................ None
tokenizer_type .................................. GPT2BPETokenizer
train_iters ..................................... 500000
train_samples ................................... None
train_tokens .................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
use_pin_memory .................................. False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-vocab.json
weight_decay .................................... 0.01
world_size ...................................... 1
zero_allgather_bucket_size ...................... 0.0
zero_contigious_gradients ....................... False
zero_reduce_bucket_size ......................... 0.0
zero_reduce_scatter ............................. False
zero_stage ...................................... 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder ...
make: Entering directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
done with dataset index builder. Compilation time: 0.124 seconds
compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
done with compiling and loading fused kernels. Compilation time: 1.697 seconds
time to initialize megatron (seconds): 45.207
[after megatron is initialized] datetime: 2022-01-07 09:41:01
building GPT model ...
[2022-01-07 09:41:01,270] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-01-07 09:41:01,271] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-01-07 09:41:01,271] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory: used = 154.89 GB, percent = 15.4%
Traceback (most recent call last):
  File "pretrain_gpt.py", line 231, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 131, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 334, in setup_model_and_optimizer
    model = get_model(model_provider_func)
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 232, in get_model
    model = model_provider_func(
  File "pretrain_gpt.py", line 44, in model_provider
    with deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(),
  File "/people/shar703/anaconda3/envs/mega_ai/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 480, in __init__
    self.local_device = torch.device('cuda:{}'.format(os.environ["LOCAL_RANK"]))
  File "/people/shar703/anaconda3/envs/mega_ai/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'

yetiansh commented 2 years ago

I also get this error with DeepSpeed v0.5.9. If you only use pretrain_gpt.sh to train on a single card, `export LOCAL_RANK=0` is an ad hoc workaround.
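A minimal sketch of that workaround, assuming a single-process, single-GPU run launched directly from the shell; the variables beyond LOCAL_RANK are the standard torch.distributed ones a launcher would normally export, and the values here are illustrative:

```bash
# Ad hoc single-GPU workaround: hand-set the env vars a launcher would provide.
export LOCAL_RANK=0          # the variable DeepSpeed's zero.Init reads
export RANK=0                # global rank of this (only) process
export WORLD_SIZE=1          # total number of processes
export MASTER_ADDR=localhost
export MASTER_PORT=6000      # any free port

bash pretrain_gpt.sh
```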

jeffra commented 2 years ago

The LOCAL_RANK environment variable is set by either the deepspeed launcher or the pytorch launcher (e.g., torch.distributed.launch). I would suggest launching via one of these two methods.
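For illustration, a sketch of both launch styles; the trailing placeholder stands in for whatever arguments pretrain_gpt.sh normally passes to pretrain_gpt.py:

```bash
# Option 1: the DeepSpeed launcher exports LOCAL_RANK/RANK/WORLD_SIZE for each process.
deepspeed --num_gpus=1 pretrain_gpt.py <megatron args...>

# Option 2: the PyTorch launcher. On older PyTorch (e.g. 1.7.x) add --use_env so it
# exports LOCAL_RANK instead of only passing a --local_rank argument; the newer
# torchrun always sets the env var.
python -m torch.distributed.launch --use_env --nproc_per_node=1 pretrain_gpt.py <megatron args...>
```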

Samanthavsilva commented 1 year ago

I am having the same issue. Were you able to fix it?

shoang22 commented 1 year ago

> The LOCAL_RANK environment variable is set by either the deepspeed launcher or the pytorch launcher (e.g., torch.distributed.launch). I would suggest launching via one of these two methods.

What about when using PyTorch Lightning on a SLURM cluster?
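Not an official answer, but one generic workaround on SLURM (independent of Lightning) is to derive LOCAL_RANK from the variables SLURM itself sets for each srun task; the script name below is a hypothetical entry point and the resource directives are illustrative:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # one task per GPU (illustrative)

# SLURM_LOCALID / SLURM_PROCID / SLURM_NTASKS are set per srun task, so the
# mapping has to happen inside each task, not in the batch step.
srun bash -c 'export LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS && python train.py'
```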

lizc126 commented 7 months ago

> The LOCAL_RANK environment variable is set by either the deepspeed launcher or the pytorch launcher (e.g., torch.distributed.launch). I would suggest launching via one of these two methods.
>
> What about when using PyTorch Lightning on a SLURM cluster?

Hi, I've hit the same issue. Have you found any solutions?