microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

DeepSpeed initialization with GNN-like model #2029

Open · buttercutter opened this issue 2 years ago

buttercutter commented 2 years ago

My code is quite similar to a GNN structure: NN_output = graph.forward(NN_input, types="f")

So outputs = model_engine(inputs) does not seem to fit my case, and the usual args convention does not match my code either.

Any ideas?
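For reference, a minimal sketch of the call pattern I am after (Graph here is a small stand-in for my model, and this assumes the engine forwards extra keyword arguments through to the wrapped module's forward):

import torch
import deepspeed

class Graph(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 8)

    def forward(self, x, types="f"):
        # In my real model, types selects which internal pass to run.
        return self.lin(x)

graph = Graph()
model_engine, _, _, _ = deepspeed.initialize(
    model=graph,
    model_parameters=graph.parameters(),
    config_params={"train_micro_batch_size_per_gpu": 1})

NN_input = torch.randn(1, 8)
# Hoping the extra keyword argument is passed through to Graph.forward().
NN_output = model_engine(NN_input, types="f")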

buttercutter commented 2 years ago

I made some code modifications; however, I could not initialize DeepSpeed properly.

/home/phung/PycharmProjects/venv/py39/bin/python /home/phung/PycharmProjects/beginner_tutorial/gdas.py
Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 17:00:25,770] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 17:00:25,782] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 17:00:27,782] [INFO] [distributed.py:85:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=archlinux, master_port=29500
[2022-07-13 17:00:27,782] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "/home/phung/PycharmProjects/beginner_tutorial/gdas.py", line 936, in <module>
    model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 238, in __init__
    self._do_args_sanity_check(args)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 900, in _do_args_sanity_check
    assert (
AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file

Process finished with exit code 1

tjruwase commented 2 years ago

@buttercutter, you are missing a DeepSpeed config file, which is passed on the command line via --deepspeed_config.

Alternatively, you can pass a dict as config_params to deepspeed.initialize().
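For example, a minimal sketch using the names from this thread (the learning rate is illustrative):

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 0.05}},
}

# The dict replaces the JSON file, so no --deepspeed_config flag is needed.
model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    model=graph_, model_parameters=parameters,
    training_data=trainset, config_params=ds_config)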

buttercutter commented 2 years ago

Do you have a recommended DeepSpeed configuration file?

Note: the DeepSpeed configuration for training a transformer-like network structure might differ from that for a GNN-like network structure.

buttercutter commented 2 years ago

If I use a sample configuration file from HuggingFace, I get the following error:

model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_, model_parameters=parameters, training_data=trainset, config_params='./ds_config.json')

/home/phung/PycharmProjects/venv/py39/bin/python /home/phung/PycharmProjects/beginner_tutorial/gdas.py
Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 19:10:10,635] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 19:10:10,648] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 19:10:12,517] [INFO] [distributed.py:85:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=archlinux, master_port=29500
[2022-07-13 19:10:12,517] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "/home/phung/PycharmProjects/beginner_tutorial/gdas.py", line 936, in <module>
    model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 239, in __init__
    self._configure_with_arguments(args, mpu)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
    self._config = DeepSpeedConfig(self.config, mpu)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 874, in __init__
    self._initialize_params(self._param_dict)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 903, in _initialize_params
    assert not (self.fp16_enabled and self.bfloat16_enabled), 'bfloat16 and fp16 modes cannot be simultaneously enabled'
AssertionError: bfloat16 and fp16 modes cannot be simultaneously enabled

Process finished with exit code 1

Besides this, the IDE also flags the following two issues:

Cannot find reference 'parse_args' in 'parser.pyi' at line 917

Expected type 'Optional[Module]', got 'filter[Parameter]' instead at line 939

tjruwase commented 2 years ago

DeepSpeed configuration is meant to be network-agnostic, so in reality that configuration file would work except for the auto fields, which are defined for the HF frontend. The configuration file enables or disables features of the DeepSpeed framework rather than specifying or controlling network properties. You can start with a minimal configuration file that defines just the micro batch size, the optimizer, and logging, like the one below:

{
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": <add your learning rate>
        }
    }
}

You can progressively add more configuration knobs as you get more familiar with DeepSpeed.
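If you prefer the file-based route from your earlier traceback, a minimal sketch of wiring it up without the deepspeed launcher (the file and variable names follow this thread):

import argparse
import deepspeed

parser = argparse.ArgumentParser()
# add_config_arguments registers --deepspeed and --deepspeed_config.
parser = deepspeed.add_config_arguments(parser)
args_ = parser.parse_args(["--deepspeed", "--deepspeed_config", "ds_config.json"])

model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    args=args_, model=graph_, model_parameters=parameters, training_data=trainset)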

buttercutter commented 2 years ago

I get the following runtime error about conflicting batch_size values:

ValueError: Expected input batch_size (8) to match target batch_size (1).

Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 13:15:18,174] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 13:15:18,188] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 13:15:18,635] [INFO] [distributed.py:91:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.28.0.2, master_port=29500
[2022-07-13 13:15:18,635] [INFO] [distributed.py:49:init_distributed] Initializing torch distributed with backend: nccl
[2022-07-13 13:15:18,765] [INFO] [engine.py:279:__init__] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.1 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py37_cu113/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 31.784398078918457 seconds
[2022-07-13 13:15:51,799] [INFO] [engine.py:1102:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2022-07-13 13:15:52,015] [INFO] [engine.py:1109:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2022-07-13 13:15:52,015] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-07-13 13:15:52,016] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2022-07-13 13:15:52,016] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-07-13 13:15:52,016] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-07-13 13:15:52,020] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   amp_enabled .................. False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   amp_params ................... False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": null, 
    "exps_dir": null, 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   bfloat16_enabled ............. False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   checkpoint_tag_validation_enabled  True
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   checkpoint_tag_validation_fail  False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   communication_data_type ...... None
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   curriculum_enabled ........... False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   curriculum_params ............ False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dataloader_drop_last ......... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   disable_allgather ............ False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dump_state ................... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dynamic_loss_scale_args ...... None
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_enabled ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_gas_boundary_resolution  1
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_layer_num ......... 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_max_iter .......... 100
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_stability ......... 1e-06
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_tol ............... 0.01
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_verbose ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   elasticity_enabled ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_enabled ................. False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_master_weights_and_gradients  False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_mixed_quantize .......... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   global_rank .................. 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_accumulation_steps .. 1
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_clipping ............ 0.0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_predivide_factor .... 1.0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   initial_dynamic_scale ........ 4294967296
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   loss_scale ................... 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   memory_breakdown ............. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_legacy_fusion ...... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_name ............... adamw
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_params ............. {'lr': 0.05}
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pld_enabled .................. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pld_params ................... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   prescale_gradients ........... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_change_rate ......... 0.001
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_groups .............. 1
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_offset .............. 1000
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_period .............. 1000
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_rounding ............ 0
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_start_bits .......... 16
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_target_bits ......... 8
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_training_enabled .... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_type ................ 0
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_verbose ............. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   scheduler_name ............... None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   scheduler_params ............. None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   sparse_attention ............. None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   sparse_gradients_enabled ..... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   steps_per_print .............. 1
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_enabled .......... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_job_name ......... DeepSpeedJobName
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_output_path ...... 
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   train_batch_size ............. 1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   train_micro_batch_size_per_gpu  1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   use_quantizer_kernel ......... False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   wall_clock_breakdown ......... False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   world_size ................... 1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_allow_untested_optimizer  False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_config .................. {
    "stage": 0, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 5.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 5.000000e+08, 
    "overlap_comm": false, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": false, 
    "offload_param": null, 
    "offload_optimizer": null, 
    "sub_group_size": 1.000000e+09, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_16bit_weights_on_model_save": false, 
    "ignore_unused_parameters": true, 
    "round_robin_gradients": false, 
    "legacy_stage1": false
}
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_enabled ................. False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_optimization_stage ...... 0
[2022-07-13 13:15:52,024] [INFO] [config.py:1071:print]   json = {
    "train_micro_batch_size_per_gpu": 1, 
    "steps_per_print": 1, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 0.05
        }
    }
}
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py37_cu113/utils...
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 16.06555199623108 seconds
run_num =  0
Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 690, in train_NN
    Ltrain = criterion(NN_output, NN_train_labels)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
[85b173f58da1:00656] *** Process received signal ***
[85b173f58da1:00656] Signal: Segmentation fault (11)
[85b173f58da1:00656] Signal code: Address not mapped (1)
[85b173f58da1:00656] Failing at address: 0x7f751665320d
[85b173f58da1:00656] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f75192fd980]
[85b173f58da1:00656] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f7518f3c775]
[85b173f58da1:00656] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f75197a7e44]
[85b173f58da1:00656] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f7518f3d605]
[85b173f58da1:00656] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f75197a5cb3]
[85b173f58da1:00656] *** End of error message ***

tjruwase commented 2 years ago

Set "train_micro_batch_size_per_gpu" to 8 in the configuration file.

buttercutter commented 2 years ago

May I ask if retain_graph=True is fully supported now?

tjruwase commented 2 years ago

It should be, but please report any issues.

buttercutter commented 2 years ago

model_engine.backward(Ltrain, retain_graph=True) gave the following error:

Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 700, in train_NN
    model_engine.backward(Ltrain, retain_graph=True)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
TypeError: backward() got an unexpected keyword argument 'retain_graph'

buttercutter commented 2 years ago

@tjruwase May I know why retain_graph still does not work for me?

tjruwase commented 2 years ago

Sorry, it appears #1149 was never merged. Unfortunately, it has a conflict with master. Can you please try picking that up?

tjruwase commented 2 years ago

@buttercutter, #1149 is now merged. Please try master.
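Assuming an install from current master (for example, pip install git+https://github.com/microsoft/DeepSpeed.git), the call from your earlier traceback should now be accepted. A minimal sketch:

# retain_graph=True keeps the autograd graph alive after this backward
# pass, so a second backward through the same graph is possible.
model_engine.backward(Ltrain, retain_graph=True)
model_engine.step()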

buttercutter commented 2 years ago

@tjruwase

Why do I get the Expected type 'Module | None', got 'filter[Parameter]' instead error for model_parameters?

tjruwase commented 2 years ago

This is a type error. Please see the docs for deepspeed.initialize().
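For what it's worth, the documentation describes model_parameters as an iterable of parameters, so the warning most likely comes from a loose type stub rather than the runtime. A minimal sketch using this thread's names; materializing the filter also avoids exhausting a single-use iterator:

# filter() is a one-shot iterator; a list can be iterated repeatedly
# and still satisfies the "iterable of parameters" the docs describe.
parameters = list(filter(lambda p: p.requires_grad, graph_.parameters()))

model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    args=args_, model=graph_, model_parameters=parameters,
    training_data=trainset, config_params='./ds_config.json')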

buttercutter commented 2 years ago

The same code works perfectly fine in the Google Colab GPU cloud environment.

So I guess the type error above is due to a local installation issue.

However, DeepSpeed still gives RuntimeError: CUDA out of memory. Could you advise on what could have gone wrong?

tjruwase commented 2 years ago

> The same code works perfectly fine in the Google Colab GPU cloud environment.
>
> So I guess the type error above is due to a local installation issue.

This is quite strange. It would be good to figure out what is different about the local and colab installations. Do you mind printing out the types of every parameter passed to deepspeed.initialize()?

buttercutter commented 2 years ago

Exception: Installed CUDA version 11.7 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.

The local installation seems to have failed due to a CUDA/torch version incompatibility.

The following is the output from the online Google Colab GPU cloud environment:

print("type(args) = ", type(args_))
print("type(graph_) = ", type(graph_))
print("type(parameters) = ", type(parameters))
print("type(trainset) = ", type(trainset))

type(args) =  <class 'argparse.Namespace'>
type(graph_) =  <class '__main__.Graph'>
type(parameters) =  <class 'filter'>
type(trainset) =  <class 'torchvision.datasets.cifar.CIFAR10'>

buttercutter commented 1 year ago

@tjruwase I see no issue with the initialization code, at least within the working online Google Colab GPU cloud environment.

Shall I open a separate GitHub issue, since this is an entirely different problem?

tjruwase commented 1 year ago

@buttercutter, yes, please open a new issue. Thanks!