microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

DeepSpeed still gives CUDA-out-of-memory issue #2302

Open buttercutter opened 2 years ago

buttercutter commented 2 years ago

May I know why this training code still gives a CUDA out-of-memory error even after DeepSpeed is turned on?

[screenshot of the out-of-memory error]

See this for historical tracking purposes.

tjruwase commented 2 years ago

DeepSpeed's memory optimizations are enabled via ZeRO stages >= 1, and most of them are effective in multi-GPU cases. Your log shows that ZeRO is not enabled: [screenshot]

Please see these tutorials (a minimal stage-1 config sketch follows the links):

  1. https://www.deepspeed.ai/tutorials/zero/
  2. https://www.deepspeed.ai/tutorials/zero-offload/
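
For orientation, a minimal sketch (not from this thread) of what enabling ZeRO stage 1 could look like; `net` is a placeholder for the client model, and passing the config as a Python dict instead of the path to ds_config.json is just one of the supported options:

    import deepspeed
    import torch

    net = torch.nn.Linear(10, 10)  # placeholder for the actual model

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "steps_per_print": 1,
        "optimizer": {"type": "AdamW", "params": {"lr": 0.05}},
        "zero_optimization": {"stage": 1},  # stage >= 1 turns on ZeRO's memory optimizations
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=net, model_parameters=net.parameters(), config=ds_config
    )
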
buttercutter commented 2 years ago
[2022-09-09 15:01:50,764] [INFO] [logging.py:68:log_dist] [Rank 0] step=1248, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:50,765] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=14.454011142357578, CurrSamplesPerSec=9.851701418075939, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:51,547] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:51,547] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=14.449232505063256, CurrSamplesPerSec=10.233603997501561, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:52,349] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:52,349] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=14.444045524931788, CurrSamplesPerSec=9.977597824784409, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 737, in train_NN
    torch.save(graph, path)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 379, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 589, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn'
[6eb301421b63:00906] *** Process received signal ***
[6eb301421b63:00906] Signal: Segmentation fault (11)
[6eb301421b63:00906] Signal code: Address not mapped (1)
[6eb301421b63:00906] Failing at address: 0x7f4e4f16420d
[6eb301421b63:00906] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f4e52011980]
[6eb301421b63:00906] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f4e51c50775]
[6eb301421b63:00906] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f4e524bbe44]
[6eb301421b63:00906] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f4e51c51605]
[6eb301421b63:00906] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f4e524b9cb3]
[6eb301421b63:00906] *** End of error message ***

I have updated the optimization section of the config as follows; however, I am now getting a different error, related to the pickler, shown in the traceback above (a possible workaround sketch follows the config).

{
    "train_micro_batch_size_per_gpu": 8,
    "steps_per_print": 1, 
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.05
        }
    },

    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"
         },
        "offload_param": {
            "device": "cpu"
       }
   }
}
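
As an aside on the pickling error above, a hedged, unverified sketch of a possible workaround: torch.save(graph, path) pickles the entire module object, which now carries DeepSpeed's nvtx-instrumented bound methods that cannot be pickled, so saving only tensors/state tends to sidestep this. graph, path, and model_engine are the names used in gdas.py; everything else is illustrative.

    import torch

    # save only the weights rather than the (unpicklable) wrapped module object
    torch.save(graph.state_dict(), path)

    # or let the DeepSpeed engine write a directory-based checkpoint instead
    model_engine.save_checkpoint("checkpoints", tag="latest")
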
tjruwase commented 2 years ago

@buttercutter, can you please try stage == 1? We can return to this stage 3 issue later.

buttercutter commented 2 years ago

@tjruwase "stage": 1 resulted in the following out-of-memory issue

[2022-09-09 15:45:39,534] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.891309503059055, CurrSamplesPerSec=12.814082860076859, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,142] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.87964277609462, CurrSamplesPerSec=13.148494490509256, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,755] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.867808169940616, CurrSamplesPerSec=13.059329395683946, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 955, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 780, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 553, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 439, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 317, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 274, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 164, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 149, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6eb301421b63:01205] *** Process received signal ***
[6eb301421b63:01205] Signal: Segmentation fault (11)
[6eb301421b63:01205] Signal code: Address not mapped (1)
[6eb301421b63:01205] Failing at address: 0x7f556779a20d
[6eb301421b63:01205] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f556a647980]
[6eb301421b63:01205] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f556a286775]
[6eb301421b63:01205] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f556aaf1e44]
[6eb301421b63:01205] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f556a287605]
[6eb301421b63:01205] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f556aaefcb3]
[6eb301421b63:01205] *** End of error message ***
tjruwase commented 2 years ago

@buttercutter, got it. This out-of-memory with stage 1 is quite strange because the model seems to train for a while, completing multiple iterations, before failing. The compute and memory pattern of model training is typically regular, so any out-of-memory issues should show up within one or two iterations. Is it possible that you have a memory leak? Can you please use the DeepSpeed memory profiler as discussed here? Thanks!

buttercutter commented 2 years ago

@tjruwase From the code link you provided just above,

  1. What is forward_step_func? I suppose it is not simply the model wrapper, since it feeds in args.curriculum_learning?
  2. Do I actually need to use from megatron import get_timers? Since this is a timer, does it actually log the memory profile for each backpropagation path?
tjruwase commented 2 years ago
  1. forward_step_func is just the forward pass call. Actually, it could be any function whose effect on memory you are interested in.
  2. No, you don't need the timers import from megatron. Sorry, I just noticed that see_memory_usage() is commented out in that code snippet. [screenshot]

Let's say you had a function foo; we can profile its memory usage as below:

    from deepspeed.runtime.utils import see_memory_usage

    see_memory_usage(f'memory usage before foo', force=True)
    foo(...)
    see_memory_usage(f'memory usage after foo', force=True)
buttercutter commented 2 years ago
[phung@archlinux gdas]$ git status --short
 M ds_config.json
 M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf

 # import numpy as np

+VISUALIZER = 0
+DEBUG = 0
+
 # deepspeed zero offload https://www.deepspeed.ai/getting-started/
 # https://github.com/microsoft/DeepSpeed/issues/2029
 USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
     import argparse
     import deepspeed

-VISUALIZER = 0
-DEBUG = 0
+    from deepspeed.runtime.utils import see_memory_usage
+
 logdir = 'runs/gdas_experiment_1'

 if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
     while not_converged:
         print("run_num = ", run_num)

+        see_memory_usage(f'memory usage before train_NN()', force=True)
         ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+        see_memory_usage(f'memory usage after train_NN()', force=True)
+
         print("Finished train_NN()")

         if VISUALIZER or DEBUG:
[phung@archlinux gdas]$ 
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
 {
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1, 
- "optimizer": {
+    "train_micro_batch_size_per_gpu": 8,
+    "steps_per_print": 1, 
+    "optimizer": {
         "type": "AdamW",
         "params": {
             "lr": 0.05
         }
-    }
+    },
+
+    "zero_optimization": {
+        "stage": 1,
+        "contiguous_gradients": true,
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_prefetch_bucket_size": 1e7,
+        "stage3_param_persistence_threshold": 1e5,
+        "reduce_bucket_size": 1e7,
+        "sub_group_size": 1e9,
+        "offload_optimizer": {
+            "device": "cpu"
+         },
+        "offload_param": {
+            "device": "cpu"
+       }
+   }
 }
[phung@archlinux gdas]$ 

@tjruwase With the above modifications to the gdas.py training code and the DeepSpeed JSON config file, I get the terminal output below; the full log is attached as [gdas.ipynb.zip](https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip). Does this log give any hints about the location of the memory leak, if there is one?

Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)] 
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}

run_num =  0
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.04 GB         Max_CA 0 GB 
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]

[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.19 GB, percent = 33.0%
Finished train_NN()
buttercutter commented 2 years ago

@tjruwase Sorry, I had put see_memory_usage() around the wrong function call in the previous message.
It should be train_architecture() instead of train_NN().

[2022-09-10 03:20:34,729] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=22.113168856352097, CurrSamplesPerSec=13.503006889446912, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,340] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,340] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=22.100969229749683, CurrSamplesPerSec=13.097596801095447, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,937] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,938] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=22.089474650856374, CurrSamplesPerSec=13.399278091939252, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:36,715] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 03:20:36,716] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.1 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,716] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Finished train_NN()
[2022-09-10 03:20:36,847] [INFO] [utils.py:827:see_memory_usage] memory usage before train_architecture()
[2022-09-10 03:20:36,847] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB         Max_MA 0.04 GB         CA 0.1 GB         Max_CA 0 GB 
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[a31af820d2a8:01199] *** Process received signal ***
[a31af820d2a8:01199] Signal: Segmentation fault (11)
[a31af820d2a8:01199] Signal code: Address not mapped (1)
[a31af820d2a8:01199] Failing at address: 0x7f3db597220d
[a31af820d2a8:01199] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f3db8416980]
[a31af820d2a8:01199] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f3db8055775]
[a31af820d2a8:01199] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f3db88c0e44]
[a31af820d2a8:01199] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f3db8056605]
[a31af820d2a8:01199] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f3db88becb3]
[a31af820d2a8:01199] *** End of error message ***
tjruwase commented 2 years ago

Unfortunately, that is not very useful. Can you please add "memory_breakdown": true into your json config?

buttercutter commented 2 years ago

Attached: gdas.ipynb.zip

@tjruwase See the last cell of the attached ipynb file, which contains a very long log produced with "memory_breakdown": true.

buttercutter commented 2 years ago

@tjruwase If you scroll to the very bottom of the ipynb file, the log output seems to imply that the gumbel function might have contributed to the issue, although I am not quite sure how this is possible. Is there a way to work around this?

buttercutter commented 2 years ago

@tjruwase Should I understand this to mean that DeepSpeed is unable to further optimize the gumbel function?

    # self-defined initial NAS architecture, for supernet architecture edge weight training
    def forward_edge(self, x):
        self.__freeze_f()
        self.__unfreeze_w()

        # Refer to GDAS equations (5) and (6)
        # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
        # It's not required, but you don't know, which index is one hot encoded 1.
        # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
        # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163

        gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
        chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

        return chosen_edge
tjruwase commented 2 years ago

@buttercutter, I don't understand what gumbel is doing. So, can you please clarify your expectations for DeepSpeed in this case, and what you mean by "further optimize"? Thanks!

buttercutter commented 2 years ago

@tjruwase Sorry for the ambiguous wording; I meant memory offloading to the CPU rather than "further optimize".

Note: this is not a corporate project, just a homebrew project, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and if the scenario requires it, I might need some light guidance from you in debugging the interface between my code and DeepSpeed, or beyond.

[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
  File "gdas.py", line 962, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 783, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 556, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 442, in forward
    self.nodes[n].forward(x, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 320, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 277, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 167, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 
buttercutter commented 2 years ago

@tjruwase In my understanding, gumbel-softmax allows gradients to backpropagate through a discrete sampling step.

See the difference between equations (3) and (5)

[screenshot of GDAS equations (3) and (5)]
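
A minimal, self-contained illustration (assumed shapes, temperature fixed at 1.0) of the straight-through trick that F.gumbel_softmax(..., hard=True) applies internally: the forward value is the hard one-hot sample, while gradients flow through the soft distribution.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(5, requires_grad=True)
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits)))
    y_soft = F.softmax((logits + gumbel_noise) / 1.0, dim=-1)         # differentiable
    y_hard = F.one_hot(y_soft.argmax(dim=-1), num_classes=5).float()  # discrete, no gradient
    ret = y_hard - y_soft.detach() + y_soft  # forward value == y_hard, backward goes through y_soft
    ret.sum().backward()                     # logits.grad is populated despite the hard sample
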

tjruwase commented 2 years ago

@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a train_micro_batch_size_per_gpu of 1 to test this? Also, assuming you are now comfortable with using see_memory_usage(), can you add it just before the gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True) call?

tjruwase commented 2 years ago

@buttercutter, if the smaller micro batch size works, you might try activation checkpointing to address this. Here are some docs (a minimal PyTorch sketch follows the links):

  1. PyTorch: https://pytorch.org/docs/stable/checkpoint.html
  2. DeepSpeed: https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html
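
As a generic, hedged sketch of the PyTorch variant (not tied to gdas.py): activations inside the wrapped block are not stored during the forward pass and are recomputed during backward.

    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
    x = torch.randn(4, 128, requires_grad=True)

    y = checkpoint(block, x)  # forward pass without keeping intermediate activations
    y.sum().backward()        # block's forward is re-run here to rebuild them
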
buttercutter commented 2 years ago

@tjruwase If I use a train_micro_batch_size_per_gpu of 1, I get the following error, which was spotted, reported, and resolved earlier in a previous GitHub issue:

Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 690, in train_NN
    Ltrain = criterion(NN_output, NN_train_labels)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
tjruwase commented 2 years ago

@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in the ds_config. In the previous case, it made sense to modify the ds_config value of 1 to match the command-line value of 8. But now, since we are trying to profile the memory behavior without encountering out-of-memory errors, it makes sense to modify the command-line value to 1 to match the ds_config. Does that make sense?

buttercutter commented 2 years ago

@tjruwase OK, I modified the batch_size in both the ds_config and the client script to 1 for testing purposes. So far, the GPU memory usage stays around 4GB.

However, I am confused: why would the batch_size need to be reduced to solve the CUDA out-of-memory error, when the primary purpose of the DeepSpeed package is to offload memory usage from the GPU to the CPU?

tjruwase commented 2 years ago

@buttercutter, thanks for the update.

No, we are not solving the problem by reducing the batch_size; rather, we are trying to confirm whether the memory bloat is one that ZeRO is designed to solve. There are two major sources of memory consumption in model training:

  1. Model state: parameters, gradients, optimizer state
  2. Activations due to inputs

ZeRO is designed to solve (1) but not (2). For (2), you need to use activation checkpointing, as in the links I shared earlier. Here are some next steps that could provide more insight (a sketch of step 2 follows the list):

  1. Measure impact of batch size (activations) on memory by trying batch size 2 and 4
  2. Use see_memory_usage() to track memory utilization at vital points, such as before/after forward and before/after backward for the batch sizes that don't cause out-of-memory errors.
  3. Enable activation checkpointing to see the impact
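
A sketch of step 2 above, reusing the model_engine name from this thread; inputs, labels, and criterion are placeholders for the corresponding objects in gdas.py.

    from deepspeed.runtime.utils import see_memory_usage

    see_memory_usage('before forward', force=True)
    outputs = model_engine(inputs)    # forward pass
    loss = criterion(outputs, labels)
    see_memory_usage('after forward', force=True)

    see_memory_usage('before backward', force=True)
    model_engine.backward(loss)       # DeepSpeed engine's backward
    see_memory_usage('after backward', force=True)

    model_engine.step()
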
buttercutter commented 2 years ago

@tjruwase Regarding using checkpointing to trade compute for memory, should I use only the DeepSpeed version?

And how would I perform checkpointing.configure() for this particular gumbel-softmax call: gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)?

As for the DeepSpeed config JSON file, how should I properly modify the following params?

 "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
    }
tjruwase commented 2 years ago

@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved.

  1. Wrap the forward function of the client modeling code with deepspeed's checkpoint() as done in this file. In your case, I guess you want to wrap a caller of this.
  2. Configure the deepspeed config json by trying out different settings. I can help with that later.

However, I think activation checkpointing is the last of the 3 steps that I proposed. In particular, how to effectively use it will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement, so I suggest first testing out the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
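
For part 1, a hedged sketch of what wrapping a sub-module with DeepSpeed's checkpoint API could look like; it assumes deepspeed.initialize(...) has already run in the script (as in gdas.py), and cell and x stand in for whichever unit of the graph gets checkpointed.

    import deepspeed

    def run_cell(inp):
        # forward of the chosen sub-module; its activations are recomputed during backward
        return cell.forward(inp)

    y = deepspeed.checkpointing.checkpoint(run_cell, x)
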

buttercutter commented 2 years ago

@tjruwase I used a batch_size of 4, and it also resulted in memory usage of around 4GB, measured with see_memory_usage() around the gumbel-softmax function. What do you suggest in this particular case before proceeding to activation checkpointing?

[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB         Max_MA 0.61 GB         CA 0.68 GB         Max_CA 1 GB 
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.44 GB, percent = 35.0%
buttercutter commented 2 years ago

@tjruwase If I stick with a batch_size of 4, the training process so far runs smoothly without the CUDA out-of-memory issue, but of course the training time is much longer compared to a batch_size of 8, 16, or 32. So I suppose that is an incentive to use activation checkpointing in this particular case? Could you advise on this?

buttercutter commented 2 years ago

@tjruwase It is a bit strange that see_memory_usage() reported the same memory usage both before and after the gumbel_softmax() call.

What do you suggest then ? See the following:

File "gdas.py", line 152, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) 
tjruwase commented 2 years ago

@buttercutter, yes, it is quite mysterious to get the same memory usage before and after gumbel_softmax. This suggests that something (e.g., a kernel) is allocating memory outside the view of the CUDA memory allocator. If you are willing to be adventurous, I suggest finer-grained memory profiling by wrapping the following line with see_memory_usage:


  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
buttercutter commented 2 years ago

@tjruwase see the following log which seems not that useful for further debugging


[2022-11-09 04:45:52,214] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,215] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,215] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,345] [INFO] [utils.py:827:see_memory_usage] memory usage after variable ret
[2022-11-09 04:45:52,346] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,346] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,484] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,485] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB         Max_MA 12.34 GB         CA 13.71 GB         Max_CA 14 GB 
[2022-11-09 04:45:52,485] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.63 GB, percent = 36.5%
Traceback (most recent call last):
  File "gdas.py", line 960, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 785, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 558, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 449, in forward
    self.nodes[n].forward(x2, node_num=n, types=types)  # Ltrain(w±, alpha)
  File "gdas.py", line 322, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 279, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 169, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 154, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1899, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 15.75 MiB free; 13.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[3a98c6920156:00675] *** Process received signal ***
[3a98c6920156:00675] Signal: Segmentation fault (11)
[3a98c6920156:00675] Signal code: Address not mapped (1)
[3a98c6920156:00675] Failing at address: 0x7f7953ad920d
[3a98c6920156:00675] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f7956783980]
[3a98c6920156:00675] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f79563c2775]
[3a98c6920156:00675] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f7956c2de44]
[3a98c6920156:00675] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f79563c3605]
[3a98c6920156:00675] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f7956c2bcb3]
[3a98c6920156:00675] *** End of error message ***
tjruwase commented 2 years ago

The log snippet suggests that this code passes the 1st time but fails the 2nd time. Is that correct?

buttercutter commented 2 years ago

@tjruwase Not correct; the log printout from see_memory_usage() actually happened for both the forward pass and the backward pass.

May I ask how I should modify the debug printout flow so that see_memory_usage() logs output ONLY for the backward pass?

tjruwase commented 2 years ago

I see, thanks for the clarification. I thought the 2nd was also a forward pass because of the multiple forward* references in the stack trace, as well as the softmax computation. So, does the backward pass also include a forward pass that causes this stack trace? I am trying to build a mental model of the execution flow, as the model is behaving differently from my experience.

You don't need to restrict see_memory_usage to the backward pass only. Also, can you share the equivalent log snippet for the smaller micro-batch size that runs without OOM?

tjruwase commented 2 years ago

Also, can you please share the memory usage for batch sizes 1, 2, and 4, so we can understand any pattern from increasing batch size? Thanks!

buttercutter commented 2 years ago

You are right that those 2 see_memory_usage() calls in the log snippet above (which uses a batch size of 256) were executed during the forward pass.

The reason I asked for see_memory_usage() to run only in the backward pass is that the logging takes a huge toll on the overall running time of the forward pass, and I am using limited compute units inside a Google Colab session, so extra runtime costs money.

Could you advise on limiting see_memory_usage() to execute only inside the backward pass? Even a smaller batch size of 2 or 4 will only worsen this extremely long runtime when see_memory_usage() is enabled for every forward pass.

tjruwase commented 2 years ago

You can call see_memory_usage() with force set to True only for the backward pass.

[screenshot]
buttercutter commented 2 years ago

I googled, but there seems to be no specific built-in PyTorch variable that can differentiate between the forward and backward pass during runtime execution?

Please correct me if I missed anything.

tjruwase commented 2 years ago

Can you define your own global variable that you set/clear around the backward call?
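
One way the suggestion above could look (illustrative names only; IN_BACKWARD is not an existing flag, and model_engine and loss are the usual training-loop objects): a module-level flag gates the force argument, so see_memory_usage() prints only around the backward call.

    from deepspeed.runtime.utils import see_memory_usage

    IN_BACKWARD = False  # module-level flag, off by default

    def log_mem(msg):
        # see_memory_usage() returns immediately when force is False,
        # so this prints only while the flag is set
        see_memory_usage(msg, force=IN_BACKWARD)

    # around the backward call in the training loop:
    IN_BACKWARD = True
    log_mem('before backward')
    model_engine.backward(loss)
    log_mem('after backward')
    IN_BACKWARD = False
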

buttercutter commented 2 years ago

I followed your suggestion of using a global variable; see the log below.

Here are the exact source files used to obtain the following log snippet:

[2022-11-11 08:40:17,899] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:17,899] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,036] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:18,037] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,037] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,049] [INFO] [logging.py:68:log_dist] [Rank 0] step=2498, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:18,050] [INFO] [timer.py:207:stop] 0/2498, RunningAvgSamplesPerSec=3.066608367950589, CurrSamplesPerSec=2.2201241199743493, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:18,191] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:18,191] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,191] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,328] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:18,329] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,330] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,554] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:18,555] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,555] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,703] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:18,704] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:18,704] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:19,893] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:19,893] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:19,894] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,024] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:20,025] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,025] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,036] [INFO] [logging.py:68:log_dist] [Rank 0] step=2499, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:20,037] [INFO] [timer.py:207:stop] 0/2499, RunningAvgSamplesPerSec=3.066099503151551, CurrSamplesPerSec=2.168111862121388, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:20,174] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:20,175] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,175] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,308] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:20,309] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,309] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,531] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:20,532] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,532] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,666] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:20,667] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:20,667] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,834] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:21,835] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.08 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:21,835] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,971] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:21,972] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:21,972] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=2500, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:21,983] [INFO] [timer.py:207:stop] 0/2500, RunningAvgSamplesPerSec=3.0656260018592953, CurrSamplesPerSec=2.2124665882548116, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:22,127] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:22,128] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB         Max_MA 0.02 GB         CA 0.08 GB         Max_CA 0 GB 
[2022-11-11 08:40:22,128] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 4.07 GB, percent = 32.1%
Finished train_NN()
Traceback (most recent call last):
  File "gdas.py", line 963, in <module>
    lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
  File "gdas.py", line 788, in train_architecture
    graph.forward(val_inputs, types="edge")  # Lval(w*, alpha)
  File "gdas.py", line 559, in forward
    self.cells[c].forward(x, x1, x2, c, types=types)
  File "gdas.py", line 449, in forward
    self.nodes[n].forward(x1, node_num=n, types=types) + \
  File "gdas.py", line 323, in forward
    y = self.connections[cc].forward(x, types)
  File "gdas.py", line 280, in forward
    edges_results = edges_results + self.edges[e].forward(x, types)
  File "gdas.py", line 170, in forward
    y_hat = self.forward_edge(x)
  File "gdas.py", line 155, in forward_edge
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1900, in gumbel_softmax
    ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.51 GiB already allocated; 3.75 MiB free; 13.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6b0e6168b82c:01203] *** Process received signal ***
[6b0e6168b82c:01203] Signal: Segmentation fault (11)
[6b0e6168b82c:01203] Signal code: Address not mapped (1)
[6b0e6168b82c:01203] Failing at address: 0x7fbd6e68520d
[6b0e6168b82c:01203] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fbd71530980]
[6b0e6168b82c:01203] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7fbd7116f775]
[6b0e6168b82c:01203] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7fbd719dae44]
[6b0e6168b82c:01203] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7fbd71170605]
[6b0e6168b82c:01203] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7fbd719d8cb3]
[6b0e6168b82c:01203] *** End of error message ***
tjruwase commented 2 years ago

Thanks for sharing the log. I have a few questions:

  1. What is the batch size?
  2. It seems the training ran smoothly for 2500 steps without OOM. Is this correct? This is surprising because, in my experience, OOMs occur within the first 2 or 3 steps, since the memory usage of DL training is quite stable.
  3. What is the meaning of this message that appears before the stack trace: Finished train_NN()
  4. Can you please add the following lines before each iteration? I am curious whether this is memory fragmentation. (A fuller sketch with the needed imports follows below.)
        gc.collect()
        torch.cuda.empty_cache()
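
A slightly fuller sketch of item 4 with the needed imports, plus the allocator hint that the OOM message itself suggests; the value 128 is only an example, and the environment variable must be set before the first CUDA allocation.

    import gc
    import os
    import torch

    # allocator hint from the OOM message; only takes effect if set before CUDA is initialized
    os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

    def start_of_iteration():
        """Call at the top of every training iteration."""
        gc.collect()               # drop unreachable Python objects that still hold tensors
        torch.cuda.empty_cache()   # return cached, unused blocks to the CUDA driver
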
buttercutter commented 2 years ago

I have attached all the relevant source files here

  1. batch size = 4

  2. Yes, ran smoothly for train_NN() before train_architecture(). Remember this is a bi-level optimization (See algorithm 1).

  3. Do we need to look at the log snippet before Finished train_NN(), given that the issue is inside train_architecture(), which is executed after Finished train_NN()?

  4. Let me do as you suggested. Might I ask if this is to be put inside /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py ? Is it to be put before or after ret = y_hard - y_soft.detach() + y_soft ?

buttercutter commented 2 years ago

@tjruwase

If you check the modified gdas.py with your suggestion applied, together with the CUDA OOM error log, can I infer that it is not due to memory fragmentation, as you had suggested earlier?

Please advise.

tjruwase commented 2 years ago

Thanks for sharing the code and log. I agree that this is probably not a memory fragmentation issue. Also, you can remove the logging in train_NN() as that runs fine and the failure is in train_architecture().

Also, I think the following log that you collected from gumbel_softmax() might have the explanation; sorry for not realizing this earlier

[screenshot of the see_memory_usage() log from around gumbel_softmax()]

Let's also combine with the OOM message:

[screenshot of the OOM message]

There are two things to notice here:

  1. The Max_CA is 14GB before the failing allocation.
  2. The total GPU memory available to the process is 14.76GB, not the total 16GB.

This is already an OOM condition that was lucky not to have failed earlier, because any attempt to allocate even a small amount of intermediate memory, like 20MB in this case is sufficient to cause a failure.

I don't yet see the cause of this high memory usage. But it seems to me that the memory profile of train_NN() is very different from that of train_architecture(); do you agree?

buttercutter commented 2 years ago

I cannot comment on the memory pattern of train_architecture() because its execution was terminated by the OOM in its first epoch.

train_NN() itself already consumed about 14GB of GPU memory, which is strange. Is it not the purpose of DeepSpeed to make sure that no function's execution path consumes this much GPU memory?

Please suggest any possible solution in this particular case.

tjruwase commented 2 years ago

I think you misunderstood my point, so let me restate:

  1. The memory log contains 4 numbers: memory consumed by the model is labelled MA, while memory held by the PyTorch caching allocator is Max_CA. Please see here for further discussion of CUDA memory management.
  2. train_NN() consumes just 0.2GB; see the screenshot below of the log at the end of train_NN(). [screenshot]
  3. The model memory consumption in train_architecture() just before the crash is 12.34GB. Since this is happening during the forward pass, it indicates activation memory consumption. [screenshot]
  4. As I said earlier, there are two major sources of memory consumption in model training. In this case, activation memory consumption seems to be the problem, and ZeRO is not designed for that. So, you need activation checkpointing. This is why it is important to understand the memory pattern of the execution.

Hope that is helpful.

buttercutter commented 2 years ago

In this case then, I will proceed with the use of activation checkpointing.

It seems that you are right about the difficulty of implementing a custom forward function with activation checkpointing, given that my current model architecture is based on a network architecture search approach, where there are multiple parallel edges (different types of NN operations) between two computation nodes.

How would I start in this case ?

buttercutter commented 2 years ago

@tjruwase Could you guide me on how to properly wrap the forward function caller with the DeepSpeed activation checkpointing code?

buttercutter commented 1 year ago

@tjruwase

Since a custom forward function with activation checkpointing is a bit difficult to implement in my code, would the gradient accumulation trick help instead?

This JAX code is modified from the original Hugging Face version:

        if len(accumulated_gradients) < training_args.gradient_accumulation_steps:
            accumulated_gradients.append(grad)
            new_state = state
        else:
            grad = jax.tree_multimap(lambda *x: jnp.sum(jnp.stack(x), axis=0), *accumulated_gradients)
            new_state = state.apply_gradients(grads=grad, dropout_rng=new_dropout_rng)
            accumulated_gradients = []

[screenshot]

tjruwase commented 1 year ago

Yes, you can also use gradient accumulation, but it won't get the same throughput as a larger batch size. You should add gradient_accumulation_steps to the ds_config, as discussed here.
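
Illustrative only (4 and 8 below are example values): the corresponding ds_config keys, shown here as a Python dict; the effective batch size per GPU becomes train_micro_batch_size_per_gpu x gradient_accumulation_steps.

    ds_config_update = {
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 8,  # 4 x 8 = 32 samples per optimizer step
    }
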

buttercutter commented 1 year ago

@tjruwase The gradient accumulation trick is going to use up more memory, so it will not solve the OOM issue.

I still need to code the custom forward function for enabling activation checkpointing.

buttercutter commented 1 year ago

@tjruwase

If you look at the following actual forward function that I need to modify or customize for gradient checkpointing, you would notice that it does not really return something similar to your hidden_states, moe_losses tuple.

So, how would I make use of the mpu.checkpoint() function in this particular case?

Is torch.utils.checkpoint alright with checkpointing values other than losses?

     # self-defined initial NAS architecture, for supernet architecture edge weight training
     def forward_edge(self, x):
         self.__freeze_f()
         self.__unfreeze_w()

         # Refer to GDAS equations (5) and (6)
         # if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
         # It's not required, but you don't know, which index is one hot encoded 1.
         # https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
         # See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163

         gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
         chosen_edge = torch.argmax(gumbel, dim=0)  # converts one-hot encoding into integer

         return chosen_edge
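
On the question of values other than losses, a hedged note with an illustrative sketch (edge_op and x are not gdas.py names): torch.utils.checkpoint does not require the wrapped function to return a loss; it re-runs the function during backward and returns whatever tensors it produced, with gradients flowing only through outputs that are differentiable in the inputs. Since chosen_edge comes from argmax (non-differentiable), the block to checkpoint would typically be one that returns feature tensors, e.g. an edge or cell forward.

    import torch
    from torch.utils.checkpoint import checkpoint

    edge_op = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for one NAS edge op
    x = torch.randn(1, 3, 32, 32, requires_grad=True)

    features = checkpoint(edge_op, x)  # arbitrary tensor output, not a loss
    features.mean().backward()         # edge_op's activations are recomputed here
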