buttercutter opened this issue 2 years ago
DeepSpeed's memory optimizations are enabled via ZeRO stages >= 1, and most of them are effective in multi-GPU cases. Your log shows that ZeRO is not enabled.
Please see these tutorials.
[2022-09-09 15:01:50,764] [INFO] [logging.py:68:log_dist] [Rank 0] step=1248, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:50,765] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=14.454011142357578, CurrSamplesPerSec=9.851701418075939, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:51,547] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:51,547] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=14.449232505063256, CurrSamplesPerSec=10.233603997501561, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
[2022-09-09 15:01:52,349] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:01:52,349] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=14.444045524931788, CurrSamplesPerSec=9.977597824784409, MemAllocated=0.08GB, MaxMemAllocated=0.08GB
Traceback (most recent call last):
File "gdas.py", line 947, in <module>
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
File "gdas.py", line 737, in train_NN
torch.save(graph, path)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 589, in _save
pickler.dump(obj)
AttributeError: Can't pickle local object 'instrument_w_nvtx.<locals>.wrapped_fn'
[6eb301421b63:00906] *** Process received signal ***
[6eb301421b63:00906] Signal: Segmentation fault (11)
[6eb301421b63:00906] Signal code: Address not mapped (1)
[6eb301421b63:00906] Failing at address: 0x7f4e4f16420d
[6eb301421b63:00906] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f4e52011980]
[6eb301421b63:00906] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f4e51c50775]
[6eb301421b63:00906] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f4e524bbe44]
[6eb301421b63:00906] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f4e51c51605]
[6eb301421b63:00906] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f4e524b9cb3]
[6eb301421b63:00906] *** End of error message ***
I have updated the ZeRO optimization config as follows; however, I am now getting a different error, shown above, related to the pickler.
{
"train_micro_batch_size_per_gpu": 8,
"steps_per_print": 1,
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.05
}
},
"zero_optimization": {
"stage": 3,
"contiguous_gradients": true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 1e7,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "cpu"
},
"offload_param": {
"device": "cpu"
}
}
}
@buttercutter, can you please try stage == 1? We can return to this stage 3 issue later.
@tjruwase "stage": 1
resulted in the following out-of-memory issue
[2022-09-09 15:45:39,534] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.891309503059055, CurrSamplesPerSec=12.814082860076859, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,142] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,142] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.87964277609462, CurrSamplesPerSec=13.148494490509256, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-09 15:45:40,755] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-09 15:45:40,755] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.867808169940616, CurrSamplesPerSec=13.059329395683946, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
Finished train_NN()
Traceback (most recent call last):
File "gdas.py", line 955, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 780, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 553, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 439, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 317, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 274, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 164, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 149, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6eb301421b63:01205] *** Process received signal ***
[6eb301421b63:01205] Signal: Segmentation fault (11)
[6eb301421b63:01205] Signal code: Address not mapped (1)
[6eb301421b63:01205] Failing at address: 0x7f556779a20d
[6eb301421b63:01205] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f556a647980]
[6eb301421b63:01205] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f556a286775]
[6eb301421b63:01205] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f556aaf1e44]
[6eb301421b63:01205] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f556a287605]
[6eb301421b63:01205] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f556aaefcb3]
[6eb301421b63:01205] *** End of error message ***
@buttercutter, got it. This out-of-memory with stage 1 is quite strange because the model seems to train for a while, completing multiple iterations, before failing. The compute and memory pattern of model training is typically regular, so any out-of-memory issue should show up within one or two iterations. Is it possible that you have a memory leak? Can you please use the DeepSpeed memory profiler as discussed here? Thanks!
@tjruwase From the code link you provided just above, what exactly is forward_step_func? I suppose it is not simply the model wrapper, since it also feeds in args.curriculum_learning? And regarding from megatron import get_timers: since this is a timer, does it actually log the memory profile for each backpropagation path?

forward_step_func is just the forward pass call. Actually, it could be any function whose memory impact you are interested in. Note that see_memory_usage() is commented out in that code snippet.
Let's say you had a function foo; we can profile its memory usage like below.
from deepspeed.runtime.utils import see_memory_usage
see_memory_usage(f'memory usage before foo', force=True)
foo(...)
see_memory_usage(f'memory usage after foo', force=True)
[phung@archlinux gdas]$ git status --short
M ds_config.json
M gdas.py
[phung@archlinux gdas]$ git diff gdas.py
diff --git a/gdas.py b/gdas.py
index 10c095c..e6a8b5c 100644
--- a/gdas.py
+++ b/gdas.py
@@ -13,6 +13,9 @@ import tensorflow as tf
# import numpy as np
+VISUALIZER = 0
+DEBUG = 0
+
# deepspeed zero offload https://www.deepspeed.ai/getting-started/
# https://github.com/microsoft/DeepSpeed/issues/2029
USE_DEEPSPEED = 1
@@ -21,8 +24,8 @@ if USE_DEEPSPEED:
import argparse
import deepspeed
-VISUALIZER = 0
-DEBUG = 0
+ from deepspeed.runtime.utils import see_memory_usage
+
logdir = 'runs/gdas_experiment_1'
if VISUALIZER:
@@ -944,7 +947,10 @@ if __name__ == "__main__":
while not_converged:
print("run_num = ", run_num)
+ see_memory_usage(f'memory usage before train_NN()', force=True)
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
+ see_memory_usage(f'memory usage after train_NN()', force=True)
+
print("Finished train_NN()")
if VISUALIZER or DEBUG:
[phung@archlinux gdas]$
[phung@archlinux gdas]$ git diff ds_config.json
diff --git a/ds_config.json b/ds_config.json
index 91943c3..4afc33d 100644
--- a/ds_config.json
+++ b/ds_config.json
@@ -1,10 +1,27 @@
{
- "train_micro_batch_size_per_gpu": 8,
-"steps_per_print": 1,
- "optimizer": {
+ "train_micro_batch_size_per_gpu": 8,
+ "steps_per_print": 1,
+ "optimizer": {
"type": "AdamW",
"params": {
"lr": 0.05
}
- }
+ },
+
+ "zero_optimization": {
+ "stage": 1,
+ "contiguous_gradients": true,
+ "stage3_max_live_parameters": 1e9,
+ "stage3_max_reuse_distance": 1e9,
+ "stage3_prefetch_bucket_size": 1e7,
+ "stage3_param_persistence_threshold": 1e5,
+ "reduce_bucket_size": 1e7,
+ "sub_group_size": 1e9,
+ "offload_optimizer": {
+ "device": "cpu"
+ },
+ "offload_param": {
+ "device": "cpu"
+ }
+ }
}
[phung@archlinux gdas]$
@tjruwase With the above modifications to the gdas.py training source code as well as the DeepSpeed json config file, I get the following terminal output log, which is contained inside gdas.ipynb.zip. May I ask whether the following log gives any hint about the location of the memory leak, if there is one?
Loading extension module utils...
Time to load utils op: 0.338411808013916 seconds
Rank: 0 partition count [1] and sizes[(48330, False)]
[2022-09-10 02:29:09,959] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-09-10 02:29:09,959] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:09,960] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,114] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2022-09-10 02:29:10,114] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,115] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,115] [INFO] [stage_1_and_2.py:516:__init__] optimizer state initialized
[2022-09-10 02:29:10,261] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-09-10 02:29:10,262] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,262] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-09-10 02:29:10,263] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,267] [INFO] [config.py:987:print] DeepSpeedEngine configuration:
[2022-09-10 02:29:10,268] [INFO] [config.py:991:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
run_num = 0
[2022-09-10 02:29:10,422] [INFO] [utils.py:827:see_memory_usage] memory usage before train_NN()
[2022-09-10 02:29:10,423] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.04 GB Max_CA 0 GB
[2022-09-10 02:29:10,423] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 3.94 GB, percent = 31.1%
[2022-09-10 02:29:10,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=1, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,863] [INFO] [logging.py:68:log_dist] [Rank 0] step=2, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [logging.py:68:log_dist] [Rank 0] step=3, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:10,988] [INFO] [timer.py:207:stop] 0/3, RunningAvgSamplesPerSec=64.29764247567839, CurrSamplesPerSec=64.29764247567839, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,106] [INFO] [logging.py:68:log_dist] [Rank 0] step=4, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,107] [INFO] [timer.py:207:stop] 0/4, RunningAvgSamplesPerSec=66.00525611771185, CurrSamplesPerSec=67.80604576253033, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,225] [INFO] [logging.py:68:log_dist] [Rank 0] step=5, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:29:11,225] [INFO] [timer.py:207:stop] 0/5, RunningAvgSamplesPerSec=66.55657751974118, CurrSamplesPerSec=67.68731983531258, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:29:11,355] [INFO] [logging.py:68:log_dist] [Rank 0] step=6, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:54,910] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=21.507489154176568, CurrSamplesPerSec=12.836937339152996, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:55,527] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:55,528] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=21.496126966776316, CurrSamplesPerSec=12.963147028644192, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,185] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 02:36:56,186] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=21.482932504893128, CurrSamplesPerSec=12.168760408265106, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 02:36:56,993] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 02:36:56,993] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.1 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 02:36:56,994] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.19 GB, percent = 33.0%
Finished train_NN()
[gdas.ipynb.zip](https://github.com/microsoft/DeepSpeed/files/9539457/gdas.ipynb.zip)
@tjruwase Sorry, I had put see_memory_usage on the wrong function call in the previous message. It should be on train_architecture() instead of train_NN().
[2022-09-10 03:20:34,729] [INFO] [timer.py:207:stop] 0/1248, RunningAvgSamplesPerSec=22.113168856352097, CurrSamplesPerSec=13.503006889446912, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,340] [INFO] [logging.py:68:log_dist] [Rank 0] step=1249, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,340] [INFO] [timer.py:207:stop] 0/1249, RunningAvgSamplesPerSec=22.100969229749683, CurrSamplesPerSec=13.097596801095447, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:35,937] [INFO] [logging.py:68:log_dist] [Rank 0] step=1250, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-09-10 03:20:35,938] [INFO] [timer.py:207:stop] 0/1250, RunningAvgSamplesPerSec=22.089474650856374, CurrSamplesPerSec=13.399278091939252, MemAllocated=0.04GB, MaxMemAllocated=0.1GB
[2022-09-10 03:20:36,715] [INFO] [utils.py:827:see_memory_usage] memory usage after train_NN()
[2022-09-10 03:20:36,716] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.1 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 03:20:36,716] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Finished train_NN()
[2022-09-10 03:20:36,847] [INFO] [utils.py:827:see_memory_usage] memory usage before train_architecture()
[2022-09-10 03:20:36,847] [INFO] [utils.py:832:see_memory_usage] MA 0.04 GB Max_MA 0.04 GB CA 0.1 GB Max_CA 0 GB
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
File "gdas.py", line 962, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 783, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 556, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 442, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 320, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 277, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 167, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[a31af820d2a8:01199] *** Process received signal ***
[a31af820d2a8:01199] Signal: Segmentation fault (11)
[a31af820d2a8:01199] Signal code: Address not mapped (1)
[a31af820d2a8:01199] Failing at address: 0x7f3db597220d
[a31af820d2a8:01199] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f3db8416980]
[a31af820d2a8:01199] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f3db8055775]
[a31af820d2a8:01199] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f3db88c0e44]
[a31af820d2a8:01199] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f3db8056605]
[a31af820d2a8:01199] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f3db88becb3]
[a31af820d2a8:01199] *** End of error message ***
Unfortunately, that is not very useful. Can you please add "memory_breakdown": true to your json config?
Attached: gdas.ipynb.zip
@tjruwase See the last cell of the above ipynb file, which contains a very long log pertaining to "memory_breakdown": true.
@tjruwase If you scroll to the very bottom of the ipynb file, the log output seems to imply that the gumbel function might have contributed to the issue, although I am not quite sure how this is possible. Is there a way to work around this?
@tjruwase Should I understand this to mean that DeepSpeed is unable to further optimize the gumbel function?
# self-defined initial NAS architecture, for supernet architecture edge weight training
def forward_edge(self, x):
self.__freeze_f()
self.__unfreeze_w()
# Refer to GDAS equations (5) and (6)
# if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
# It's not required, but you don't know, which index is one hot encoded 1.
# https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
# See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
chosen_edge = torch.argmax(gumbel, dim=0) # converts one-hot encoding into integer
return chosen_edge
@buttercutter, I don't understand what gumbel is doing. So, can you please clarify your expectations for DeepSpeed in this case, and what you mean by "further optimize"? Thanks!
@tjruwase Sorry for the ambiguous wording; I meant memory offloading to CPU rather than further optimization.
Note: This is not a corporate project, just a homebrew project, so I am not asking for deep involvement on your side. I just need to find out what is wrong with my own code, and if the situation requires it, I might need some slight guidance from you in debugging the interface between my own code and DeepSpeed, or beyond.
[2022-09-10 03:20:36,847] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.21 GB, percent = 33.2%
Traceback (most recent call last):
File "gdas.py", line 962, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 783, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 556, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 442, in forward
self.nodes[n].forward(x, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 320, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 277, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 167, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@tjruwase In my own understanding, gumbel-softmax allows the gradient to backpropagate across a discrete sampling step.
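A small self-contained illustration of that straight-through behaviour (the logits, temperature, and toy loss here are arbitrary, not taken from gdas.py):
import torch
import torch.nn.functional as F

logits = torch.randn(5, requires_grad=True)
# hard=True returns a one-hot sample in the forward pass, but the backward pass
# flows through the soft sample (ret = y_hard - y_soft.detach() + y_soft),
# so the discrete choice still receives a gradient signal.
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
loss = (one_hot * torch.arange(5.0)).sum()
loss.backward()
print(one_hot, logits.grad)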
@buttercutter, I think the issue is that ZeRO does not optimize the memory consumption of activations. Can you try running with a train_micro_batch_size_per_gpu of 1 to test this? Also, assuming you are now comfortable with using see_memory_usage(), can you add it just before the gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True) call?
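A minimal sketch of what that instrumentation could look like inside forward_edge() (the freeze/unfreeze calls from the snippet shared above are omitted for brevity; TAU_GUMBEL is assumed to be defined as in gdas.py):
import torch
import torch.nn.functional as F
from deepspeed.runtime.utils import see_memory_usage

def forward_edge(self, x):  # method of the edge module in gdas.py
    # force=True makes the report print even when debug logging is off
    see_memory_usage('memory usage before gumbel', force=True)
    gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
    see_memory_usage('memory usage after gumbel', force=True)
    return torch.argmax(gumbel, dim=0)  # one-hot encoding -> integer index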
@buttercutter, you might try activation checkpointing to address this if the smaller micro batch size works. Here are some docs
@tjruwase If I use a train_micro_batch_size_per_gpu of 1, I get the following error, which had been spotted, reported, and resolved earlier in a previous GitHub issue:
Traceback (most recent call last):
File "gdas.py", line 947, in <module>
ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
File "gdas.py", line 690, in train_NN
Ltrain = criterion(NN_output, NN_train_labels)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
label_smoothing=self.label_smoothing)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
@buttercutter, yes, this is not surprising, as it indicates a mismatch between the batch size passed to the client script via the command line and the batch size in ds_config. In the previous case, it made sense to modify the ds_config value of 1 to match the command line value of 8. But now, since we are trying to profile the memory behavior without encountering out-of-memory errors, it makes sense to modify the command line value to 1 to match ds_config. Does that make sense?
@tjruwase OK, I modified the batch size in both ds_config and the client script to 1 for testing purposes. So far, GPU memory usage stays around 4GB.
However, I am confused: why does the batch size need to be reduced to solve the CUDA out-of-memory error, when the primary purpose of the DeepSpeed package is to offload memory usage from the GPU to the CPU?
@buttercutter, thanks for the update.
No, we are not solving the problem by reducing the batch size; rather, we are trying to confirm whether the memory bloat is something that ZeRO is designed to solve. There are two major sources of memory consumption in model training: (1) model states (parameters, gradients, and optimizer states) and (2) activations. ZeRO is designed to solve (1) but not (2). For (2), you need to use activation checkpointing like the links I shared earlier. Here are some next steps that could give more insight, starting with using see_memory_usage() to track memory utilization at vital points, such as before/after forward and before/after backward, for the batch sizes that don't cause out-of-memory errors.
@tjruwase Regarding using checkpointing to trade compute for memory, should I only use the DeepSpeed version? And how would I perform checkpointing.configure() for this particular gumbel-max function: gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)?
As for the DeepSpeed config json file, how shall I properly modify the following params?
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
@buttercutter, yes, I recommend DeepSpeed's activation checkpointing because it supports offloading the activation inputs to CPU memory. In terms of enabling it, there are two parts involved: the activation_checkpointing section of the json config, and wrapping the relevant forward calls with checkpoint() as done in this file. In your case, I guess you want to wrap a caller of this. However, I think activation checkpointing is the last of the 3 steps that I proposed. In particular, how to effectively use it will depend on the findings from the first two steps, so I recommend completing those first. Also, activation checkpointing is not easy to implement, so I suggest first testing out the Megatron-DeepSpeed implementation that I referenced earlier to understand how it works.
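As a rough illustration of the wrapping part (and assuming the deepspeed.checkpointing API; custom_forward and cell below are illustrative placeholders, not names from gdas.py), the pattern looks roughly like this:
import deepspeed

# one-time setup: mpu_ can be None without model parallelism, and the
# activation_checkpointing section of ds_config.json supplies the options
deepspeed.checkpointing.configure(mpu_=None, deepspeed_config='ds_config.json')

def custom_forward(x):
    # the wrapped function takes tensors in and returns tensors out;
    # its intermediate activations are recomputed during the backward pass
    return cell.forward(x)

# in the training loop, instead of calling cell.forward(x) directly:
y = deepspeed.checkpointing.checkpoint(custom_forward, x)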
@tjruwase I used a batch_size of 4, and it also resulted in memory usage of around 4GB, measured with the help of see_memory_usage() around the gumbel-softmax function. What do you suggest in this particular case before proceeding to activation checkpointing?
[2022-09-29 19:47:32,987] [INFO] [utils.py:827:see_memory_usage] memory usage before gumbel
[2022-09-29 19:47:32,988] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:32,988] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
[2022-09-29 19:47:33,142] [INFO] [utils.py:827:see_memory_usage] memory usage after gumbel
[2022-09-29 19:47:33,142] [INFO] [utils.py:832:see_memory_usage] MA 0.61 GB Max_MA 0.61 GB CA 0.68 GB Max_CA 1 GB
[2022-09-29 19:47:33,143] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.44 GB, percent = 35.0%
@tjruwase If I stick with a batch_size of 4, the training process runs smoothly without the CUDA out-of-memory issue so far, but of course the training time is much longer compared to a batch_size of 8, 16, or 32. So I suppose that is an incentive to use activation checkpointing in this particular case? Could you advise on this?
@tjruwase It is a bit strange that see_memory_usage() returned the same memory usage both before and after the gumbel_softmax() call.
What do you suggest then? See the following:
File "gdas.py", line 152, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.50 GiB already allocated; 3.75 MiB free; 13.75 GiB reserved in total by PyTorch)
@buttercutter, yes, it is quite mysterious to get the same memory usage before and after gumbel_softmax. This suggests that something (e.g., a kernel) is allocating memory outside the view of the CUDA memory allocator. If you are willing to be adventurous, I suggest finer-grained memory profiling by wrapping the following line with see_memory_usage:
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1895, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
@tjruwase See the following log, which does not seem very useful for further debugging:
[2022-11-09 04:45:52,214] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,215] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,215] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,345] [INFO] [utils.py:827:see_memory_usage] memory usage after variable ret
[2022-11-09 04:45:52,346] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,346] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
[2022-11-09 04:45:52,484] [INFO] [utils.py:827:see_memory_usage] memory usage before variable ret
[2022-11-09 04:45:52,485] [INFO] [utils.py:832:see_memory_usage] MA 12.34 GB Max_MA 12.34 GB CA 13.71 GB Max_CA 14 GB
[2022-11-09 04:45:52,485] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.63 GB, percent = 36.5%
Traceback (most recent call last):
File "gdas.py", line 960, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 785, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 558, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 449, in forward
self.nodes[n].forward(x2, node_num=n, types=types) # Ltrain(w±, alpha)
File "gdas.py", line 322, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 279, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 169, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 154, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1899, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 15.75 MiB free; 13.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[3a98c6920156:00675] *** Process received signal ***
[3a98c6920156:00675] Signal: Segmentation fault (11)
[3a98c6920156:00675] Signal code: Address not mapped (1)
[3a98c6920156:00675] Failing at address: 0x7f7953ad920d
[3a98c6920156:00675] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f7956783980]
[3a98c6920156:00675] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f79563c2775]
[3a98c6920156:00675] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f7956c2de44]
[3a98c6920156:00675] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f79563c3605]
[3a98c6920156:00675] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f7956c2bcb3]
[3a98c6920156:00675] *** End of error message ***
The log snippet suggests that this code passes the 1st time but fails the 2nd time. Is that correct?
@tjruwase Not correct; the log printouts from see_memory_usage() actually happened for both the forward pass and the backward pass.
May I ask how I should modify the debug printout flow so that it logs output from see_memory_usage() only for the backward pass?
I see, thanks for the clarification. I thought the 2nd was also a forward pass because of the multiple forward* references in the stack trace, as well as the softmax computation. So, does the backward pass also include a forward pass that causes this stack trace? I am trying to get a mental model of the execution flow, as the model is behaving differently from my experience.
You don't need to restrict see_memory_usage to the backward pass only. Also, can you share an equivalent log snippet of the smaller micro-batch size that runs without OOM?
Also, can you please share the memory usage for batch sizes 1, 2, and 4, so we can understand any pattern from increasing the batch size? Thanks!
You are right that those 2 see_memory_usage() calls in the log snippet above (which used a batch size of 256) were executed during the forward pass.
The reason I asked for see_memory_usage() to be only in the backward pass is that the logging takes a very large toll on the overall running time of the forward pass, and I am using limited compute units inside a Google Colab session, so extra runtime costs money in this case.
Could you advise on limiting see_memory_usage() to execute only inside the backward pass, since even a smaller batch size of 2 or 4 will only worsen this extremely long runtime when see_memory_usage() is enabled for every forward pass?
You can call see_memory_usage() while setting force to True only for backward.
I googled, but is there no specific built-in PyTorch variable that can differentiate between the forward and backward pass during runtime execution? Please correct me if I am missing anything.
Can you define your own global variable that you set/clear around the backward call?
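One possible realization of that suggestion, as a sketch; the LOG_MEMORY flag, the helper, and their placement inside gdas.py are assumptions, not part of the original code:
from deepspeed.runtime.utils import see_memory_usage

LOG_MEMORY = False  # global flag: memory reports print only while it is True

# inside forward_edge(), the instrumentation would then become, e.g.:
#   see_memory_usage('memory usage before gumbel', force=LOG_MEMORY)

def backward_with_logging(model_engine, loss):
    # enable the reports only around the backward call, so that logging
    # does not slow down every forward pass
    global LOG_MEMORY
    LOG_MEMORY = True
    model_engine.backward(loss)
    LOG_MEMORY = False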
I followed your suggestion on using a global variable; see the log below.
Here are the exact source files used to obtain the following log snippet.
[2022-11-11 08:40:17,899] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.08 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:17,899] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,036] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:18,037] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:18,037] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,049] [INFO] [logging.py:68:log_dist] [Rank 0] step=2498, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:18,050] [INFO] [timer.py:207:stop] 0/2498, RunningAvgSamplesPerSec=3.066608367950589, CurrSamplesPerSec=2.2201241199743493, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:18,191] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:18,191] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:18,191] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,328] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:18,329] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:18,330] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,554] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:18,555] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:18,555] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:18,703] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:18,704] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:18,704] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:19,893] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:19,893] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.08 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:19,894] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,024] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:20,025] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:20,025] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,036] [INFO] [logging.py:68:log_dist] [Rank 0] step=2499, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:20,037] [INFO] [timer.py:207:stop] 0/2499, RunningAvgSamplesPerSec=3.066099503151551, CurrSamplesPerSec=2.168111862121388, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:20,174] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:20,175] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:20,175] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,308] [INFO] [utils.py:827:see_memory_usage] Engine before forward
[2022-11-11 08:40:20,309] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:20,309] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,531] [INFO] [utils.py:827:see_memory_usage] Engine after forward
[2022-11-11 08:40:20,532] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:20,532] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:20,666] [INFO] [utils.py:827:see_memory_usage] Engine before backward
[2022-11-11 08:40:20,667] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:20,667] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,834] [INFO] [utils.py:827:see_memory_usage] Engine after backward
[2022-11-11 08:40:21,835] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.08 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:21,835] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,971] [INFO] [utils.py:827:see_memory_usage] Engine before step
[2022-11-11 08:40:21,972] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:21,972] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
[2022-11-11 08:40:21,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=2500, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-11-11 08:40:21,983] [INFO] [timer.py:207:stop] 0/2500, RunningAvgSamplesPerSec=3.0656260018592953, CurrSamplesPerSec=2.2124665882548116, MemAllocated=0.02GB, MaxMemAllocated=0.02GB
[2022-11-11 08:40:22,127] [INFO] [utils.py:827:see_memory_usage] Engine after step
[2022-11-11 08:40:22,128] [INFO] [utils.py:832:see_memory_usage] MA 0.02 GB Max_MA 0.02 GB CA 0.08 GB Max_CA 0 GB
[2022-11-11 08:40:22,128] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory: used = 4.07 GB, percent = 32.1%
Finished train_NN()
Traceback (most recent call last):
File "gdas.py", line 963, in <module>
lval = train_architecture(graph=graph_, model_engine=model_engine_, forward_pass_only=0, train_or_val='val')
File "gdas.py", line 788, in train_architecture
graph.forward(val_inputs, types="edge") # Lval(w*, alpha)
File "gdas.py", line 559, in forward
self.cells[c].forward(x, x1, x2, c, types=types)
File "gdas.py", line 449, in forward
self.nodes[n].forward(x1, node_num=n, types=types) + \
File "gdas.py", line 323, in forward
y = self.connections[cc].forward(x, types)
File "gdas.py", line 280, in forward
edges_results = edges_results + self.edges[e].forward(x, types)
File "gdas.py", line 170, in forward
y_hat = self.forward_edge(x)
File "gdas.py", line 155, in forward_edge
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1900, in gumbel_softmax
ret = y_hard - y_soft.detach() + y_soft
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.51 GiB already allocated; 3.75 MiB free; 13.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[6b0e6168b82c:01203] *** Process received signal ***
[6b0e6168b82c:01203] Signal: Segmentation fault (11)
[6b0e6168b82c:01203] Signal code: Address not mapped (1)
[6b0e6168b82c:01203] Failing at address: 0x7fbd6e68520d
[6b0e6168b82c:01203] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fbd71530980]
[6b0e6168b82c:01203] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7fbd7116f775]
[6b0e6168b82c:01203] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7fbd719dae44]
[6b0e6168b82c:01203] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7fbd71170605]
[6b0e6168b82c:01203] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7fbd719d8cb3]
[6b0e6168b82c:01203] *** End of error message ***
Thanks for sharing the log. I have a few questions:
- What batch size produced this log?
- Did everything run cleanly up to Finished train_NN()?
- Can you try adding gc.collect() and torch.cuda.empty_cache() before the failing call (a rough sketch of the placement follows below)?
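A rough sketch of where such a manual cleanup could go, reusing the top-level loop of gdas.py shown earlier in this thread; the exact placement between the two training phases is an assumption:
import gc
import torch

ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
print("Finished train_NN()")

# drop dead Python references and return cached CUDA blocks to the driver
# before the second (architecture) optimization phase begins
gc.collect()
torch.cuda.empty_cache()

lval = train_architecture(graph=graph_, model_engine=model_engine_,
                          forward_pass_only=0, train_or_val='val')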
I have attached all the relevant source files here.
batch size = 4
Yes, it ran smoothly for train_NN() before train_architecture(). Remember this is a bi-level optimization (see Algorithm 1).
Do we need to take care of the log snippet before Finished train_NN(), given that the issue is inside train_architecture(), which is executed after Finished train_NN()?
Let me do as you suggested. Might I ask if this is to be put inside /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py? Is it to be put before or after ret = y_hard - y_soft.detach() + y_soft?
@tjruwase If you check the modified gdas.py with your suggestion against the CUDA OOM error log, can I infer that it is not due to memory fragmentation, as you had suggested earlier?
Please advise.
Thanks for sharing the code and log. I agree that this is probably not a memory fragmentation issue. Also, you can remove the logging in train_NN(), as that runs fine and the failure is in train_architecture().
Also, I think the following log that you collected from gumbel_softmax() might have the explanation; sorry for not realizing this earlier.
Let's also combine this with the OOM message. There are two things to notice here:
1. Max_CA is 14GB before the code runs.
2. This is already an OOM condition that was lucky not to have failed earlier, because any attempt to allocate even a small amount of intermediate memory, like the 20MB in this case, is sufficient to cause a failure.
I don't yet see the cause of this high memory usage, but it seems to me that the memory profile of train_NN() is very different from that of train_architecture(). Do you agree?
I cannot comment on the memory pattern of train_architecture() because its execution path was terminated by the OOM in its first epoch.
It is strange that train_NN() itself already consumed about 14GB of GPU memory. Is it not the purpose of DeepSpeed to make sure that no function execution path consumes this much GPU memory?
Please suggest any possible solution in this particular case.
I think you misunderstood my point, so let me restate:
1. The memory actually allocated to tensors is reported as MA, while the memory consumed by the PyTorch cache allocator is Max_CA. Please see here for further discussion of CUDA memory management.
2. train_NN() consumes just 0.2GB; see the screenshot below of the log at the end of train_NN().
3. The memory consumption of train_architecture() just before the crash is 12.34GB. Since this is happening during forward, it indicates activation memory consumption.
Hope that is helpful.
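For reference, the MA/CA columns printed by see_memory_usage() correspond to the standard torch.cuda allocator counters, roughly as in this sketch:
import torch

# MA / Max_MA: memory currently / at peak allocated to live tensors
ma = torch.cuda.memory_allocated() / 2**30
max_ma = torch.cuda.max_memory_allocated() / 2**30
# CA / Max_CA: memory currently / at peak held by the caching allocator
ca = torch.cuda.memory_reserved() / 2**30
max_ca = torch.cuda.max_memory_reserved() / 2**30
print(f"MA {ma:.2f} GB Max_MA {max_ma:.2f} GB CA {ca:.2f} GB Max_CA {max_ca:.2f} GB")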
In that case, I will proceed with activation checkpointing.
It seems that you are right about the difficulty of implementing a custom forward function with activation checkpointing, given that my current model architecture is based on a network architecture search approach, where there are multiple parallel edges (different types of NN operations) between two computation nodes.
How would I start in this case?
@tjruwase Could you guide me on how to properly wrap the forward function caller with the DeepSpeed activation checkpointing code?
@tjruwase
Since a custom forward function with activation checkpointing is a bit difficult to implement in my code, would the gradient accumulation trick help instead?
This jax code is modified from the original huggingface version:
if len(accumulated_gradients) < training_args.gradient_accumulation_steps:
    accumulated_gradients.append(grad)
    new_state = state
else:
    grad = jax.tree_multimap(lambda *x: jnp.sum(jnp.stack(x), axis=0), *accumulated_gradients)
    new_state = state.apply_gradients(grads=grad, dropout_rng=new_dropout_rng)
    accumulated_gradients = []
Yes, you can also use gradient accumulation, but it won't get the same throughput as a larger batch size. You should add gradient_accumulation_steps to ds_config, as discussed here.
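As a sketch, the addition to ds_config.json could look like the following (written here as a Python dict; the values are illustrative, giving an effective per-GPU batch size of 1 x 4 = 4 while activations are only held for one micro-batch at a time):
# illustrative ds_config contents with gradient accumulation enabled
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "steps_per_print": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 0.05}},
    "zero_optimization": {"stage": 1},
}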
@tjruwase The gradient accumulation trick is going to use up more memory, so it will not solve the OOM issue.
It seems I still need to code the custom forward function to enable activation checkpointing.
@tjruwase
If you look at the following actual forward function that I need to modify or customize for gradient checkpointing, you will notice that it does not really return something similar to your hidden_states, moe_losses tuple.
So, how would I make use of the mpu.checkpoint() function in this particular case?
Is torch.utils.checkpoint alright with checkpointing values other than losses?
# self-defined initial NAS architecture, for supernet architecture edge weight training
def forward_edge(self, x):
self.__freeze_f()
self.__unfreeze_w()
# Refer to GDAS equations (5) and (6)
# if one_hot is already there, would summation be required given that all other entries are forced to 0 ?
# It's not required, but you don't know, which index is one hot encoded 1.
# https://pytorch.org/docs/stable/nn.functional.html#gumbel-softmax
# See also https://github.com/D-X-Y/AutoDL-Projects/issues/10#issuecomment-916619163
gumbel = F.gumbel_softmax(x, tau=TAU_GUMBEL, hard=True)
chosen_edge = torch.argmax(gumbel, dim=0) # converts one-hot encoding into integer
return chosen_edge
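For what it is worth, torch.utils.checkpoint is not limited to loss values: it wraps any function whose inputs and outputs are tensors, storing only the inputs and recomputing the intermediate activations during backward. A minimal self-contained sketch follows (edge_forward and the shapes are illustrative, not from gdas.py); note that the argmax output above carries no gradient, so checkpointing only pays off for the differentiable, activation-heavy part of the edge computation:
import torch
from torch.utils.checkpoint import checkpoint

def edge_forward(x):
    # stand-in for an activation-heavy, differentiable edge computation
    return torch.relu(x @ x.t())

x = torch.randn(64, 64, requires_grad=True)
y = checkpoint(edge_forward, x)  # activations recomputed during backward
y.sum().backward()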
May I know why this training code still gives a CUDA out-of-memory issue even after DeepSpeed is turned on?
See this for historical tracking purposes.