Closed: marvingabler closed this issue 2 months ago
You're never clearing the VRAM after your first run... try splitting the accelerate and non-accelerate versions into different functions, and call each independently after doing `torch.cuda.empty_cache()` + `del` on all refs + `gc.collect()`.
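i.e. something like this (a rough sketch; `run_plain` and `run_accelerate` here are hypothetical wrappers around the two training loops):

```python
import gc

import torch


def run_plain():
    ...  # the pure-PyTorch training loop


def run_accelerate():
    ...  # the accelerate training loop


if __name__ == "__main__":
    run_plain()
    # drop all references to the first model/optimizer/tensors inside run_plain,
    # then collect and release cached blocks before the second run
    gc.collect()
    torch.cuda.empty_cache()
    run_accelerate()
```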
Thanks for the quick answer! However, it doesn't make a difference: I get the same behaviour after calling `prepare()` (I wrote the above just for reference). Here is the updated test function:
```python
import gc

import torch


def test_SwinUnet():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        embed_dim = 1024
        grad_checkpointing = False
        print("Testing SwinUnet")
        print("embed_dim", embed_dim)
        print("grad_checkpointing", grad_checkpointing)
        model = (
            Swin(
                embed_dim=embed_dim,
                window_size=18,
                num_heads=16,
                grad_checkpointing=grad_checkpointing,
            )
            .to(device)
            .half()
        )
        x = torch.randn(1, 70, 2, 721, 1440).to(device).half()
        # check memory usage
        print(torch.cuda.memory_allocated(device) / 1024**3)
        y = model(x)
        print(torch.cuda.memory_allocated(device) / 1024**3)
        # reset grad
        model.zero_grad()
        with torch.no_grad():
            y = model(x)
        # check memory usage
        print("without grad")
        print(torch.cuda.memory_allocated(device) / 1024**3)
        print(y.shape)
        # emulate training loop
        optim = torch.optim.Adam(model.parameters(), lr=1e-4)
        print("training loop")
        for i in range(20):
            y = model(x)
            print(torch.cuda.memory_allocated(device) / 1024**3)
            loss = y.sum()
            loss.backward()
            optim.step()
            if i % 10 == 0:
                print(i)
                print(torch.cuda.memory_allocated(device) / 1024**3)
            model.zero_grad()

        del model, x, y, optim
        torch.cuda.empty_cache()
        gc.collect()

        # make training loop with accelerator
        from accelerate import Accelerator

        accelerator = Accelerator(mixed_precision="fp16")
        model = Swin(
            embed_dim=embed_dim,
            window_size=18,
            num_heads=16,
            grad_checkpointing=grad_checkpointing,
        )
        optim = torch.optim.Adam(model.parameters(), lr=1e-4)
        x = torch.randn(1, 70, 2, 721, 1440).half().to(accelerator.device)
        model, optim, x = accelerator.prepare(model, optim, x)
        print("training loop with accelerator")
        for i in range(20):
            y = model(x)
            print(torch.cuda.memory_allocated(device) / 1024**3)
            loss = y.sum()
            accelerator.backward(loss)
            optim.step()
            if i % 10 == 0:
                print(i)
                print(torch.cuda.memory_allocated(device) / 1024**3)
            optim.zero_grad()


if __name__ == "__main__":
    if torch.cuda.is_available():
        torch.cuda.set_device(3)
        device = torch.device("cuda")
        test_SwinUnet()
```
With output:

```
Testing SwinUnet
embed_dim 1024
grad_checkpointing False
1.7816967964172363
46.906301975250244
without grad
1.9483180046081543
torch.Size([1, 70, 721, 1440])
training loop
46.911083698272705
0
5.691694736480713
49.38543939590454
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
10
5.69295597076416
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
training loop with accelerator
69.92961549758911
0
6.064304828643799
68.70936632156372
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
10
6.064639091491699
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
```
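(As an aside, `torch.cuda.memory_allocated` only samples the allocation at the moment of the call; tracking the peak with `torch.cuda.max_memory_allocated` makes the two loops easier to compare. A small helper, for reference:)

```python
import torch


def peak_gib(step_fn, device):
    """Run one step and return the peak GiB allocated during it."""
    torch.cuda.reset_peak_memory_stats(device)
    step_fn()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024**3


# hypothetical usage inside either loop:
#   print(peak_gib(lambda: model(x).sum().backward(), device))
```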
Well there's your issue:

```python
model = (
    Swin(
        embed_dim=embed_dim,
        window_size=18,
        num_heads=16,
        grad_checkpointing=grad_checkpointing,
    )
    .to(device)
    .half()
)
```

HF never converts the full model to half.
And I do not recommend doing training this way (once you go half, you can never go back to full-precision weights). We autocast the gradients into half precision instead.
(I will run this again on Tuesday when I’m back in the office, but that’s my main hunch)
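Roughly, the recommended pattern looks like this (a minimal sketch only, reusing the `Swin` class and input shape from the code above; the model stays in fp32 and Accelerate's fp16 autocast handles the rest):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model = Swin(
    embed_dim=1024, window_size=18, num_heads=16, grad_checkpointing=False
)  # note: no .half()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optim = accelerator.prepare(model, optim)

# fp32 input; the prepared model's forward runs under fp16 autocast
x = torch.randn(1, 70, 2, 721, 1440, device=accelerator.device)
for _ in range(20):
    y = model(x)
    loss = y.sum()
    accelerator.backward(loss)  # scales the loss for fp16 training
    optim.step()
    optim.zero_grad()
```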
I see, makes total sense! Guess that's it, thanks!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

Information

Tasks

no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
With the following code, I observe almost double the VRAM usage when using accelerate vs. pure PyTorch:
My stdout yields:
That is almost twice the VRAM usage. Why is the VRAM inflating? What am I missing?
Expected behavior
Comparable VRAM usage between the two training loops.