huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Accelerator.prepare() doubles VRAM usage #2802

Closed marvingabler closed 2 months ago

marvingabler commented 4 months ago

System Info

- `Accelerate` version: 0.23.0
- Platform: Linux-5.15.0-102-generic-x86_64-with-glibc2.35
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 944.84 GB
- GPU type: NVIDIA H100 80GB HBM3

Reproduction

With the following code, I observe almost double the VRAM usage when using Accelerate compared to pure PyTorch:

import torch

# `Swin` is the author's model class, defined elsewhere.
def test_Swin():
    if torch.cuda.is_available():
        device = torch.device("cuda")

    embd_dim = 1024

    model = Swin(embed_dim=embd_dim, window_size=18, num_heads=16, depth=36).to(device).half()
    x = torch.randn(1, 70, 2, 721, 1440).to(device).half()
    # check memory usage
    print(torch.cuda.memory_allocated(device) / 1024**3)
    y = model(x)
    print(torch.cuda.memory_allocated(device) / 1024**3)
    # reset grad
    model.zero_grad()
    with torch.no_grad():
        y = model(x)
    # check memory usage
    print("without grad")
    print(torch.cuda.memory_allocated(device) / 1024**3)
    print(y.shape)

    # emulate training loop
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    print("training loop without accelerate")
    for i in range(20):
        y = model(x)
        print(torch.cuda.memory_allocated(device) / 1024**3)
        loss = y.sum()
        loss.backward()
        optim.step()
        if i % 10 == 0:
            print(i)
            print(torch.cuda.memory_allocated(device) / 1024**3)
        model.zero_grad()

    # make training loop with accelerator
    from accelerate import Accelerator

    accelerator = Accelerator(mixed_precision="fp16")
    model = Swin(embed_dim=embd_dim, window_size=18, num_heads=16)
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(1, 70, 2, 721, 1440).half().to(accelerator.device)

    model, optim, x = accelerator.prepare(model, optim, x)
    print("training loop with accelerator")
    for i in range(20):
        y = model(x)
        print(torch.cuda.memory_allocated(device) / 1024**3)
        loss = y.sum()
        accelerator.backward(loss)
        optim.step()
        if i % 10 == 0:
            print(i)
            print(torch.cuda.memory_allocated(device) / 1024**3)
        optim.zero_grad()

My stdout yields:

training loop without accelerate
35.66161823272705
0
4.496548175811768
37.577457904815674
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
10
4.496371746063232
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
37.57687854766846
input resolution [90, 180]
training loop with accelerator
69.56421709060669
0
6.065060138702393
71.07300519943237
71.07197523117065
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
10
6.0643181800842285
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607
71.0721983909607

That is almost twice the VRAM usage. Why is the memory inflating? What am I missing?

Expected behavior

Comparable VRAM usage

muellerzr commented 4 months ago

You're never clearing the VRAM after your first run... try splitting the Accelerate and non-Accelerate versions into separate functions and calling each independently, doing torch.cuda.empty_cache() + del of all refs + gc.collect() in between.
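
For reference, a minimal sketch of that structure, using a stand-in torch.nn.Linear in place of the Swin model from the issue (so this is an assumed shape of the script, not the user's actual code):

import gc
import torch
from accelerate import Accelerator

def run_plain(device):
    # stand-in for the plain-PyTorch run: allocate a model and input on the GPU
    model = torch.nn.Linear(4096, 4096).to(device).half()
    x = torch.randn(64, 4096, device=device, dtype=torch.half)
    y = model(x)
    print("after plain run:", torch.cuda.memory_allocated(device) / 1024**3)

def run_accelerate():
    # stand-in for the Accelerate run: fresh model, optimizer, and Accelerator
    accelerator = Accelerator(mixed_precision="fp16")
    model = torch.nn.Linear(4096, 4096)
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    model, optim = accelerator.prepare(model, optim)
    print("after accelerate run:", torch.cuda.memory_allocated(accelerator.device) / 1024**3)

if __name__ == "__main__":
    device = torch.device("cuda")
    run_plain(device)
    # everything allocated inside run_plain() is now out of scope;
    # collect the references, then release the cached CUDA blocks
    gc.collect()
    torch.cuda.empty_cache()
    print("between runs:", torch.cuda.memory_allocated(device) / 1024**3)
    run_accelerate()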

marvingabler commented 4 months ago

Thanks for the quick answer! However, it doesn't make a difference; I see the same behaviour after calling prepare() (I wrote the above just for reference). Here is the updated test func:

import torch

def test_Swin():
    if torch.cuda.is_available():
        device = torch.device("cuda")

    embed_dim = 1024
    grad_checkpointing = False

    print("Testing Swin")
    print("embed_dim", embed_dim)
    print("grad_checkpointing", grad_checkpointing)

    model = (
        Swin(
            embed_dim=embed_dim,
            window_size=18,
            num_heads=16,
            grad_checkpointing=grad_checkpointing,
        )
        .to(device)
        .half()
    )
    x = torch.randn(1, 70, 2, 721, 1440).to(device).half()
    # check memory usage
    print(torch.cuda.memory_allocated(device) / 1024**3)
    y = model(x)
    print(torch.cuda.memory_allocated(device) / 1024**3)
    # reset grad
    model.zero_grad()
    with torch.no_grad():
        y = model(x)
    # check memory usage
    print("without grad")
    print(torch.cuda.memory_allocated(device) / 1024**3)
    print(y.shape)

    # emulate training loop
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    print("training loop")
    for i in range(20):
        y = model(x)
        print(torch.cuda.memory_allocated(device) / 1024**3)
        loss = y.sum()
        loss.backward()
        optim.step()
        if i % 10 == 0:
            print(i)
            print(torch.cuda.memory_allocated(device) / 1024**3)
        model.zero_grad()

    del model, x, y, optim
    torch.cuda.empty_cache()
    import gc
    gc.collect()

    # make training loop with accelerator
    from accelerate import Accelerator

    accelerator = Accelerator(mixed_precision="fp16")
    model = Swin(
        embed_dim=embed_dim,
        window_size=18,
        num_heads=16,
        grad_checkpointing=grad_checkpointing,
    )
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(1, 70, 2, 721, 1440).half().to(accelerator.device)

    model, optim, x = accelerator.prepare(model, optim, x)
    print("training loop with accelerator")
    for i in range(20):
        y = model(x)
        print(torch.cuda.memory_allocated(device) / 1024**3)
        loss = y.sum()
        accelerator.backward(loss)
        optim.step()
        if i % 10 == 0:
            print(i)
            print(torch.cuda.memory_allocated(device) / 1024**3)
        optim.zero_grad()

if __name__ == "__main__":
    if torch.cuda.is_available():
        torch.cuda.set_device(3)
        device = torch.device("cuda")

    test_Swin()

With the output:

Testing SwinUnet
embed_dim 1024
grad_checkpointing False
1.7816967964172363
46.906301975250244
without grad
1.9483180046081543
torch.Size([1, 70, 721, 1440])
training loop
46.911083698272705
0
5.691694736480713
49.38543939590454
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
10
5.69295597076416
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
49.384860038757324
training loop with accelerator
69.92961549758911
0
6.064304828643799
68.70936632156372
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
10
6.064639091491699
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306
68.69727182388306

muellerzr commented 4 months ago

Well there's your issue:

model = (
    Swin(
        embed_dim=embed_dim,
        window_size=18,
        num_heads=16,
        grad_checkpointing=grad_checkpointing,
    )
    .to(device)
    .half()
)

Accelerate never converts the full model to half precision.

And I do not recommend training this way (once you go half, you can never go back to full-precision weights). With mixed precision, Accelerate autocasts the computation into half precision instead, while keeping the model weights in full precision.
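
For illustration, a minimal sketch of the recommended pattern: keep the weights in full precision and let the Accelerator's autocast run the forward pass in fp16. A stand-in torch.nn.Linear is used instead of the Swin model, and a CUDA device is assumed:

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")

model = torch.nn.Linear(1024, 1024)                   # fp32 weights, no .half()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optim = accelerator.prepare(model, optim)

x = torch.randn(8, 1024, device=accelerator.device)   # fp32 input; autocast downcasts the matmul
for _ in range(3):
    y = model(x)                    # forward runs under autocast -> fp16 activations
    loss = y.float().sum()
    accelerator.backward(loss)      # applies loss scaling for fp16 before backward
    optim.step()
    optim.zero_grad()

print(torch.cuda.memory_allocated(accelerator.device) / 1024**3)

This is why the Accelerate run shows more allocated memory than the fully .half()-converted baseline: the fp32 weights and the optimizer's fp32 state stay resident for numerical stability.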

muellerzr commented 4 months ago

(I will run this again on Tuesday when I’m back in the office, but that’s my main hunch)

marvingabler commented 4 months ago

I see, that makes total sense! Guess that's it, thanks!

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.