mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Fix dmoe tests GPU OOM #1216

Closed snarayan21 closed 4 months ago

snarayan21 commented 4 months ago

GPU memory does not seem to be freed between tests; specifically, the torch dmoe tests are causing GPU OOM in private. Calling torch.cuda.empty_cache() does not help, since this appears to be non-releasable memory (see below), so apparently some object(s) that take up a lot of memory are staying alive between tests. I tried to dig into the memory leak but decided to just cut out some extraneous test cases. This PR brings public foundry in line with the changes in private here

Result of calling print(torch.cuda.memory_summary()) right before the last dmoe test:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Active memory         | 314368 B   | 320304 KiB |   6901 MiB |   6900 MiB |
|       from large pool |      0 B   | 315536 KiB |   6149 MiB |   6149 MiB |
|       from small pool | 314368 B   |  15636 KiB |    751 MiB |    751 MiB |
|---------------------------------------------------------------------------|
| Requested memory      | 255744 B   | 320295 KiB |   6884 MiB |   6884 MiB |
|       from large pool |      0 B   | 315536 KiB |   6137 MiB |   6137 MiB |
|       from small pool | 255744 B   |  15564 KiB |    747 MiB |    746 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   4096 KiB | 542720 KiB |   3188 MiB |   3184 MiB |
|       from large pool |      0 KiB | 536576 KiB |   3000 MiB |   3000 MiB |
|       from small pool |   4096 KiB |  16384 KiB |    188 MiB |    184 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   3789 KiB | 223737 KiB |   2031 MiB |   2027 MiB |
|       from large pool |      0 KiB | 221040 KiB |   1234 MiB |   1234 MiB |
|       from small pool |   3789 KiB |   8126 KiB |    796 MiB |    792 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     217    |     304    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     301    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| Active allocs         |     217    |     305    |   16041    |   15824    |
|       from large pool |       0    |       4    |     196    |     196    |
|       from small pool |     217    |     302    |   15845    |   15628    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |      11    |     180    |     178    |
|       from large pool |       0    |       3    |      86    |      86    |
|       from small pool |       2    |       8    |      94    |      92    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      33    |      40    |    9083    |    9050    |
|       from large pool |       0    |       2    |      95    |      95    |
|       from small pool |      33    |      40    |    8988    |    8955    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|
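
For anyone who wants to keep digging into the leak, the "Cur Usage" numbers above can also be read programmatically and compared before and after each test. The following is only a rough sketch, not code from this PR: the helper names (`cuda_snapshot`, `diff_snapshots`, `format_kib`) are hypothetical, and it assumes the `torch.cuda.memory_stats()` keys `allocated_bytes.all.current`, `reserved_bytes.all.current`, and `inactive_split_bytes.all.current` (the last one being the counter the summary labels "Non-releasable memory").

```python
from typing import Dict, Mapping

# Keys of torch.cuda.memory_stats() matching the "Cur Usage" column of the
# summary table. The short names on the left are hypothetical labels.
TRACKED_KEYS = {
    "allocated": "allocated_bytes.all.current",
    "reserved": "reserved_bytes.all.current",
    "non_releasable": "inactive_split_bytes.all.current",
}


def cuda_snapshot(device=None) -> Dict[str, int]:
    """Return the current byte counts of the tracked counters for a device."""
    import torch  # imported lazily so the pure helpers below work without torch

    stats = torch.cuda.memory_stats(device)
    return {name: stats.get(key, 0) for name, key in TRACKED_KEYS.items()}


def diff_snapshots(before: Mapping[str, int], after: Mapping[str, int]) -> Dict[str, int]:
    """Bytes gained (positive) or released (negative) per counter."""
    return {name: after[name] - before[name] for name in before}


def format_kib(snapshot: Mapping[str, int]) -> str:
    """Render a snapshot (or a diff) as a single line in KiB."""
    return ", ".join(f"{name}={value / 1024:.0f} KiB" for name, value in snapshot.items())


if __name__ == "__main__":
    # Byte values taken from the table above; the KiB figures in the table are
    # rounded, so the reserved and non-releasable values are approximations.
    last = {"allocated": 314368, "reserved": 4096 * 1024, "non_releasable": 3789 * 1024}
    print(format_kib(last))
```

One possible use is to call `cuda_snapshot()` at the start and end of each test (after `gc.collect()` and `torch.cuda.empty_cache()`) and log `format_kib(diff_snapshots(before, after))`, so that a test which leaves memory behind shows up as a positive delta.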