mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Pytorch Criteo CUDA error #425

Closed priyakasimbeg closed 1 year ago

priyakasimbeg commented 1 year ago

Branch: dev
Test link: https://github.com/mlcommons/algorithmic-efficiency/actions/runs/5416731116/jobs/9846848568
For details, expand criteo_pytorch and then expand Run Containerized Workload.

Description

The Criteo PyTorch workload OOMs.

Traceback:

ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 3; 15.78 GiB total capacity; 12.14 GiB already allocated; 307.44 MiB free; 14.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Steps to Reproduce

On kasimbeg-3

docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev
docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_dev -d criteo1tb -f pytorch -s baselines/adamw/pytorch/submission.py -w criteo1tb -t baselines/adamw/tuning_search_space.json -e test_today/adamw -m 10 -c False -o True -r false
priyakasimbeg commented 1 year ago

Update: I made sure I ran with the --torch_compile flag, and now the submission_runner crashes with CUDA error: misaligned address. The full traceback is in https://github.com/mlcommons/algorithmic-efficiency/actions/runs/5534559953/jobs/10099946892 under criteo_pytorch > Run containerized workload.

runame commented 1 year ago

Update: I checked 1) that I can reproduce the same error with reference_submission_tests.py and 2) what the largest batch size is that the workload still runs with successfully. It turns out to be 32768, which is exactly 1/8th of the previous training batch size -- a bit suspicious since we are using 8 GPUs.

Also, the OOM error occurs during the 2nd optimizer update step.
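For reference, a quick-and-dirty way to run the batch-size probe mentioned above; the build_model and run_step helpers are hypothetical placeholders, not functions from this repo, and 262144 is the current training batch size mentioned above:

import torch

def find_max_batch_size(build_model, run_step, start=262144, min_bs=1024):
    """Halve the batch size until one forward/backward/optimizer step fits in memory."""
    bs = start
    while bs >= min_bs:
        model = build_model()
        try:
            run_step(model, bs)  # one forward/backward/optimizer-step cycle
            return bs
        except torch.cuda.OutOfMemoryError:
            del model
            torch.cuda.empty_cache()  # release cached blocks before retrying
            bs //= 2
    return None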

pomonam commented 1 year ago

Confirmed that the new commit does not fix the issue. I tried several things yesterday, but none of them fixed it. For bookkeeping, these are the things I tried:

I will brainstorm some other possible ideas.

znado commented 1 year ago

Given that this input pipeline is pretty simple (it just reads some TSV files), would it be worth reimplementing it as a native PyTorch pipeline and seeing if that fixes the OOM issue?
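If anyone wants to try that, here is a rough sketch of what a native PyTorch pipeline over the TSV shards could look like. The file pattern, field layout, and hex-encoded categorical columns are assumptions based on the standard Criteo format, worker/file sharding is omitted, and this is not the repo's actual pipeline:

import glob
import torch
from torch.utils.data import DataLoader, IterableDataset

class CriteoTsvDataset(IterableDataset):
    """Streams (dense, categorical, label) examples from Criteo-style TSV shards."""

    def __init__(self, file_pattern):
        self.files = sorted(glob.glob(file_pattern))

    def __iter__(self):
        for path in self.files:
            with open(path) as f:
                for line in f:
                    fields = line.rstrip("\n").split("\t")
                    # Assumed layout: label, 13 dense integer features, 26 hex categorical features.
                    label = float(fields[0])
                    dense = [float(x) if x else 0.0 for x in fields[1:14]]
                    cats = [int(x, 16) if x else 0 for x in fields[14:40]]
                    yield torch.tensor(dense), torch.tensor(cats), torch.tensor(label)

# Single-process loading for simplicity; proper per-worker file sharding is left out.
loader = DataLoader(CriteoTsvDataset("/data/criteo1tb/day_*"), batch_size=32768, num_workers=0)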

runame commented 1 year ago

I think we should do some simple ablations first, since the workload did run without issues until recently. @pomonam Have you tried running the code from the dev branch with PyTorch 1.13.1 instead of 2.0.1, to check if the issue is really just due to the PyTorch update?

pomonam commented 1 year ago

Unfortunately, I did not have a chance to do that. @priyakasimbeg do you have a Docker image for PyTorch 1.13.1? Could you please point me to the image I should use? Thank you!

pomonam commented 1 year ago

Also, reading #217, it seems like the previous OOM was solved by using a "memory-efficient" NAdamW. Could it be that we run into OOM when we use PyTorch's stock AdamW (which seems to be what we use for the regression tests)?
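For context on what "memory-efficient" buys here: the stock multi-tensor AdamW path materializes sqrt/denominator temporaries for all parameters at once (the torch._foreach_add line in the traceback above), so peak memory grows with the total parameter count. An update that loops over parameters and works in place only ever holds one per-parameter temporary. A minimal sketch of that idea with plain AdamW math, not the repo's actual NAdamW implementation:

import math
import torch

@torch.no_grad()
def adamw_step_inplace(params, exp_avgs, exp_avg_sqs, step, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2):
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    for p, m, v in zip(params, exp_avgs, exp_avg_sqs):
        if p.grad is None:
            continue
        p.mul_(1 - lr * weight_decay)                             # decoupled weight decay
        m.mul_(beta1).add_(p.grad, alpha=1 - beta1)               # first moment
        v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)   # second moment
        # Only one temporary (the size of this one parameter) is alive at a time.
        denom = (v.sqrt() / math.sqrt(bias_correction2)).add_(eps)
        p.addcdiv_(m, denom, value=-lr / bias_correction1)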

runame commented 1 year ago

Good point that a different optimizer is used for the regression tests, but I did run my tests with the "memory-efficient" version of NAdamW, so that should not be the issue.

runame commented 1 year ago

Just to keep this thread up to date: the OOM error indeed disappears when we revert to PyTorch 1.13.1.

msaroufim commented 1 year ago

cc @janeyx99 on the OOM optimizer issues

pomonam commented 1 year ago

I tracked down the issue and will send a fix soon (with a detailed explanation of why we had this OOM issue).

janeyx99 commented 1 year ago

Hi -- I'd be happy to work on fixing OOMs in the optim step from the PyTorch side if there is something to be done. I recently worked on a series of fixes to decrease the memory usage of our default optimizers relative to 2.0.0, so the newer Adam(W) in our nightlies should no longer use as much memory. (One can install the nightlies by following the install selector here.)

However, I am curious whether y'all had tried foreach=False (or also fused=True, which is more performant than even our default!). If foreach=False did not OOM, it would make sense why moving back to 1.13 fixed the issue, since the single-tensor implementation was the default then. However, if foreach=False did OOM but the 1.13.1 version did not, then there may be another issue I should look into. Looking forward to @pomonam's explanation!
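For anyone running that ablation, the code path can be selected explicitly when constructing the optimizer; the parameters and learning rate below are just placeholders:

import torch

params = [torch.randn(1000, 1000, device="cuda", requires_grad=True)]

opt_single  = torch.optim.AdamW(params, lr=1e-3, foreach=False)  # single-tensor loop, the 1.13 default
opt_foreach = torch.optim.AdamW(params, lr=1e-3, foreach=True)   # multi-tensor path, the 2.0 default
opt_fused   = torch.optim.AdamW(params, lr=1e-3, fused=True)     # fused CUDA kernel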

pomonam commented 1 year ago

Hi @janeyx99, thank you for the comment! We did try setting foreach=False, and this solved the OOM issue for some other workloads, but not for this one. In my previous comment I thought I had solved the issue, but after creating another Docker image and rerunning the code, it seems like the issue persists. I am still debugging, but I thought it would be useful to share what I have found so far and get feedback (in case you have any).

The code runs fine for three update steps, but we hit an OOM at the fourth update step:

Traceback (most recent call last):
  File "submission_runner.py", line 644, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 615, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 538, in score_submission_on_workload
    timing, metrics = train_once(workload, global_batch_size,
  File "submission_runner.py", line 300, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/baselines/adamw/pytorch/submission.py", line 115, in update_params
    optimizer_state['optimizer'].step()
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/adamw.py", line 440, in _single_tensor_adamw
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 15.78 GiB total capacity; 12.10 GiB already allocated; 285.44 MiB free; 14.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  1. Adding torch.cuda.empty_cache() after each gradient step works around the OOM. However, this is probably not an acceptable solution, as it can significantly slow down training (see the sketch after this list).
  2. Here is the output of torch.cuda.memory_summary() just before the third optimizer_state['optimizer'].step() (which does not raise OOM):
    |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |  10344 MiB |  14441 MiB |  45674 MiB |  35330 MiB |
    |       from large pool |  10338 MiB |  14435 MiB |  45632 MiB |  35294 MiB |
    |       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
    |---------------------------------------------------------------------------|
    | Active memory         |  10344 MiB |  14441 MiB |  45674 MiB |  35330 MiB |
    |       from large pool |  10338 MiB |  14435 MiB |  45632 MiB |  35294 MiB |
    |       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
    |---------------------------------------------------------------------------|
    | Requested memory      |  10342 MiB |  14438 MiB |  45661 MiB |  35319 MiB |
    |       from large pool |  10336 MiB |  14432 MiB |  45619 MiB |  35283 MiB |
    |       from small pool |      5 MiB |      6 MiB |     42 MiB |     36 MiB |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |  14526 MiB |  14526 MiB |  18218 MiB |   3692 MiB |
    |       from large pool |  14518 MiB |  14518 MiB |  18210 MiB |   3692 MiB |
    |       from small pool |      8 MiB |      8 MiB |      8 MiB |      0 MiB |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |  85786 KiB |   3622 MiB |  23010 MiB |  22926 MiB |
    |       from large pool |  85699 KiB |   3620 MiB |  22965 MiB |  22881 MiB |
    |       from small pool |     86 KiB |      3 MiB |     44 MiB |     44 MiB |
    |---------------------------------------------------------------------------|
    | Allocations           |      80    |      98    |     606    |     526    |
    |       from large pool |      22    |      41    |     263    |     241    |
    |       from small pool |      58    |      76    |     343    |     285    |
    |---------------------------------------------------------------------------|
    | Active allocs         |      80    |      98    |     606    |     526    |
    |       from large pool |      22    |      41    |     263    |     241    |
    |       from small pool |      58    |      76    |     343    |     285    |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |      15    |      28    |      31    |      16    |
    |       from large pool |      11    |      25    |      27    |      16    |
    |       from small pool |       4    |       4    |       4    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |       9    |      19    |     204    |     195    |
    |       from large pool |       6    |      12    |      96    |      90    |
    |       from small pool |       3    |       8    |     108    |     105    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|

    And this is the output just before the fourth optimizer_state['optimizer'].step() (which produces the error message above):

    |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 1         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |  10343 MiB |  14441 MiB |  58834 MiB |  48490 MiB |
    |       from large pool |  10337 MiB |  14435 MiB |  58778 MiB |  48440 MiB |
    |       from small pool |      5 MiB |      6 MiB |     55 MiB |     50 MiB |
    |---------------------------------------------------------------------------|
    | Active memory         |  10343 MiB |  14441 MiB |  58834 MiB |  48490 MiB |
    |       from large pool |  10337 MiB |  14435 MiB |  58778 MiB |  48440 MiB |
    |       from small pool |      5 MiB |      6 MiB |     55 MiB |     50 MiB |
    |---------------------------------------------------------------------------|
    | Requested memory      |  10342 MiB |  14438 MiB |  58816 MiB |  48474 MiB |
    |       from large pool |  10336 MiB |  14432 MiB |  58761 MiB |  48424 MiB |
    |       from small pool |      5 MiB |      6 MiB |     55 MiB |     49 MiB |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |  14526 MiB |  14526 MiB |  18218 MiB |   3692 MiB |
    |       from large pool |  14518 MiB |  14518 MiB |  18210 MiB |   3692 MiB |
    |       from small pool |      8 MiB |      8 MiB |      8 MiB |      0 MiB |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |   2134 MiB |   3623 MiB |  33192 MiB |  31057 MiB |
    |       from large pool |   2132 MiB |   3620 MiB |  33129 MiB |  30996 MiB |
    |       from small pool |      2 MiB |      3 MiB |     63 MiB |     60 MiB |
    |---------------------------------------------------------------------------|
    | Allocations           |      80    |      98    |     806    |     726    |
    |       from large pool |      22    |      41    |     347    |     325    |
    |       from small pool |      58    |      76    |     459    |     401    |
    |---------------------------------------------------------------------------|
    | Active allocs         |      80    |      98    |     806    |     726    |
    |       from large pool |      22    |      41    |     347    |     325    |
    |       from small pool |      58    |      76    |     459    |     401    |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |      15    |      28    |      31    |      16    |
    |       from large pool |      11    |      25    |      27    |      16    |
    |       from small pool |       4    |       4    |       4    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |      13    |      20    |     282    |     269    |
    |       from large pool |       8    |      13    |     128    |     120    |
    |       from small pool |       5    |       8    |     154    |     149    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|

    One thing I am confused about is that the allocated memory is actually slightly smaller before the failing step (although the number of non-releasable allocations increased).

  3. The issue persists with and without torch.compile (note that we are using backend=aot_eager). The difference is that with torch.compile we get the OOM after the third update (only on RANK=0), while without torch.compile we get it after the first update (on several RANKs). The line that causes the OOM is the same.
  4. When I was testing yesterday, I did not get an OOM for the first ten update steps (so we might really be right at the memory boundary).
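As referenced in item 1, here is roughly what the two stopgaps look like. The train_step wrapper is a hypothetical stand-in for the submission's update function, and the 256 MiB split size is an arbitrary example; these are diagnostics for fragmentation rather than real fixes:

import os
import torch

# Option A: cap the allocator's split size (as hinted in the OOM message). Must be set
# before the process creates any CUDA tensors, or exported in the shell before launching.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

# Option B: release cached blocks after every update step (slows training down).
def train_step(optimizer, loss):
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.empty_cache()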

I will continue looking into this, but if you have any recommendations, they would be very helpful! Thank you again.

priyakasimbeg commented 1 year ago

@janeyx99 I think our Criteo data download and setup fixes are still in progress. In the meantime I can add you to our external GCP project and set you up with a VM to help debug this OOM. Let me know if that sounds like a good idea to you.

janeyx99 commented 1 year ago

@priyakasimbeg That sounds like a good plan! One disclaimer: it might take a little while, because I would need to get a GPU machine to be able to run VMs/Docker on my side (my current dev environment is internal and cannot run containers).

Also, if your VM is backed by an existing GPU machine so that I wouldn't have to find a local GPU, that would be even better!

janeyx99 commented 1 year ago

@pomonam The fact that setting foreach=False didn't fix it for this workload reminds me of an issue I ran into while writing benchmarks: https://github.com/pytorch/pytorch/issues/100264. That issue is due to dynamo accidentally holding onto memory and should already be fixed (but the fix is not yet in any release).

I could also try once I have a machine set up, but would you be able to test on a recent nightly torch build, if you aren't already? One can install the CUDA 11.8 version (there is also a 12.1 version) with: pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118

pomonam commented 1 year ago

@janeyx99 Thank you for looking into this, and sorry for not updating here. I indeed noticed that running on the nightly fixes the OOM issue (#484). One strange thing is that I still sometimes hit an OOM on the nightly version (in the middle or at the end of training), but it runs without OOM in most cases. So I think it was indeed related to that issue.

janeyx99 commented 1 year ago

@pomonam Ah, were those OOMs also in the optimizer step? It is possible that there are other memory issues not related to optimizers, and it would be good to get those fixed too. If we could narrow down the surface, I could try to get the relevant people looking into it, as I'm not a dynamo/compiler pro haha.

Also, just FYI, I don't believe there are code changes you could make on your side to avoid the OOMs on the fourth step with PT 2.0.1, since the issue was in torch.compile internals and was only fixed after that release.

priyakasimbeg commented 1 year ago

@janeyx99 what is a good email address to use to add you to our GCP project? Also lmk if there are any dynamo/compiler people that I can set up a VM for.

janeyx99 commented 1 year ago

> @janeyx99 what is a good email address to use to add you to our GCP project?

janeyx@meta.com would be good

> Also lmk if there are any dynamo/compiler people that I can set up a VM for.

The first person that comes to mind is @mlazos, who has been working on compiling optimizers. It is not yet certain that this bug is related to optimizer compilation, though.

priyakasimbeg commented 1 year ago

@janeyx99 I added you to our GCP project. Did you get a chance to test your access yet? I sent an email with GCP instructions to janeyx@meta.com, but I'm not sure if you received it.

priyakasimbeg commented 1 year ago

Just updating here that Jane has access to our VMs now.

To quickly reproduce the Criteo bug I recommend using one of our pre-built docker images:

  1. Pull the Docker container with torch nightly (torch.dev08202023): docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing
  2. Run the container in the background: docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_pytorch_diagnosing -a true. This will print out a container ID.
  3. Bash into the container: docker exec -it <container_id> /bin/bash
  4. Run the submission runner on Criteo: torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 submission_runner.py --framework=pytorch --workload=criteo1tb --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/criteo1tb --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=criteo_pytorch_oom_debugging --overwrite=True --save_checkpoints=False --max_global_steps=10 --torch_compile=true
janeyx99 commented 1 year ago

To confirm the current status: does the OOM now only occur nondeterministically with --torch_compile=true? Or does it OOM in eager mode as well?

pomonam commented 1 year ago

It used to fail in both cases (so I don't think this is a compile issue). My PR https://github.com/mlcommons/algorithmic-efficiency/pull/502 seems to fix the issue (although we have to use a much smaller batch size during evaluation).

pomonam commented 1 year ago

Based on the conversation, I will close this issue, since the error has been resolved by the recent fix (the PR above) and this is marked as a hard blocker. However, I created a new issue, #505, to debug the OOM when using the full embedding matrix. (Please feel free to reopen this issue if we hit the problem again.)