pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

A bug related to Torch version #439

Closed. zyushun closed this issue 4 weeks ago.

zyushun commented 1 month ago

Hi, I am using torch 2.5.0.dev20240617+cu121 and I am hitting the following error that I have not been able to resolve.

[rank0]:[rank0]:   File "...../torchtitan/parallelisms/parallelize_llama.py", line 50, in checkpoint_wrapper
[rank0]:[rank0]:     from torch.utils.checkpoint import (
[rank0]:[rank0]: ImportError: cannot import name 'CheckpointPolicy' from 'torch.utils.checkpoint' 

Any suggestions?

awgu commented 1 month ago

Sorry for the inconvenience :/

It looks like your torch version is not new enough. The selective activation checkpointing (SAC) API was made public about 2 weeks ago (https://github.com/pytorch/pytorch/commit/1877b7896c237567285804ecc138bc86180a7ced), which introduced CheckpointPolicy, and torchtitan has since migrated to use it.
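
For context, here is a minimal sketch of how the public SAC API is meant to be used. This is illustrative only, not torchtitan's exact code: the op set, `_policy_fn`, and the `sac_forward` helper are made up for the example.

```python
# Minimal sketch of the public SAC API (illustrative, not torchtitan's code).
# Assumes a nightly new enough to export CheckpointPolicy and
# create_selective_checkpoint_contexts from torch.utils.checkpoint.
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

# Illustrative policy: save matmul outputs, prefer recomputing everything else.
_ops_to_save = {torch.ops.aten.mm.default}

def _policy_fn(ctx, op, *args, **kwargs):
    if op in _ops_to_save:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

def sac_forward(fn, *args):
    # context_fn must create fresh context managers on every call, hence the lambda.
    return checkpoint(
        fn,
        *args,
        use_reentrant=False,
        context_fn=lambda: create_selective_checkpoint_contexts(_policy_fn),
    )
```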

allela-roy commented 1 month ago

@awgu, I am also encountering this issue with the latest nightly build of torch. I also tried the previous nightly build (torch==2.4.0.dev20240612) and am still hitting the issue. cc @lessw2020

awgu commented 1 month ago

That is unexpected, since CheckpointPolicy is included in torch/utils/checkpoint.py's __all__ in both the latest July 5 nightly release and the July 4 nightly release:
https://github.com/pytorch/pytorch/blob/419de136208cf578f2d1202c3e192e1541945d99/torch/utils/checkpoint.py#L32
https://github.com/pytorch/pytorch/blob/a532737602e884d983cbaa3f12fdede56afc5131/torch/utils/checkpoint.py#L32

I do not have the setup to repro this right now, so I will have to get back to you. cc @tianyu-l: it would be good to understand why our CI did not catch this breakage (as of July 5, the 4-GPU CI is passing).

In the meantime, can you work around this by not using selective op activation checkpointing (that import only runs when selective op AC is enabled)? See https://github.com/pytorch/torchtitan/blob/b0ed7f075921357b01e28fddc6d90a2cc410bab3/train_configs/llama3_8b.toml#L52-L54. For example, you can use mode = 'full', mode = 'none', or mode = 'selective' with selective_ac_option = 2, as sketched below.
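
Roughly, the relevant section of the training config would look like this. This is only a sketch: the section and key names are taken from the linked llama3_8b.toml, and the exact quoting of the option value should follow that file's convention.

```toml
# Workaround sketch for the activation checkpointing section of the training
# config (section/key names based on the linked llama3_8b.toml).
[activation_checkpoint]
mode = "selective"        # "full" or "none" also avoid the failing import
selective_ac_option = "2" # checkpoint every 2nd layer instead of the per-op policy
```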

zyushun commented 1 month ago

Thanks for the swift response @awgu! I am using mode = 'none' for now and it works.

allela-roy commented 1 month ago

Thanks @awgu. The only issue is that without selective op checkpointing we quickly run out of memory, as you can see below. This is on a p4de instance (8 x A100 GPUs with 80 GB each), training Llama3 8B, which previously ran fine with selective checkpointing (memory consumption was at ~70%).

2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - WARNING - 1 CUDA memory allocation retries.
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,694  mfu: 31.45%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,694  mfu: 31.45%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,694  mfu: 31.45%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,693  mfu: 31.43%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,694  mfu: 31.45%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.56GiB(97.99%)  wps: 1,694  mfu: 31.45%
2024-07-06 15:02:38,485 - root - INFO - step:  1  loss: 12.2514  memory: 77.46GiB(97.86%)  wps: 1,694  mfu: 31.45%

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 4 has a total capacity of 79.15 GiB of which 3.10 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 73.72 GiB is allocated by PyTorch, and 706.19 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

awgu commented 1 month ago

@allela-roy The alternative to unblock you for now is to use mode = 'full'. We need to investigate whether torchtitan is out of sync with the nightly or the nightly is out of sync with torchtitan.

tianyu-l commented 1 month ago

@awgu @zyushun @allela-roy The new SAC API in torch went through a merge-revert-remerge cycle during the week of 06/13 to 06/17. For details, see https://github.com/pytorch/pytorch/pull/125795.

Could you please try again with the latest PyTorch nightly? It should resolve the import issue.

awgu commented 4 weeks ago

Closing for now, since this is a known issue that just requires a newer nightly.