huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0
7.56k stars · 911 forks

Issues with saving model/optimizer and loading them back #285

Closed ashutoshsaboo closed 2 years ago

ashutoshsaboo commented 2 years ago

Hello @sgugger ,

Came across multiple related issues regarding this - https://github.com/huggingface/accelerate/issues/242, https://github.com/huggingface/accelerate/issues/154 . They were all closed with this PR - https://github.com/huggingface/accelerate/pull/255, but unfortunately the PR doesn't seem to have much documentation.

Specifically, I was looking for how to save a model, its optimizer state, LR scheduler state, its random seeds/states, epoch/step count, and other similar state needed for reproducible training runs and for resuming them correctly.

I know there's this very brief doc here: here and here , but it looks like there are still a few grey areas that aren't documented regarding its usage.

a) Like the official example here: link , which calls save_pretrained only in the main process, should I also call these new methods only in the main process (both save and load)? And after load_state, do I have to call prepare() to get everything ready for multi-GPU training/inference, or does load_state do all of that internally?

b) Does the save_state method call the model's save_pretrained internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are other specialized methods for those, please advise. If there's a simple toy example that already uses these new checkpointing methods and you can share it, that'd be pretty helpful!

The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify some of these specifics for users!

Thanks so much in advance! :)

sgugger commented 2 years ago

Hi there, thanks for reaching out! Saving should only be done in the main process (no need to save several times!) but the loading should be done in all processes so each process has the right state. We'll work on adding more documentation on #255 next week (Zach is on vacation right now :-) )

The save_state method won't call save_pretrained; it just saves the state dictionary of the model. So you will need to call save_pretrained manually alongside your checkpoint saves if you want to reload those checkpoints outside of your script. However, if you only use the checkpoints inside the script, load_state will work perfectly fine; it's only for the final model save that you will need to call save_pretrained.
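
A minimal sketch of that split, assuming a transformers model already prepared with Accelerate; the directory names ("checkpoint_dir", "final_model") and the model/accelerator variables are illustrative:

# In-script checkpoints: save_state captures the model, optimizer, scheduler and RNG states.
accelerator.save_state("checkpoint_dir")

# Final save, so the model can be reloaded elsewhere with from_pretrained
# (this writes the config as well as the weights).
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained("final_model")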

As for a release, it's coming today actually!

ashutoshsaboo commented 2 years ago

Thanks for the response! What do you mean by "want to reload your checkpoints outside of your script"? If my training abruptly stops in between, and I need to resume it from the last saved checkpoint - would that be possible with load_state (and not calling save_pretrained between checkpoints)? @sgugger

If my use case doesn't involve pushing to the Hub (which save_pretrained can do under the hood), could you elaborate on the specific differences between save_pretrained and save_state with regards to model saving (since save_state might be saving other entities too)?

The docs for save_pretrained are abstract, as they just say it "saves a model", so I don't know exactly what save_state saves for the model vs. save_pretrained and whether there are any differences (it looks to me like there are some, since you also recommend using save_pretrained for the final model rather than save_state).

Save a model and its configuration file to a directory, so that it can be re-loaded using from_pretrained()

sgugger commented 2 years ago

The difference between save_pretrained and save_state wrt the model is that save_state only saves the model weights, whereas save_pretrained saves the model config as well.

But your model is already instantiated in your script, so you can reload the weights there (with load_state); save_pretrained is not necessary for that. However, if you want to use your model outside of your training script, especially with the from_pretrained method, you will need the model config, hence my suggestion to call save_pretrained at the end to save the final model.
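
And a sketch of the reuse-outside-the-script side, assuming the final model was written with save_pretrained to a hypothetical "final_model" directory:

# In a separate inference script: from_pretrained needs the config.json that
# save_pretrained wrote alongside the weights (a save_state checkpoint alone
# would not be enough).
from transformers import AutoModel

model = AutoModel.from_pretrained("final_model")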

ashutoshsaboo commented 2 years ago

Thanks, that makes sense to me now!

For "loading should be done in all processes so each process has the right state" -> I was curious conceptually, shouldn't load_state() on all processes vs load_state() on main process only + calling accelerator.prepare() on the loaded model have the same result ideally? If not, why? If yes, does the former have any explicit advantages (other than the naive part of writing reduced boilerplate code)? @sgugger

Small side question: so load_state will load the model across all processes and also allocate/move the model to each GPU (in a multi-GPU case) automatically, without explicitly calling accelerator.prepare(). Is that the correct understanding?

sgugger commented 2 years ago

load_state should be called after accelerator.prepare to ensure reproducibility. For the model or optimizers it shouldn't change anything whether you call it before or after, but since accelerator.prepare initializes the generators of the dataloaders, you will get a slight difference there. save_state saves the RNG state on each process, so load_state needs to be called on each process to be able to restore it.

Note that you can safely call save_state on each process too; Accelerate makes sure behind the scenes that it only actually saves on the main process.
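
Put together, a minimal sketch of that ordering (the resume flag and checkpoint path are illustrative):

# prepare() first, then load_state, as suggested above, so the dataloader
# generators initialized by prepare() get the checkpointed RNG state.
model, optimizer, dataloader, scheduler = accelerator.prepare(model, optimizer, dataloader, scheduler)

if resume_from_checkpoint:
    # Call on every process so each one restores its own RNG state.
    accelerator.load_state("checkpoints/latest")

# ... training loop ...

# Safe to call on every process; only the main process actually writes the files.
accelerator.save_state("checkpoints/latest")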

ashutoshsaboo commented 2 years ago

That makes sense. Let me clarify the side question above a bit. My use case is something like this: I do cross-validation using train/val sets, call save_state at the end of each epoch, and get the best-performing model by calling load_state towards the end. This best-performing model is then evaluated on the test set in the same script to get the final eval metrics. A pretty standard use case, basically.

Where I was coming from in the side question is this: when I load the best-performing model using load_state, all my data loaders (including the test-set one), the existing model (which might not be the same as the best model, since the best model could be from an earlier epoch), the optimizers, etc. are already prepared with the accelerator at the top of the script.

So when I actually call load_state to get the best model, would I have to call accelerator.prepare on this best model again (which != the existing model), or does load_state do the heavy lifting internally of loading the weights in all processes and also moving them to the GPUs? (If my understanding is right, the former happens because load_state is called on each process rather than only on the main process as you suggested, but I wasn't sure whether the weights would also be moved to the relevant GPUs without calling accelerator.prepare.) Apologies for not clarifying this entirely earlier, but this was the intent behind the side question in my previous message. Could you help answer? @sgugger

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

cccntu commented 2 years ago

Hi @sgugger , I think there is a bug. I can try to make a reproducible example if you think that's necessary. But I want to write it down first.

load_state should be called after accelerator.prepare to ensure reproducibility.

But when save_state saves the model, it calls get_state_dict https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/accelerator.py#L925 which unwraps the model first before returning the state_dict https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/accelerator.py#L1020-L1021

So when I call accelerator.prepare, then load_state, I get an error:

  File "python3.8/site-packages/accelerate/accelerator.py", line 940, in load_state
    load_accelerator_state(
  File "python3.8/site-packages/accelerate/checkpointing.py", line 136, in load_accelerator_state
    models[i].load_state_dict(torch.load(input_model_file, map_location="cpu"))
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:
        Missing key(s) in state_dict: "module.transformer.wte.weight", "module.transformer.wpe.weight", "module.transformer.h.0.ln_1.weight", "module.transformer.h.0.ln_1.bi

I added a line to print out the loaded dict; it looks like this:

['transformer.wte.weight', 'transformer.wpe.weight', 'transformer.h.0.ln_1.weight', 

sgugger commented 2 years ago

cc @muellerzr

muellerzr commented 2 years ago

@cccntu are you saving the state before calling accelerator.prepare and then trying to load it in? A reproducer is needed, yes

cccntu commented 2 years ago

@muellerzr I followed the suggestion above to save after prepare

Here is my example:

Name the Python file below 1.py, then run the bash script that follows it twice.

import os

from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM


def main():
    dataset = [1, 2, 3]
    dataloader = DataLoader(dataset, batch_size=1)
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = AdamW(model.parameters(), lr=0.001)
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    # On the second run the checkpoint already exists, so load it back; this is where the error occurs.
    path = "checkpoint"
    if os.path.exists(path):
        print("loading")
        accelerator.load_state(path)
    else:
        print("did not load")
    accelerator.save_state(path)


if __name__ == "__main__":
    main()

echo "compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  zero_stage: 1
distributed_type: DEEPSPEED
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 1
mixed_precision: bf16
" > accelerate_config.yaml

accelerate launch --config_file accelerate_config.yaml 1.py
pip freeze | grep -e accelerate -e deepspeed
accelerate==0.10.0
deepspeed==0.6.5
muellerzr commented 2 years ago

@cccntu I saw no issue with this when I saved and loaded from a 2-GPU state or saved and loaded from a 1-GPU state. Are you modifying the accelerate config in between runs?

Or did you save the state in one distributed mode and then try loading it without distributed training?

My steps:

accelerate config
--- Script answers below ---
(0, 2, 1, no, no, 2, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully
# delete our checkpoint originally made
rm -r checkpoint
accelerate config
--- Script answers below ---
(0, no, no, no)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and ran successfully
cccntu commented 2 years ago

@muellerzr Could you try it again, but with DeepSpeed?

accelerate config
--- Script answers below ---
(0, 0, no, yes, *defaults)
accelerate launch test_script.py
# printed "did not load"
accelerate launch test_script.py
# printed "loading" and get errors

Or use my bash script in my last reply.

I just noticed the error I got has this:

RuntimeError: Error(s) in loading state_dict for DeepSpeedEngine:

and the code that unwrap_model calls has this:

https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/utils/other.py#L35-L51

The error only happens with DeepSpeed because, normally, the model is appended to self._models before it is wrapped in another class.

https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/accelerator.py#L381-L383

But with DeepSpeed, prepare calls prepare_deepspeed instead https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/accelerator.py#L484-L488 and the model is appended to self._models after it has been wrapped. https://github.com/huggingface/accelerate/blob/6ebddcd5e0cb29d1453657896f2e3b1cbf9952d4/src/accelerate/accelerator.py#L691-L692
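
To see why the key prefixes mismatch, here is a generic PyTorch illustration (not Accelerate or DeepSpeed internals): a state dict saved from an unwrapped module cannot be loaded into a module that wraps it, because the wrapper prefixes every key.

import torch.nn as nn

inner = nn.Linear(2, 2)
wrapper = nn.Sequential(inner)  # stand-in for a wrapper class such as DeepSpeedEngine

print(list(inner.state_dict()))    # ['weight', 'bias']
print(list(wrapper.state_dict()))  # ['0.weight', '0.bias']

# Loading the unwrapped state dict into the wrapper fails with
# "Missing key(s) in state_dict", analogous to the DeepSpeedEngine error above.
try:
    wrapper.load_state_dict(inner.state_dict())
except RuntimeError as err:
    print(err)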

pacman100 commented 2 years ago

Hello @cccntu ,

A few quick queries:

  1. Are you trying to reload the checkpoint to resume training or to run inference? If you want to run inference, only ZeRO Stage-3 is applicable, because no optimizer states or gradients are involved during inference. If you want to resume training, please follow the answer to issue #418

  2. Does the "Saving and loading models" section of the DeepSpeed integration documentation enable you to achieve your end goal: deepspeed?

cccntu commented 2 years ago

Hi @pacman100 ,

  1. I am trying to resume training; that's why I use save_state to also save the optimizer, instead of unwrap_model + save. The answer you linked uses DeepSpeed's load_checkpoint. I think that might work, I need to try it; thanks for the pointer. From the usage alone I am not sure whether it loads the optimizer and scheduler, see https://github.com/microsoft/DeepSpeed/issues/647

  2. The doc says

    Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.

Does that mean save_state and load_state should work? I am not using stage-3.

cccntu commented 2 years ago

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked: it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.
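
For reference, a rough sketch of the pattern I mean, with made-up directory and tag names; it assumes model is the DeepSpeed-prepared engine returned by accelerator.prepare:

# save_checkpoint / load_checkpoint are DeepSpeedEngine methods, called on every process.
model.save_checkpoint("ds_ckpt", tag="epoch_3")

# Later, to resume; load_checkpoint returns the load path and any saved client state.
load_path, client_state = model.load_checkpoint("ds_ckpt", tag="epoch_3")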

cyk1337 commented 1 year ago

Update: I tried DeepSpeed's model.save_checkpoint and model.load_checkpoint. It worked: it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler.

It can also load normal optimizer states, but not LambdaLR schedulers.

I met the same problem. With DeepSpeed, both model.save_checkpoint and accelerator.save_state hang. How can I save the optimizer/lr_scheduler states for resuming training (with DeepSpeed)?

pacman100 commented 1 year ago

@cyk1337, could you share a minimal reproducible example? The DeepSpeed tests run daily on a GPU-hosted runner, and the checkpoint saving and loading functionality works fine there.

https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L699

amarazad commented 11 months ago

Hi @pacman100, I am using FSDP with full sharding. I use the following to save state so that I can resume from the last state:

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    if config["SAVE_STATE"]:
        accelerator.save_state(save_state_dir)

And to resume:

model = accelerator.prepare(model)
optimizer = accelerator.prepare(optimizer)
...
if config["RESUME_STATE"]:
    accelerator.wait_for_everyone()
    accelerator.load_state(save_state_dir)

However, while saving, it hangs at accelerator.save_state(save_state_dir) and after a long time throws the following error:

INFO:accelerate.accelerator:Saving FSDP model
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804648 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804768 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18364, OpType=ALLGATHER, Timeout(ms)=4800000) ran for 4804777 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18338, OpType=_ALLGATHER_BASE, Timeout(ms)=4800000) ran for 4807653 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820777 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2820780 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 2820776) of binary: /home/apa/anaconda3/envs/py39_llm/bin/python
Traceback (most recent call last): ..
....
jiaxilv commented 5 months ago

I have the same problem as @cyk1337: with DeepSpeed Stage-2, model.save_checkpoint and accelerator.save_state block the program from continuing. I am now saving my unet with the following code:

unwrapped_model = accelerator.unwrap_model(unet)
accelerator.save(unwrapped_model.state_dict(), os.path.join(save_path, "model.pth"))

But I found that when I use the following code to save the optimizer state, it saves only part of the optimizer state, not all of it:

unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), os.path.join(save_path, "optimizer.pth"))

Is there anything that can be done to fix this?

jhliu17 commented 3 months ago

(Quoting @amarazad's FSDP full-sharding report above: accelerator.save_state(save_state_dir) hangs while saving and eventually fails with NCCL collective-operation timeouts.)

I met the same problem. Do you have any solutions to it? Thank you~