Closed: sam-hieken closed this issue 1 year ago.
cc @pacman100
Hello @sam-hieken, please refer to this: https://github.com/huggingface/accelerate/issues/1358#issuecomment-1523784839
As stated in that comment, the checkpointing tests are working. This is probably an issue with the `ProjectConfiguration`. @muellerzr, could you look into this?
Also @sam-hieken, before saving state, make sure to add a distributed barrier so that all processes reach that point before saving. Adding `accelerator.wait_for_everyone()` before `accelerator.save()` resolves the issue with FSDP. What is happening is that in a normal multi-GPU setup, even without `wait_for_everyone()`, both processes execute `save_state` at the same time, whereas in FSDP one of the processes reaches that point late (hence the need for `wait_for_everyone`). However, note that this has nothing to do with the FSDP integration. Hope this helps.
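In code, the suggested ordering is just a barrier immediately before the save call; a minimal sketch using the calls mentioned in this thread:

accelerator.wait_for_everyone()  # barrier: block until every process reaches this point
accelerator.save_state()         # only then write the checkpoint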
Thanks so much @pacman100, adding `wait_for_everyone()` before `save_state()` fixed my problem! I take it I should leave this open since there may be an issue with `ProjectConfiguration`?
Hello @pacman100, is there any update on this?
It seems that `save_state()` is also called multiple times when running with `distributed_type: MULTI_GPU`. When `automatic_checkpoint_naming=True`, I got the same error:
ValueError: Checkpoint directory ./output/checkpoints/checkpoint_0 (0) already exists. Please manually override `self.save_iteration` with what iteration to start with.
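For context, with `automatic_checkpoint_naming=True` every save targets `checkpoints/checkpoint_<iteration>` under the project directory and refuses to reuse an existing one, so a rank that reaches the save after another rank has already created the directory fails. A rough sketch of that check, reconstructed from the error text rather than the actual Accelerate source:

import os

def make_auto_checkpoint_dir(project_dir, iteration):
    # Automatic naming targets <project_dir>/checkpoints/checkpoint_<iteration>.
    output_dir = os.path.join(project_dir, 'checkpoints', f'checkpoint_{iteration}')
    if os.path.exists(output_dir):
        # A rank arriving after another rank already created the directory ends up here.
        raise ValueError(
            f'Checkpoint directory {output_dir} ({iteration}) already exists. '
            'Please manually override `self.save_iteration` with what iteration to start with.'
        )
    os.makedirs(output_dir)
    return output_dir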
Hello @khokao, does the suggestion in the above comment not resolve it? https://github.com/huggingface/accelerate/issues/1374#issuecomment-1530917319
As I said, this isn't an issue with the FSDP integration.
@pacman100 I added `wait_for_everyone()` before `save_state()`, but it still raises the same error.
I've tried the experiment several times, and the error rarely occurs; in most cases there is no error at all. It might be an issue with my server, so you don't need to address it!
Thank you very much for the swift reply! @pacman100
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@pacman100 @muellerzr I'm having this issue as well, even after trying the `wait_for_everyone()` fix. As mentioned before, this likely has nothing to do with the `distributed_type` and instead seems to be an issue with `ProjectConfiguration`. Below is my Accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 16
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
My error is the following:
ValueError: Checkpoint directory ./output/checkpoints/checkpoint_0 (0) already exists. Please manually override `self.save_iteration` with what iteration to start with.
Hello, a minimal reproducer for this would help @muellerzr deep dive into this with respect to `ProjectConfiguration`.
Thanks for the reply @pacman100. Here is a script for @muellerzr that reproduces the error with my above Accelerate config:
import os
import random
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2Model
def seed_everything(seed):
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
project_config = ProjectConfiguration(
project_dir='./output',
automatic_checkpoint_naming=True,
total_limit=None,
)
accelerator = Accelerator(
mixed_precision='bf16',
gradient_accumulation_steps=16,
log_with=None,
project_config=project_config,
)
device = accelerator.device
seed_everything(1337)
dataset = load_dataset('wikitext', 'wikitext-103-v1', num_proc=12)
accelerator.print(dataset)
train_loader = DataLoader(
dataset=dataset['train'],
batch_size=16,
shuffle=True,
num_workers=8,
pin_memory=True,
drop_last=False,
)
val_loader = DataLoader(
dataset=dataset['validation'],
batch_size=16,
shuffle=False,
num_workers=8,
pin_memory=True,
drop_last=False,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1, amsgrad=False)
warmup_iters = 100
max_iters = 1000
scheduler = SequentialLR(
optimizer=optimizer,
schedulers=[
LinearLR(optimizer, start_factor=1e-8, end_factor=1.0, total_iters=warmup_iters),
CosineAnnealingLR(optimizer, T_max=max_iters - warmup_iters),
],
milestones=[warmup_iters],
)
model, optimizer, train_loader, val_loader, scheduler = accelerator.prepare(model, optimizer, train_loader, val_loader, scheduler)
accelerator.save_state() # ValueError: Checkpoint directory ./output/checkpoints/checkpoint_0 (0) already exists. Please manually override `self.save_iteration` with what iteration to start with
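As a side note (not part of the original reproducer), replacing the final `save_state()` call with something like the lines below logs what each rank sees just before saving, which helps confirm whether multiple processes race on `checkpoint_0`; the names reuse the objects already defined in the script:

ckpt_dir = os.path.join(project_config.project_dir, 'checkpoints', 'checkpoint_0')  # path from the error message
print(f'rank {accelerator.process_index}: checkpoint dir already exists = {os.path.isdir(ckpt_dir)}')
accelerator.wait_for_everyone()  # barrier suggested earlier in the thread
accelerator.save_state()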
I'm also having this issue, and it's not resolved by calling `wait_for_everyone()` before `save_state()`.
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
Hello,
`save_state()` appears to be trying to save the same checkpoint twice when using FSDP, which leads to an error (I did check to ensure no checkpoints existed before running the script). The following script should fully reproduce the issue, where `test.txt` is just any text file. The error is as follows:
Running the script with the same configuration above but without FSDP appeared successful.
Like I said above, I believe it's trying to save the same checkpoint twice, based on my error message, and the fact that the checkpoint is created after the script fails.
Expected behavior
A working `save_state()` function.
On a somewhat related note, I did try using
instead of
but as stated in #1171 this shouldn't be necessary, and it just led to the program blocking indefinitely at `save_state()`.
Thank you.