huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Conflict between the DeepSpeed framework and accelerate.save_model #2985

Open yangtian6781 opened 1 month ago

yangtian6781 commented 1 month ago

System Info

- `Accelerate` version: 0.33.0
- Platform: Linux-5.4.0-182-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/tianyang/miniconda3/envs/bunny/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 503.79 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

Tasks

Reproduction

import torch
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from torch.optim import AdamW

prefix_date = 'stage3_bf16'
projectconfig = ProjectConfiguration(project_dir=f'/home/tianyang/transformers-code/work_dir/{prefix_date}',
                                     logging_dir=f'/home/tianyang/transformers-code/work_dir/{prefix_date}')
class Simple(Dataset):
    def __init__(self, root='./data.bin') -> None:
        super().__init__()
        self.data = torch.load(root, weights_only=True)

    def __len__(self):
        return 150

    def __getitem__(self, index):
        data = torch.tensor(self.data[index], dtype=torch.bfloat16)
        return data

class Linear_model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear1 = nn.Linear(5, 10)
        self.linear2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

model = Linear_model()
optimizer = AdamW(model.parameters())

my_dataset = Simple()

my_dataloader = DataLoader(my_dataset, batch_size=5, shuffle=False, num_workers=2, drop_last=False)
accelerator = Accelerator(project_config=projectconfig)

my_dataloader, model, optimizer = accelerator.prepare(my_dataloader, model, optimizer)

# Saving only the model's state_dict; this call raises the error shown below on rank 1.
accelerator.save_model(model=model, save_directory=f'{accelerator.project_dir}/save_model', safe_serialization=False)
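
For completeness, './data.bin' is not attached; it can be recreated with something like the following (an assumption on my side: any 150-element numeric tensor that Simple can index reproduces the issue):

import torch
# Hypothetical recreation of the data file read by Simple(root='./data.bin'):
# 150 scalar values (1..150), later loaded with torch.load(..., weights_only=True).
torch.save(torch.arange(1, 151, dtype=torch.float32), './data.bin')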

Expected behavior

In my code, the './data.bin' file simply contains the numbers 1 to 150. I set DeepSpeed ZeRO stage 3 with zero3_save_16bit_model=True, and I only want to save the model's state_dict. Although the script does save the state_dict into pytorch_model.bin successfully, the following error occurs:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/tianyang/transformers-code/try_len_loader.py", line 48, in <module>
[rank1]:     accelerator.save_model(model=model, save_directory=f'{accelerator.project_dir}/save_model', safe_serialization=False)
[rank1]:   File "/home/tianyang/miniconda3/envs/bunny/lib/python3.10/site-packages/accelerate/accelerator.py", line 2790, in save_model
[rank1]:     state_dict_split = split_torch_state_dict_into_shards(
[rank1]:   File "/home/tianyang/miniconda3/envs/bunny/lib/python3.10/site-packages/huggingface_hub/serialization/_torch.py", line 330, in split_torch_state_dict_into_shards
[rank1]:     return split_state_dict_into_shards_factory(
[rank1]:   File "/home/tianyang/miniconda3/envs/bunny/lib/python3.10/site-packages/huggingface_hub/serialization/_base.py", line 100, in split_state_dict_into_shards_factory
[rank1]:     for key, tensor in state_dict.items():
[rank1]: AttributeError: 'NoneType' object has no attribute 'items'

Looking at the traceback, the state_dict passed to split_torch_state_dict_into_shards is None on rank 1; under ZeRO-3 the consolidated state dict is only gathered on the main process, so save_model seems to crash on the other ranks. Does this mean accelerate.save_model is not fully compatible with DeepSpeed ZeRO-3?
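
As a workaround on my side, explicitly gathering the state dict and writing it only on the main process does not crash. This is only a sketch, assuming zero3_save_16bit_model=True so the consolidated 16-bit state dict is available on the main process; get_state_dict must still be called on every rank:

import os
import torch

# Sketch of a workaround, not the official fix:
# get_state_dict() is a collective call; under ZeRO-3 it returns the consolidated
# dict on the main process and may return None on the other ranks.
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    save_dir = f'{accelerator.project_dir}/save_model'
    os.makedirs(save_dir, exist_ok=True)
    torch.save(state_dict, os.path.join(save_dir, 'pytorch_model.bin'))
accelerator.wait_for_everyone()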

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.