huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Missing 'fork_launched' attribute in version 0.27.0 and 0.28.0 #2635

Closed Dadja111 closed 5 months ago

Dadja111 commented 6 months ago

Recently I came across the accelerate library to manage training a torch model on a Kaggle TPU. However, when I use the prepare method of accelerate, I get the following error: 'AcceleratorState' object has no attribute 'fork_launched'

The complete traceback is shown below:

AttributeError                            Traceback (most recent call last)
Cell In[21], line 11
      8 lr = 1e-3
      9 optimizer = optim.Adam(model_b.parameters(), lr=lr)
---> 11 model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)
     12 if WANDB:
     13     run = wandb.init(project="Brain", job_type="Transformer", config=args.__dict__)

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1263, in Accelerator.prepare(self, device_placement, *args)
   1261 # MS-AMP will handle the device placement
   1262 device_placement = [False for _ in args]
-> 1263 result = tuple(
   1264     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
   1265 )
   1266 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
   1268 if tpu_should_fix_optimizer or (self.mixed_precision == "fp8" and self.fp8_recipe_handler.backend == "TE"):
   1269     # 2. grabbing new model parameters

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1264, in <genexpr>(.0)
   1261 # MS-AMP will handle the device placement
   1262 device_placement = [False for _ in args]
   1263 result = tuple(
-> 1264     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
   1265 )
   1266 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
   1268 if tpu_should_fix_optimizer or (self.mixed_precision == "fp8" and self.fp8_recipe_handler.backend == "TE"):
   1269     # 2. grabbing new model parameters

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1140, in Accelerator._prepare_one(self, obj, first_pass, device_placement)
   1138     return self.prepare_data_loader(obj, device_placement=device_placement)
   1139 elif isinstance(obj, torch.nn.Module):
-> 1140     return self.prepare_model(obj, device_placement=device_placement)
   1141 elif isinstance(obj, torch.optim.Optimizer):
   1142     optimizer = self.prepare_optimizer(obj, device_placement=device_placement)

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1445, in Accelerator.prepare_model(self, model, device_placement, evaluation_mode)
   1443     kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
   1444     model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
-> 1445 elif self.distributed_type == DistributedType.XLA and self.state.fork_launched:
   1446     model = xmp.MpModelWrapper(model).to(self.device)
   1447     # torch.compile should be called last and only if the model isn't already compiled.

AttributeError: 'AcceleratorState' object has no attribute 'fork_launched'

muellerzr commented 6 months ago

Can you try using pip install git+https://github.com/huggingface/accelerate@patchfixes? Should fix a few issues on XLA :) (Let us know if it's working for you please!)

Also, I would love to know what accelerator.state says afterwards.

Dadja111 commented 6 months ago

Thanks a lot @muellerzr! I will check it out and let you know how things are going.

Dadja111 commented 6 months ago

I am still facing the same issue, now with more detail on the cause: AttributeError: 'AcceleratorState' object has no attribute 'fork_launched'. This happens if AcceleratorState._reset_state() was called and an Accelerator or PartialState was not reinitialized.

Indeed, instantiating the Accelerator throws an error related to PartialState on the very first attempt. However, subsequent attempts do not throw an error, as if everything were fine, until the prepare method is called. It is worth mentioning that the library runs on CPU without issue.
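
For reference, a minimal sketch of the situation that error message describes, assuming the state was reset somewhere between cells (AcceleratorState._reset_state() is the internal helper named in the message; re-instantiating the Accelerator rebuilds the shared state):

from accelerate import Accelerator
from accelerate.state import AcceleratorState

accelerator = Accelerator()      # initializes the global AcceleratorState
AcceleratorState._reset_state()  # clears the shared state, the scenario named in the error
# Using the old `accelerator` now can fail with missing-attribute errors such as
# 'fork_launched'; re-instantiating rebuilds the state:
accelerator = Accelerator()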

muellerzr commented 6 months ago

Can you share your full code?

muellerzr commented 6 months ago

Oh wait you’re in Kaggle/on TPU. That’s totally expected. Please go through the notebook launcher tutorial and use it: https://huggingface.co/docs/accelerate/basic_tutorials/notebook
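
For reference, the pattern from that tutorial looks roughly like the sketch below; the tiny model, data, and num_processes=8 are illustrative placeholders, not the user's setup. The key point is that the Accelerator, model, optimizer, and dataloaders are all created inside the function that notebook_launcher spawns:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator, notebook_launcher

def training_loop():
    # Everything device/torch related is built inside this function, so each
    # process spawned by notebook_launcher gets its own copies.
    accelerator = Accelerator()
    model = nn.Linear(16, 1)                             # placeholder model
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    data = TensorDataset(torch.randn(128, 16), torch.randn(128, 1))
    loader = DataLoader(data, batch_size=8, shuffle=True)

    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    for X, y in loader:
        loss = nn.functional.mse_loss(model(X), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Spawn one process per TPU core (8 on a Kaggle TPU v3-8).
notebook_launcher(training_loop, num_processes=8)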

Dadja111 commented 6 months ago

Ok I will check it.

Dadja111 commented 6 months ago
import os
import gc
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
from torch import nn
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from scipy.signal import hilbert
from torch.utils.data import DataLoader

import torch.optim as optim
import complexPyTorch.complexLayers as cplxlayer
import matplotlib.pyplot as plt
from multiprocessing import Pool

from accelerate import Accelerator
from accelerate.utils import set_seed

# Imports above; training code below (SEED, ModelCloneSparse, kl_loss, train_loader, val_loader, train_batch and val_batch are defined elsewhere in the notebook)

set_seed(SEED)
accelerator = Accelerator(mixed_precision="fp16")
model_b = ModelCloneSparse()
epoch_number = 100
lr = 1e-3
optimizer = optim.Adam(model_b.parameters(), lr=lr)

model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)

train_loss = []
val_loss = []
for epoch in range(epoch_number):
    train_batch_loss = []
    train_total = len(train_loader)
    s = time.perf_counter()
    model_b.train()
    for i, batch in enumerate(train_loader):
        X, y = batch
        #X = X.to(DEVICE, dtype=torch.float)
        #y = y.to(DEVICE, dtype=torch.float)
        y_pred = model_b(X)
        loss = kl_loss(y, y_pred)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        train_batch_loss.append(accelerator.gather(loss))
        print('\rEpoch {} Batch {}/{}\tTrain loss: {:.2f}\t'.format(epoch, i, train_total,
                            torch.stack(train_batch_loss).mean()), end="")
    train_batch_loss[-1] *= (X.shape[0]/train_batch)
    train_loss.append(float(torch.stack(train_batch_loss).mean()))
    #break
    val_batch_loss = []
    val_total = len(val_loader)
    model_b.eval()
    with torch.no_grad():
        for i, batch in enumerate(val_loader):
            X, y = batch
            #X = X.to(DEVICE, dtype=torch.float)
            #y = y.to(DEVICE)
            y_pred = model_b(X)
            loss = kl_loss(y, y_pred)
            val_batch_loss.append(accelerator.gather(loss))
        val_batch_loss[-1] *= (X.shape[0]/val_batch)
        val_loss.append(float(torch.stack(val_batch_loss).mean()))
    e = time.perf_counter()

    print('\rEpoch {} Batch {}/{}\tTrain loss: {:.2f}\tVal loss: {:.2f} in {:.2f} seconds'.format(epoch, train_total, 
                                    train_total, train_loss[-1], val_loss[-1], e-s))

The error now occurs on the call to Accelerator():

AttributeError                            Traceback (most recent call last)
Cell In[16], line 2
      1 set_seed(SEED)
----> 2 accelerator = Accelerator(mixed_precision="fp16")
      3 model_b = ModelCloneSparse()
      4 epoch_number = 100

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:449, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, dispatch_batches, even_batches, use_seedable_sampler, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend)
    445 self.scaler = None
    446 self.native_amp = False
    447 if (
    448     self.state.mixed_precision == "fp16"
--> 449     and self.device.type != "cpu"
    450     and self.distributed_type not in (DistributedType.DEEPSPEED, DistributedType.MEGATRON_LM)
    451 ):
    452     self.native_amp = True
    453     if self.device.type not in ("xpu", "cuda", "mps", "npu", "xla", "mlu") or is_torch_xla_available(
    454         check_is_tpu=True
    455     ):

AttributeError: 'NoneType' object has no attribute 'type'

Before, for the import I used:

from accelerate import Accelerator
accelerator = Accelerator()

and for training
model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)

with the following at the update step:

accelerator.backward(loss) instead of loss.backward()

Dadja111 commented 6 months ago

I run the following before the imports:

import os
from accelerate.utils import write_basic_config

write_basic_config()  # Write a config file
os._exit(00)  # Restart the notebook

Dadja111 commented 6 months ago

Thanks a lot for your help. In the end I was not able to run accelerate on TPU because my connection to the TPU cores on Kaggle failed; it is just an internal error. I will continue with a standard GPU.

Thanks for your time and your kindness. I wish you a nice week.

muellerzr commented 6 months ago

@Dadja111 I still believe this may be stemming from not using the notebook_launcher. What does your full notebook look like? Anything device- and torch-related needs to be declared in your training function.
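
Concretely, that would mean wrapping the training code from the earlier comment in a function and launching it with notebook_launcher instead of running the cells at the top level. A rough sketch (SEED, ModelCloneSparse, and kl_loss are the user's own definitions; build_dataloaders is a hypothetical stand-in for however the dataloaders are built):

from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed
import torch.optim as optim

def training_function():
    # Accelerator, model, optimizer and dataloaders are all created here,
    # inside the function that notebook_launcher runs in each TPU process.
    set_seed(SEED)
    accelerator = Accelerator(mixed_precision="fp16")
    model_b = ModelCloneSparse()
    optimizer = optim.Adam(model_b.parameters(), lr=1e-3)
    train_loader, val_loader = build_dataloaders()  # hypothetical helper for the user's data
    model_b, optimizer, train_loader, val_loader = accelerator.prepare(
        model_b, optimizer, train_loader, val_loader
    )
    # ... the training/validation loop from the earlier comment goes here ...

notebook_launcher(training_function, num_processes=8)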

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.