Closed Dadja111 closed 5 months ago
Can you try using `pip install git+https://github.com/huggingface/accelerate@patchfixes`? Should fix a few issues on XLA :) (Let us know if it's working for you please!)
Also would love to know what `accelerator.state` says after
Thanks a lot @muellerzr! I will check it out and let you know how things go.
I am still facing the same issue, with more details on the cause:
AttributeError: `AcceleratorState` object has no attribute `fork_launched`. This happens if `AcceleratorState._reset_state()` was called and an `Accelerator` or `PartialState` was not reinitialized.
Indeed, instantiating the Accelerator throws an error related to PartialState on the very first attempt. However, subsequent attempts do not throw an error, as if everything were fine, and the error then appears in the prepare method. It is worth mentioning that the library runs on CPU without issue.
Can you share your full code?
Oh wait you’re in Kaggle/on TPU. That’s totally expected. Please go through the notebook launcher tutorial and use it: https://huggingface.co/docs/accelerate/basic_tutorials/notebook
Ok I will check it.
import os
import gc
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
from torch import nn
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from scipy.signal import hilbert
from torch.utils.data import DataLoader
import torch.optim as optim
import complexPyTorch.complexLayers as cplxlayer
import matplotlib.pyplot as plt
from multiprocessing import Pool
from accelerate import Accelerator
from accelerate.utils import set_seed
#above for the package importation and below for the training
set_seed(SEED)
accelerator = Accelerator(mixed_precision="fp16")
model_b = ModelCloneSparse()
epoch_number = 100
lr = 1e-3
optimizer = optim.Adam(model_b.parameters(), lr=lr)
model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)
train_loss = []
val_loss = []
for epoch in range(epoch_number):
    train_batch_loss = []
    train_total = len(train_loader)
    s = time.perf_counter()
    model_b.train()
    for i, batch in enumerate(train_loader):
        X, y = batch
        #X = X.to(DEVICE, dtype=torch.float)
        #y = y.to(DEVICE, dtype=torch.float)
        y_pred = model_b(X)
        loss = kl_loss(y, y_pred)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        train_batch_loss.append(accelerator.gather(loss))
        print('\rEpoch {} Batch {}/{}\tTrain loss: {:.2f}\t'.format(epoch, i, train_total,
              torch.stack(train_batch_loss).mean()), end="")
    train_batch_loss[-1] *= (X.shape[0]/train_batch)
    train_loss.append(float(torch.stack(train_batch_loss).mean()))
    #break
    val_batch_loss = []
    val_total = len(val_loader)
    model_b.eval()
    with torch.no_grad():
        for i, batch in enumerate(val_loader):
            X, y = batch
            #X = X.to(DEVICE, dtype=torch.float)
            #y = y.to(DEVICE)
            y_pred = model_b(X)
            loss = kl_loss(y, y_pred)
            val_batch_loss.append(accelerator.gather(loss))
    val_batch_loss[-1] *= (X.shape[0]/val_batch)
    val_loss.append(float(torch.stack(val_batch_loss).mean()))
    e = time.perf_counter()
    print('\rEpoch {} Batch {}/{}\tTrain loss: {:.2f}\tVal loss: {:.2f} in {:.2f} seconds'.format(epoch, train_total,
          train_total, train_loss[-1], val_loss[-1], e-s))
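The training loop above rescales the last batch loss by X.shape[0]/train_batch so that a short final batch counts less in the epoch average. A minimal sketch of the exact sample-weighted mean, using plain floats instead of tensors; `weighted_mean_loss` and the numbers are hypothetical and not part of accelerate or the original notebook:

```python
def weighted_mean_loss(batch_losses, batch_sizes):
    """Mean per-sample loss over an epoch, weighting each batch by its size."""
    if len(batch_losses) != len(batch_sizes):
        raise ValueError("need exactly one batch size per batch loss")
    total_samples = sum(batch_sizes)
    if total_samples == 0:
        raise ValueError("cannot average over zero samples")
    # Each batch loss is already a mean over its samples, so multiply it
    # back by the batch size before summing, then divide by the sample count.
    weighted_sum = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return weighted_sum / total_samples


# Illustrative values: two full batches of 32 and a final batch of 16.
losses = [0.5, 0.5, 2.0]
sizes = [32, 32, 16]
epoch_loss = weighted_mean_loss(losses, sizes)
```

Dividing by the total number of samples, rather than by the number of batches, keeps the result equal to the true per-sample mean even when the final batch is smaller.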
The error now occurs on the call to Accelerator():
AttributeError Traceback (most recent call last)
Cell In[16], line 2
1 set_seed(SEED)
----> 2 accelerator = Accelerator(mixed_precision="fp16")
3 model_b = ModelCloneSparse()
4 epoch_number = 100
File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:449, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, dispatch_batches, even_batches, use_seedable_sampler, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend)
445 self.scaler = None
446 self.native_amp = False
447 if (
448 self.state.mixed_precision == "fp16"
--> 449 and self.device.type != "cpu"
450 and self.distributed_type not in (DistributedType.DEEPSPEED, DistributedType.MEGATRON_LM)
451 ):
452 self.native_amp = True
453 if self.device.type not in ("xpu", "cuda", "mps", "npu", "xla", "mlu") or is_torch_xla_available(
454 check_is_tpu=True
455 ):
AttributeError: 'NoneType' object has no attribute 'type'
Previously, for the import I used:
from accelerate import Accelerator
accelerator = Accelerator()
and for training:
model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)
with the following at the update step:
accelerator.backward(loss) instead of loss.backward()
I ran the following before the import:
import os
from accelerate.utils import write_basic_config
write_basic_config()  # Write a config file
os._exit(00)  # Restart the notebook
Thanks a lot for your help. In the end I was not able to run accelerate on TPU because my connection to the TPU cores on Kaggle failed. It is just an internal error. I will continue with a standard GPU.
Thank you for your time and your kindness. I wish you a nice week.
@Dadja111 I do again believe that this may be stemming from not using the notebook_launcher. What does your full notebook look like? Anything device and torch related needs to be declared in your training function.
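A minimal sketch of the pattern described above, assuming a toy linear model and random tensors as stand-ins for ModelCloneSparse, kl_loss, and the real data loaders. `training_function` and `launch` are hypothetical names, and `num_processes=8` assumes a Kaggle TPU with 8 cores. The Accelerator, model, optimizer, and loaders are all created inside the function, so nothing device-related is initialized in the notebook's parent process:

```python
def training_function(epochs=1, lr=1e-3, mixed_precision="no"):
    # Imports and every device-related object live inside the function;
    # each spawned process builds its own copy.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator
    from accelerate.utils import set_seed

    set_seed(42)
    accelerator = Accelerator(mixed_precision=mixed_precision)

    # Toy stand-ins (assumption) for the real dataset, model, and loss.
    features = torch.randn(64, 4)
    targets = torch.randn(64, 1)
    loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)
    model = nn.Linear(4, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for epoch in range(epochs):
        model.train()
        for xb, yb in loader:
            loss = loss_fn(model(xb), yb)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        # accelerator.print only prints on the main process.
        accelerator.print(f"epoch {epoch} last batch loss {loss.item():.4f}")


def launch(num_processes=8):
    # Imported lazily so that defining this cell does not touch accelerate state.
    from accelerate import notebook_launcher

    notebook_launcher(training_function, args=(1, 1e-3, "no"), num_processes=num_processes)
```

In the notebook, `launch()` would be called in its own cell, with no `Accelerator()` created at the top level beforehand.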
I recently came across the accelerate library for managing the training of a torch model on a Kaggle TPU. However, when I use the prepare method of accelerate, I get the following error: 'AcceleratorState' object has no attribute 'fork_launched'
The complete trace is shown below:
AttributeError Traceback (most recent call last)
Cell In[21], line 11
8 lr = 1e-3
9 optimizer = optim.Adam(model_b.parameters(), lr=lr)
---> 11 model_b, optimizer, train_loader, val_loader = accelerator.prepare(model_b, optimizer, train_loader, val_loader)
12 if WANDB:
13 run = wandb.init(project="Brain", job_type="Transformer", config=args.dict)
File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1263, in Accelerator.prepare(self, device_placement, *args)
1261 # MS-AMP will handle the device placement
1262 device_placement = [False for _ in args]
-> 1263 result = tuple(
1264 self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
1265 )
1266 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
1268 if tpu_should_fix_optimizer or (self.mixed_precision == "fp8" and self.fp8_recipe_handler.backend == "TE"):
1269 # 2. grabbing new model parameters
File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1264, in <genexpr>(.0)
1261 # MS-AMP will handle the device placement
1262 device_placement = [False for _ in args]
1263 result = tuple(
-> 1264 self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
1265 )
1266 result = tuple(self._prepare_one(obj, device_placement=d) for obj, d in zip(result, device_placement))
1268 if tpu_should_fix_optimizer or (self.mixed_precision == "fp8" and self.fp8_recipe_handler.backend == "TE"):
1269 # 2. grabbing new model parameters
File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1140, in Accelerator._prepare_one(self, obj, first_pass, device_placement)
1138 return self.prepare_data_loader(obj, device_placement=device_placement)
1139 elif isinstance(obj, torch.nn.Module):
-> 1140 return self.prepare_model(obj, device_placement=device_placement)
1141 elif isinstance(obj, torch.optim.Optimizer):
1142 optimizer = self.prepare_optimizer(obj, device_placement=device_placement)
File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:1445, in Accelerator.prepare_model(self, model, device_placement, evaluation_mode)
1443 kwargs = self.ddp_handler.to_kwargs() if self.ddp_handler is not None else {}
1444 model = torch.nn.parallel.DistributedDataParallel(model, **kwargs)
-> 1445 elif self.distributed_type == DistributedType.XLA and self.state.fork_launched:
1446 model = xmp.MpModelWrapper(model).to(self.device)
1447 # torch.compile should be called last and only if the model isn't already compiled.
AttributeError: 'AcceleratorState' object has no attribute 'fork_launched'