huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

CUDA initialization #908

Open Afera672 opened 1 year ago

Afera672 commented 1 year ago

System Info

Hello everybody. I keep encountering the same issue. I use PyTorch '1.12.1+cu102' and fastai '2.7.9'.
I need to use the multiple GPUs in our server to train deeper networks with more images. 
___
accelerate env

Traceback (most recent call last):
  File "/home/andrea/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/env.py", line 34, in env_command
    accelerate_config = load_config_from_file(args.config_file).to_dict()
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 63, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 116, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'command_file'

Information

Tasks

Reproduction

Here is the script that I am using:


from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model
from accelerate import notebook_launcher

def get_msk(o): return pathRflbl+fr'/RfM{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)] #as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)
file = open(path+'/codes.txt', "w+")

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path,
        bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)


It all works until I use notebook_launcher. Then it comes up with:

ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19     with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20         learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the Accelerator should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "Accelerator."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA function.


Yet, I have no CUDA instructions. And I need the notebook launcher in order to train on multiple GPUs (I would have 6).

Do you have any ideas? Do I need to update some version of something?

Expected behavior

If, instead of my data, I use
path = untar_data(URLs.CAMVID_TINY)

I can train on up to 4 GPUs, also using xresnet50. The processes seem to run on 4 separate GPUs, but I am not yet sure that each one handles a chunk of the total and that the calculation runs in parallel as I intend. For instance, I am not sure that the memory available for the whole calculation is the sum of the GPUs' memory.

Anyhow, could you please help me run this calculation on multiple GPUs?

sgugger commented 1 year ago

cc @muellerzr

Afera672 commented 1 year ago

Thank you, Sylvain! I cannot read your mind, but I understand from your message that this is something you are following up on. Perfect! I hope to be able to solve it very soon with Zachary's help. Thanks again!


PS: That was just me thanking Sylvain for his quick answer pointing me to Zachary, whom I thank in advance for taking the time to help me here. All in all, if this starts working out, this company may be interested in setting up a partnership with you so that you can dedicate some professional time to helping us set this up for our clients.

Afera672 commented 1 year ago

@sgugger @muellerzr

Afera672 commented 1 year ago

@sgugger @muellerzr OK: the news is that this company is now offering a budget to get help from you on a professional basis. I think a few hours should be enough for someone expert in these issues. I do not understand why the dataloader and/or the datablock do not behave as they do with the CAMVID dataset, even though the data are of the same kind: pictures and masks. Could you please help?

muellerzr commented 1 year ago

@Afera672 what version of Accelerate are you using? And can you do echo ~/.cache/huggingface/accelerate/default_config.yml and tell me what it outputs?
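
Both pieces of information can also be gathered from Python. Note that echo only prints the expanded path, so the sketch below reads the file instead. It is a minimal sketch using only the standard library; the helper names `installed_version` and `read_config` are hypothetical, and the default path is simply the one quoted in this comment.

```python
from importlib import metadata
from pathlib import Path


def installed_version(package: str = "accelerate"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def read_config(path: str = "~/.cache/huggingface/accelerate/default_config.yml"):
    """Return the text of the config file, or None if it does not exist.

    The default path is the one quoted in the thread; adjust it if your
    config lives elsewhere.
    """
    config_path = Path(path).expanduser()
    if not config_path.is_file():
        return None
    return config_path.read_text()


if __name__ == "__main__":
    print("accelerate version:", installed_version())
    print(read_config() or "no config file found at the assumed path")
```

Pasting the printed version and file contents into the issue gives the same information as the two questions above.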

muellerzr commented 1 year ago

And also please look at the examples in the fastai docs that showcase how to use this functionality:

https://docs.fast.ai/tutorial.distributed.html

There is an important note there:

It is important to not build the DataLoaders outside of the function, as absolutely nothing can be loaded onto CUDA beforehand.
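
To find out which notebook cell breaks this rule, one option is to run a small probe after each cell, before calling notebook_launcher. The sketch below mirrors the two checks that appear in the launcher tracebacks in this thread. `launch_blockers` is a hypothetical helper name, and `AcceleratorState._shared_state` is a private attribute copied from the traceback, so it may differ between Accelerate versions.

```python
def launch_blockers():
    """Return reasons why notebook_launcher would refuse a multi-GPU launch.

    Missing libraries are treated as "nothing initialized".
    """
    reasons = []
    try:
        import torch

        if torch.cuda.is_initialized():
            reasons.append("CUDA is already initialized in this process")
    except ImportError:
        pass
    try:
        from accelerate.state import AcceleratorState

        # Private attribute taken from the traceback; getattr guards against
        # versions where it is named differently.
        if len(getattr(AcceleratorState, "_shared_state", {})) > 0:
            reasons.append("an Accelerator has already been created")
    except ImportError:
        pass
    return reasons


# Run this in a fresh cell after each notebook cell: the first cell after
# which the list is non-empty is the one to move inside the training function.
print(launch_blockers() or "nothing blocks a launch so far")
```

Once a blocker shows up, the kernel has to be restarted, because neither state can be undone in the running process.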

Afera672 commented 1 year ago

@muellerzr Thank you for the reply.

The command echo ~/.cache/huggingface/accelerate/default_config.yml outputs: /home/andrea/.cache/huggingface/accelerate/default_config.yml

I should have accelerate 2.0 installed, but I have not found a way to confirm this.

It seems to load the datablock, which I simply have like this:

# 3 Start training on multiple GPUs on a parallel thread

from accelerate import notebook_launcher

def get_msk(o): return pathRflbl+fr'/RfM{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)]
print('numeral codes ', numeral_codes) #numeral codes understood by FastAI

file = open(path+'/codes.txt', "w+")

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path,
        bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        after_item=ToTensor(),
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(resnet34,dls, dls=TfmdDL(after_item=ToTensor(4,80,80), after_batch=[IntToFloatTensor(), *aug_transforms()], bs=8))
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)


Then, when I run (in the next cell or in the same one):

notebook_launcher(train, num_processes=2)

it raises an exception:

Launching training on 2 GPUs.


ProcessRaisedException                    Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 notebook_launcher(train, num_processes=2)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:127, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
    124     launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
    126     print(f"Launching training on {num_processes} GPUs.")
--> 127     start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    129 else:
    130     # No need for a distributed launch otherwise as it's either CPU or one GPU.
    131     if torch.cuda.is_available():

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:160, in ProcessContext.join(self, timeout)
    158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_2495035/3417734586.py", line 21, in train
    dls = SegmentationDataLoaders.from_label_func(
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/vision/data.py", line 216, in from_label_func
    res = cls.from_dblock(dblock, fnames, path=path, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/core.py", line 281, in from_dblock
    return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/block.py", line 157, in dataloaders
    return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
TypeError: fastai.data.core.FilteredBase.dataloaders() got multiple values for keyword argument 'after_item'

I cannot find how to implement 'after_item', which should resize all images to the same dimensions. I thought this was already done in the datablock definition, no?

Any ideas?

In passing, the reason why I do not want to use 'distributed learning' is that when I tried it, it opened multiple threads on the same GPU. I need instead to have multiple GPUs working together so that I can train and use deeper networks (ResNet50 or higher) with many hundreds of images. Right now, ResNet34 only just fits in memory. This is a segmentation problem: we segment satellite images.

This company is also offering a fee for your consulting on how to use this library efficiently, since we do not have much time and have already been trying for a few weeks. I hope you have time; please let me know if you can take this offer. HR will send you (and whoever you would like to work with) a contract involving a non-disclosure agreement.

Anyhow, thank you for answering me. Looking forward to your reply. Andrea Fera

muellerzr commented 1 year ago

Your fastai code looks wrong to me. Also it would be helpful if you could wrap the code in code ticks (`) so that the code gets preformatted properly. Try using the code such that:

def train():
  dls = SegmentationDataLoaders.from_label_func(
    path, 
    bs=8, 
    fnames = get_image_files(path+'/Impng'),
    label_func = get_msk, 
    item_tfms=[ToTensor()],
    batch_tfms=[IntToFloatTensor(), *aug_transforms()],
    codes = np.loadtxt(path+'/codes.txt', dtype=str)
  )
  learn = unet_learner(dls, resnet34)
  with learn.distrib_ctx(in_notebook=True, sync_bn=False):
    learn.fit(10)
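
For context, the TypeError in the earlier traceback ("got multiple values for keyword argument 'after_item'") is what Python raises when a function passes a keyword explicitly and also receives the same keyword through **kwargs. The toy reproduction below only mimics the shape of the last call in that traceback; the two functions are hypothetical stand-ins, not fastai's code, and the transforms are plain strings.

```python
def dataloaders(after_item=None, after_batch=None, **kwargs):
    """Stand-in for the innermost call; it just echoes what it received."""
    return {"after_item": after_item, "after_batch": after_batch, **kwargs}


def from_label_func(item_tfms=None, batch_tfms=None, **kwargs):
    """Stand-in for the outer factory.

    It forwards item_tfms under the name after_item and also passes along
    any extra keyword arguments, as the call in the traceback does.
    """
    return dataloaders(after_item=item_tfms, after_batch=batch_tfms, **kwargs)


# Passing the transform under the outer name works.
ok = from_label_func(item_tfms=["ToTensor"], bs=8)

# Passing after_item directly makes it arrive twice at the inner call.
try:
    from_label_func(after_item=["ToTensor"], bs=8)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

This is why the suggestion above uses item_tfms and batch_tfms rather than after_item.
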
Afera672 commented 1 year ago

@muellerzr

Thank you for your insight. I also feared that part was wrong. Yet now it replies like this, regardless of how I indent line 28. I also varied the indentation of line 19, but with no progress. What do you think the problem is?

[Screenshot]

muellerzr commented 1 year ago

Hi @Afera672, would it be possible to upload the notebook you're using as a github gist so I can follow along exactly and clearly how things are going? Thanks!

Afera672 commented 1 year ago

Hi Zachary,

Of course I can send you the notebook. Can you simply send me an email address to send it to you?

Thanks!!! Andrea

PS our offer to pay a fee for your consulting services is still open as well.

Afera672 commented 1 year ago

@muellerzr sorry...this is me trying to send you the file...

Afera672 commented 1 year ago

@muellerzr I beg your pardon, Zach. I am not very well versed in this interface. I am trying to send it again now. No, it does not allow attaching .ipynb notebooks. I am sorry: it only lets me attach this script as a PDF. Here it is. I hope it is clear enough...

Multi-GPUs not working ASCI.pdf

muellerzr commented 1 year ago

Thanks @Afera672, your issue is this line, you shouldn't be re-making dataloaders:

learn = unet_learner(resnet34,dls,
 dls=TfmdDL(after_item=ToTensor(4,80,80),
 after_batch=[IntToFloatTensor(), 
*aug_transforms()], bs=8))

To fix it (or at least get further), change that code to:

learn = unet_learner(dls, resnet34)

Afera672 commented 1 year ago

Thanks @muellerzr .

Yes, I am sorry: I actually posted the file before finding that it is indeed wrong. I found another issue as well, with the codes, and I am trying to fix it. My problem with such a simple dls is that I do not know how to apply the transform, or how to tell it to make all images a certain size. This is why I used a datablock before. Which form does a datablock need to have here? The regular one? I'll try. Thanks!!

muellerzr commented 1 year ago

I'd recommend opening a thread on the fastai forums for more help, since the issue is with the framework more than Accelerate specifically :)

https://forums.fast.ai

Afera672 commented 1 year ago

Thank you for the insightful suggestion, @muellerzr, but I believe I have a strange problem. After I run:

Screenshot 2022-12-21 at 11 53 57

The problem is that Accelerate now needs me to use SegmentationDataLoaders, and I need to insert transformations, but I do not know how to do it. Can you send me an example with SegmentationDataLoaders where you insert the 'Items_transform' or 'after_item' in order to standardize the images seen by the algorithm? Here is what it says when I run notebook_launcher:

Screenshot 2022-12-21 at 11 55 08

Thanks for your help!

Afera672 commented 1 year ago

@muellerzr

Hi Zack, I have an important update. I realized that SegmentationDataLoaders.from_label_func() evidently embeds both datablock and dataloader characteristics, so I inserted size standardization of the images. And it worked, AT FIRST:

Afera672 commented 1 year ago

Screenshot 2022-12-21 at 14 24 09

Afera672 commented 1 year ago

@muellerzr But if I start the calculation on more than 2 GPUs, it crashes with out-of-memory errors:

Screenshot 2022-12-21 at 14 34 20

Now, the reason I want to use many GPUs is precisely to avoid this sort of error. Do you have any idea how I could manage the memory and/or ask Accelerate to do it for us? We plan to have MANY images to train on, and to use at least ResNet50, while for now I am confined to ResNet34. Which is not bad, but... Thank you for your time!

looperEit commented 11 months ago

I also encountered this problem and don't know how to solve it. I know that CUDA must not be initialized before running notebook_launcher. But none of my previous code initialized it or called torch.cuda. What should we do?

ckpt_path = 'baichuan13b_ner'

optimizer = bnb.optim.adamw.AdamW(peft_model.parameters(), lr=6e-05,is_paged=True) #'paged_adamw'

# Initialize KerasModel
keras_model = KerasModel(peft_model, loss_fn =None, optimizer=optimizer)

# Load the fine-tuned weights
keras_model.load_ckpt(ckpt_path)

# Train on multiple GPUs
keras_model.fit_ddp(num_processes=2,
                    train_data=dl_train,
                    val_data=dl_val,
                    epochs=100,
                    patience=10,
                    monitor='val_loss',
                    mode='min',
                    ckpt_path=ckpt_path)


> ValueError                                Traceback (most recent call last)
> Cell In[30], line 12
>       9 keras_model.load_ckpt(ckpt_path)
>      11 # Train on multiple GPUs
> ---> 12 keras_model.fit_ddp(num_processes=2,
>      13                     train_data=dl_train,
>      14                     val_data=dl_val,
>      15                     epochs=100,
>      16                     patience=10,
>      17                     monitor='val_loss',
>      18                     mode='min',
>      19                     ckpt_path=ckpt_path)
> 
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/torchkeras/kerasmodel.py:282, in KerasModel.fit_ddp(self, num_processes, train_data, val_data, epochs, ckpt_path, patience, monitor, mode, callbacks, plot, wandb, quiet, mixed_precision, cpu, gradient_accumulation_steps)
>     279 from accelerate import notebook_launcher
>     280 args = (train_data,val_data,epochs,ckpt_path,patience,monitor,mode,
>     281     callbacks,plot,wandb,quiet,mixed_precision,cpu,gradient_accumulation_steps)
> --> 282 notebook_launcher(self.fit, args, num_processes=num_processes)
> 
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/accelerate/launchers.py:116, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
>     113 from torch.multiprocessing.spawn import ProcessRaisedException
>     115 if len(AcceleratorState._shared_state) > 0:
> --> 116     raise ValueError(
>     117         "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
>     118         "inside your training function. Restart your notebook and make sure no cells initializes an "
>     119         "`Accelerator`."
>     120     )
>     122 if torch.cuda.is_initialized():
>     123     raise ValueError(
>     124         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
>     125         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
>     126         "function."
>     127     )
> 
> ValueError: To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized inside your training function. Restart your notebook and make sure no cells initializes an `Accelerator`.