Open Afera672 opened 1 year ago
cc @muellerzr
Thank you, Sylvain! I can't read your mind, but I understand from your message that this is something you will look at next. Perfect! I hope it can be resolved very quickly with Zachary's help. Thanks again!
PS: That was just me thanking Sylvain for his quick answer pointing me to Zachary, whom I thank in advance for taking the time to help me here. All in all, if this starts working out, this company may be interested in setting up a partnership with you so that you can dedicate some professional time to helping us set this up for our clients.
@sgugger @muellerzr
@sgugger @muellerzr OK, the news is that this company is now offering a budget to get your help on a professional basis. I think a few hours should be enough for an expert in these issues. I do not understand why the dataloader and/or the datablock do not behave as they do with the CAMVID dataset, even though the data are of the same kind: pictures and masks. Could you please help?
@Afera672 what version of Accelerate are you using? And can you run `echo ~/.cache/huggingface/accelerate/default_config.yml` and tell me what it outputs?
And also please look at the examples in the fastai docs that showcase how to use this functionality:
https://docs.fast.ai/tutorial.distributed.html
There is an important note there:
It is important to not build the DataLoaders outside of the function, as absolutely nothing can be loaded onto CUDA beforehand.
@muellerzr Thank you for the reply.
Running `echo ~/.cache/huggingface/accelerate/default_config.yml` outputs: /home/andrea/.cache/huggingface/accelerate/default_config.yml
I should have accelerate 2.0 installed, but I have not found a way to confirm this.
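One way to confirm which versions are installed is to ask the package metadata directly. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions relevant to this thread; None means "not installed here".
for name in ("accelerate", "fastai", "torch"):
    print(name, installed_version(name))
```

Running this in the same notebook kernel used for training matters, since a different environment may have different versions installed.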
It seems to load the datablock, which I have defined simply like this:
from accelerate import notebook_launcher

def get_msk(o): return pathRflbl+fr'/RfM{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)]
print('numeral codes ', numeral_codes) #numeral codes understood by FastAI
file = open(path+'/codes.txt', "w+")
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        after_item=ToTensor(),
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(resnet34,dls,
                         dls=TfmdDL(after_item=ToTensor(4,80,80),
                                    after_batch=[IntToFloatTensor(), *aug_transforms()], bs=8))
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)
Then, when I run (in the next cell or in the same one):
notebook_launcher(train, num_processes=2)
it raises an exception:
Launching training on 2 GPUs.
ProcessRaisedException                    Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 notebook_launcher(train, num_processes=2)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:127, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
    124 launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
    126 print(f"Launching training on {num_processes} GPUs.")
--> 127 start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    129 else:
    130     # No need for a distributed launch otherwise as it's either CPU or one GPU.
    131     if torch.cuda.is_available():

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:160, in ProcessContext.join(self, timeout)
    158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_2495035/3417734586.py", line 21, in train
    dls = SegmentationDataLoaders.from_label_func(
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/vision/data.py", line 216, in from_label_func
    res = cls.from_dblock(dblock, fnames, path=path, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/core.py", line 281, in from_dblock
    return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/block.py", line 157, in dataloaders
    return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
TypeError: fastai.data.core.FilteredBase.dataloaders() got multiple values for keyword argument 'after_item'
And I cannot find how to implement 'after_item', which should reformat all images to the same dimensions. I thought this was already done in the datablock definition, no?
Any ideas?
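The last frame of the traceback shows the inner call receiving `after_item=self.item_tfms` together with `**kwargs`, so an `after_item` supplied by the caller arrives a second time. The mechanism can be reproduced in plain Python. The two functions below are simplified stand-ins written for this illustration, not fastai's actual code.

```python
def dataloaders(after_item=None, after_batch=None, **kwargs):
    """Stand-in for the innermost call in the traceback; it just echoes its arguments."""
    return {"after_item": after_item, "after_batch": after_batch, **kwargs}


def from_label_func(item_tfms=None, batch_tfms=None, **kwargs):
    """Stand-in for the factory: it maps item_tfms to after_item itself,
    then forwards any extra keyword arguments unchanged."""
    return dataloaders(after_item=item_tfms, after_batch=batch_tfms, **kwargs)


# Supported spelling: the factory fills in after_item exactly once.
ok = from_label_func(item_tfms="resize", bs=8)

# Passing after_item directly means it reaches the inner call twice.
error = None
try:
    from_label_func(after_item="to_tensor", bs=8)
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

Under this reading, the item-level transform would be passed as `item_tfms` rather than `after_item`.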
In passing, the reason I do not want to use 'distributed learning' is that when I tried it, it opened multiple threads on the same GPU. What I need instead is multiple GPUs working together, so that I can train and use deeper networks (ResNet50 or higher) with many hundreds of images. Right now, with ResNet34, it only just avoids running out of memory. This is a segmentation problem: we segment satellite images.
This company is also offering a fee for you to consult with me/us on how to use this library efficiently, since we do not have much time and have already been trying for a few weeks. I hope you have time; let me know if you can take this offer. HR will send you (and whoever you would like to work with) a contract involving a non-disclosure agreement.
Anyhow, thank you for answering me. Looking forward to your reply. Andrea Fera
Your fastai code looks wrong to me. Also it would be helpful if you could wrap the code in code ticks (`) so that the code gets preformatted properly. Try using the code such that:
def train():
    dls = SegmentationDataLoaders.from_label_func(
        path,
        bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        item_tfms=[ToTensor()],
        batch_tfms=[IntToFloatTensor(), *aug_transforms()],
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)
@muellerzr
Thank you for your insight. I also feared that part was wrong. Yet now it replies like this, regardless of how I indent line 28. I also varied the indentation of line 19, but made no progress. What do you think the problem is?
[Screenshot; image not preserved]
--
From: Zachary Mueller @.> Date: Monday, December 19, 2022 at 15:16 To: huggingface/accelerate @.> Cc: Andrea Fera @.>, Mention @.> Subject: Re: [huggingface/accelerate] CUDA initialization (Issue #908)
Hi @Afera672, would it be possible to upload the notebook you're using as a github gist so I can follow along exactly and clearly how things are going? Thanks!
Hi Zachary,
Of course I can send you the notebook. Can you simply send me an email address to send it to you?
Thanks!!! Andrea
PS our offer to pay a fee for your consulting services is still open as well.
--
From: Zachary Mueller @.> Date: Monday, December 19, 2022 at 19:41 To: huggingface/accelerate @.> Cc: Andrea Fera @.>, Mention @.> Subject: Re: [huggingface/accelerate] CUDA initialization (Issue #908)
@muellerzr sorry...this is me trying to send you the file...
@muellerzr I beg your pardon, Zach. I am not very well versed in this interface. I am trying to send it again now. No, it does not allow attaching .ipynb notebooks. I am sorry: it attaches this script only as a PDF. Here it is. I hope it is clear enough...
Thanks @Afera672, your issue is this line, you shouldn't be re-making dataloaders:
learn = unet_learner(resnet34,dls,
                     dls=TfmdDL(after_item=ToTensor(4,80,80),
                                after_batch=[IntToFloatTensor(),
                                             *aug_transforms()], bs=8))
To fix it (or at least get further), change that code to:
learn = unet_learner(dls, resnet34)
Thanks @muellerzr.
Yes, I am sorry: I actually posted the file before finding that it is indeed wrong. I found another issue as well, with the codes, and I am trying to fix it. My problem with such a simple dls is that I do not know how to apply the transforms or how to tell it to make all images a certain size. This is why I used a datablock before. What form does a datablock need to have here? The regular one? I'll try. Thanks!!
I'd recommend opening a thread on the fastai forums for more help, since the issue is with the framework more than Accelerate specifically :)
Thank you for the insightful suggestion, @muellerzr, but I believe I have a strange problem. After I run:
The problem is that Accelerate now needs me to use SegmentationDataLoaders, and I need to insert transformations, but I do not know how to do it. Can you send me an example with SegmentationDataLoaders where you insert the item transforms ('item_tfms' or 'after_item') in order to standardize the images seen by the algorithm? Here is what it says when I run notebook_launcher:
Thanks for your help!
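A possible shape for such an example is sketched below, assuming a fastai version where `SegmentationDataLoaders.from_label_func` accepts `item_tfms` and `batch_tfms`. The parameter names `image_dir`, `codes` and `size`, and the default of 80 pixels, are assumptions made for this illustration; the imports are deferred into the function only so that the sketch can be defined without fastai installed.

```python
def train(image_dir, label_func, codes, size=80, bs=8, epochs=10):
    """Build the DataLoaders and the Learner inside the function that gets launched."""
    # Deferred imports (assumption of this sketch): top-level imports work as well.
    import fastai.distributed  # noqa: F401  (adds Learner.distrib_ctx)
    from fastai.vision.all import (
        Resize, SegmentationDataLoaders, aug_transforms,
        get_image_files, resnet34, unet_learner,
    )

    dls = SegmentationDataLoaders.from_label_func(
        image_dir,
        fnames=get_image_files(image_dir),
        label_func=label_func,
        codes=codes,
        bs=bs,
        item_tfms=Resize(size),       # each image and mask resized to size x size
        batch_tfms=aug_transforms(),  # optional augmentation applied per batch
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(epochs)
```

With the names used earlier in this thread, it could be launched as `notebook_launcher(train, args=(path + '/Impng', get_msk, codes), num_processes=2)`, where `codes` is the list of class names.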
@muellerzr But if I start the computation on more than 2 GPUs, it crashes with out-of-memory errors:
Now, the reason I want to use many GPUs is exactly to avoid this sort of error. Do you have any idea how I could manage the memory and/or ask Accelerate to do it for us? We plan to have MANY images to train on, and to use at least ResNet50, while for now I am confined to ResNet34. Which is not bad, but... Thank you for your time!
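In data-parallel training each process keeps a full copy of the model on its own GPU, so adding GPUs does not pool their memory; what grows is the number of samples processed per step. Assuming `bs` is the batch size per process, the arithmetic can be sketched as follows. Both helper names are hypothetical, and this does not by itself explain the out-of-memory error above.

```python
def global_batch_size(per_process_bs: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_process_bs * num_processes


def per_process_batch_size(target_global_bs: int, num_processes: int) -> int:
    """Largest per-process batch size that does not exceed the target global batch."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return max(1, target_global_bs // num_processes)


print(global_batch_size(8, 2))        # 16
print(global_batch_size(8, 4))        # 32
print(per_process_batch_size(16, 4))  # 4
```

If a single GPU is already close to its memory limit, lowering the per-process batch size is the usual first thing to try before moving to a deeper network.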
I also encountered this problem and don't know how to solve it. I understand that CUDA must not be initialized before running notebook_launcher, but none of my earlier code initializes it or calls torch.cuda. What should we do?
ckpt_path = 'baichuan13b_ner'
optimizer = bnb.optim.adamw.AdamW(peft_model.parameters(), lr=6e-05,is_paged=True) #'paged_adamw'
keras_model = KerasModel(peft_model, loss_fn =None, optimizer=optimizer)
keras_model.load_ckpt(ckpt_path)
keras_model.fit_ddp(num_processes=2,
                    train_data=dl_train,
                    val_data=dl_val,
                    epochs=100,
                    patience=10,
                    monitor='val_loss',
                    mode='min',
                    ckpt_path=ckpt_path)
> ValueError Traceback (most recent call last)
> Cell In[30], line 12
> 9 keras_model.load_ckpt(ckpt_path)
> 11 # Train with multiple GPUs
> ---> 12 keras_model.fit_ddp(num_processes=2,
> 13 train_data=dl_train,
> 14 val_data=dl_val,
> 15 epochs=100,
> 16 patience=10,
> 17 monitor='val_loss',
> 18 mode='min',
> 19 ckpt_path=ckpt_path)
>
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/torchkeras/kerasmodel.py:282, in KerasModel.fit_ddp(self, num_processes, train_data, val_data, epochs, ckpt_path, patience, monitor, mode, callbacks, plot, wandb, quiet, mixed_precision, cpu, gradient_accumulation_steps)
> 279 from accelerate import notebook_launcher
> 280 args = (train_data,val_data,epochs,ckpt_path,patience,monitor,mode,
> 281 callbacks,plot,wandb,quiet,mixed_precision,cpu,gradient_accumulation_steps)
> --> 282 notebook_launcher(self.fit, args, num_processes=num_processes)
>
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/accelerate/launchers.py:116, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
> 113 from torch.multiprocessing.spawn import ProcessRaisedException
> 115 if len(AcceleratorState._shared_state) > 0:
> --> 116 raise ValueError(
> 117 "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
> 118 "inside your training function. Restart your notebook and make sure no cells initializes an "
> 119 "`Accelerator`."
> 120 )
> 122 if torch.cuda.is_initialized():
> 123 raise ValueError(
> 124 "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
> 125 "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
> 126 "function."
> 127 )
>
> ValueError: To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized inside your training function. Restart your notebook and make sure no cells initializes an `Accelerator`.
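The check at line 115 of the traceback fires when `AcceleratorState._shared_state` is non-empty, that is, when something created the shared state before the launcher ran. The toy model below illustrates that kind of check in plain Python. `SharedState` and `launch` are invented for this sketch and are not Accelerate's actual code.

```python
class SharedState:
    """Toy stand-in for a class whose instances all share one state dict."""
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        self.initialized = True


def launch(function):
    """Toy launcher: refuse to start if the shared state is already populated."""
    if len(SharedState._shared_state) > 0:
        raise ValueError("state was initialized outside the training function")
    try:
        return function()
    finally:
        # Toy only: a real launcher runs `function` in child processes,
        # so the parent's state is never touched by it.
        SharedState._shared_state.clear()


def train():
    SharedState()  # created inside the launched function: fine
    return "trained"


result = launch(train)  # works: nothing was created beforehand

SharedState()           # simulates a notebook cell that builds the object early
error = None
try:
    launch(train)
except ValueError as exc:
    error = str(exc)

print(result)
print(error)
```

By analogy, whichever earlier cell ends up creating an `Accelerator` (possibly indirectly, through a wrapper object) would need to run inside the launched function instead, after a kernel restart.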
System Info
Information
Tasks
Reproduction
Here is the script that I am using:
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model
from accelerate import notebook_launcher

def get_msk(o): return pathRflbl+fr'/RfM{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)] #as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)

file = open(path+'/codes.txt', "w+")
# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)
It all works until I use the notebook launcher. Then it comes up with:
ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19     with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20         learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "`Accelerator`."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA function.

Yet I have no CUDA instructions. And I need the notebook launcher in order to train on multiple GPUs (I have 6).
Do you have any ideas? Do I need to update some version of something?
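The traceback shows that the launcher's check is `torch.cuda.is_initialized()`. Calling the same function at the end of each notebook cell, after a kernel restart, may help narrow down which cell initializes CUDA even without an explicit CUDA call. This is only a diagnostic sketch; `cuda_initialized` and `checkpoint` are hypothetical helper names.

```python
def cuda_initialized():
    """Return torch.cuda.is_initialized(), or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_initialized()


def checkpoint(label):
    """Print the CUDA state under a label, for example once per notebook cell."""
    state = cuda_initialized()
    print(f"[{label}] CUDA initialized: {state}")
    return state


checkpoint("after imports")
```

The first cell whose checkpoint prints `True` is the one to look at; everything in it that touches the GPU would then move inside the training function.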
Expected behavior