graphcore / poptorch

PyTorch interface for the IPU
https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/
MIT License

Compiling extremely slow and hangs at around 84/100 #8

Closed. Lime-Cakes closed this issue 1 year ago.

Lime-Cakes commented 1 year ago

There is no error message. I'm trying to compile a training model using either .compile or .compileAndExport; both result in the same thing. It compiles slowly until it reaches around 84/100, where it hangs completely, with CPU usage dropping to zero and remaining that way. This was tested on Paperspace, using the docker image 'graphcore/pytorch-jupyter:3.0.0-ubuntu-20.04-20221025', which should be the latest version of the SDK. The only change to the environment was the installation of diffusers, optimum-graphcore, and ipywidgets>=7,<8.

/usr/local/lib/python3.8/dist-packages/optimum/graphcore/ipu_configuration.py:148: UserWarning: The "enable_half_first_order_momentum" parameter is deprecated
  warnings.warn('The "enable_half_first_order_momentum" parameter is deprecated')
/usr/local/lib/python3.8/dist-packages/optimum/graphcore/ipu_configuration.py:140: UserWarning: The "sharded_execution_for_inference" parameter is deprecated, sharded execution is always used during inference
  warnings.warn(
[12:13:07.892] [poptorch::python] [warning] No device set in torch.randn(): forcing to IPU
[12:13:07.896] [poptorch::python] [warning] No device set in torch.randn(): forcing to IPU
[12:13:07.897] [poptorch::python] [warning] No device set in torch.randint(): forcing to IPU
Graph compilation:  84%|████████▍ | 84/100 [1:09:59<20:26]  
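For reference, a minimal sketch of the compile path being described (the toy model and inputs below are placeholders, not the notebook's Stable Diffusion code):

```python
import torch
import poptorch

# Placeholder model: poptorch training models return the loss from forward().
class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, x, target):
        out = self.fc(x)
        return out, self.loss(out, target)

opts = poptorch.Options()
model = ToyModel()
optimizer = poptorch.optim.AdamW(model.parameters(), lr=1e-4)
training_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)

x = torch.randn(8, 16)
target = torch.randint(0, 4, (8,))

# Either call triggers the graph compilation that stalls at ~84/100.
training_model.compile(x, target)
# training_model.compileAndExport("model.poptorch", x, target)
```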
callumm-graphcore commented 1 year ago

Hi @Lime-Cakes, in order to solve this we'll need more information about what you're trying to do. In particular, it'd be ideal if you could share the exact code you're running so we can reproduce the issue and investigate.

Long compilation times can sometimes occur when your model is too big for the number of IPUs you are using. You may wish to consult our Memory and Performance Optimisation guide for advice on how to reduce memory use. There is a section in there about reducing compilation time but I think your first step should be making sure your model fits.

Memory and Performance Optimisation Guide: https://docs.graphcore.ai/projects/memory-performance-optimisation/en/latest/index.html
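For a rough idea of the kinds of options that guide covers (the values below are illustrative only, not tuned recommendations for this model):

```python
import poptorch

opts = poptorch.Options()

# Lower the memory budget reserved for matmuls/convolutions on each IPU.
opts.setAvailableMemoryProportion({"IPU0": 0.1, "IPU1": 0.1})

# Accumulate gradients over several micro-batches so each step stays small.
opts.Training.gradientAccumulation(8)

# The model can also be pipelined across IPUs by annotating where each stage
# begins, e.g. poptorch.BeginBlock(submodule, ipu_id=1) on a submodule.
```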

callumm-graphcore commented 1 year ago

To follow up on this: it should be possible for you to download your notebook from Gradient, which you could upload to Github as a gist. Alternatively, you could make the notebook public and share a link here.

Lime-Cakes commented 1 year ago

stable_diffusion_training-error.zip

The zip file has a standalone notebook that contains everything. I stripped out the dataloader and optimum trainer to make sure it was getting stuck in the compiler. It uses base SD 1.4 (diffusers format). I tested a few different settings, all of which seemed to hang, just at different percentages. All take an extremely long time to compile and then hang. The notebook was tested with graphcore/pytorch-jupyter:3.0.0-ubuntu-20.04-20221025.

Edit to supply more information: I based the training code on optimum's SD inference example. The inference example does compile, but takes around half an hour; I'm not sure whether that time is normal. Most poptorch examples I've found seemed to take a long time to compile (>15 min).
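(As an aside on repeated long compiles: poptorch can cache the compiled executable between runs, which doesn't speed up the first compilation but avoids paying the cost again. The cache directory name below is just an example.)

```python
import poptorch

opts = poptorch.Options()
# Reuse a previously compiled executable if the graph is unchanged.
opts.enableExecutableCaching("./poptorch_cache")
```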

callumm-graphcore commented 1 year ago

Thanks @Lime-Cakes, I'll take a look at this and get back to you when I know more.

callumm-graphcore commented 1 year ago

Hi @Lime-Cakes, sorry for the delay here, but it looks like you've taken our Stable Diffusion example for inference and tried to directly port this to training. I'm afraid I don't think this will work - training will require more memory (gradients, running averages for Adam, activations need to be saved for the backwards pass) and so the current split of the model across IPUs probably won't be adequate.

I will try to find out if anyone has successfully gotten training for Stable Diffusion working on IPUs.

If you want to try figuring out a way to get it working yourself, we have some examples that might be useful: here's UNet training (in TF2) with a diagram of how that model is split up, and there are a lot of examples of transformer training in the optimum-graphcore repo that could be useful. (I believe Stable Diffusion is a combination of UNet and a Transformer - please let me know if I'm wrong.) We also have an extensive guide to Memory and Performance Optimisation on the IPU.
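As a very rough sketch of the kinds of training-side adjustments involved (illustrative only; an actual Stable Diffusion split across IPUs would need profiling):

```python
import poptorch

opts = poptorch.Options()

# Accumulate gradients over micro-batches so activations kept for the
# backwards pass stay small.
opts.Training.gradientAccumulation(16)

# Keep optimizer state (e.g. Adam's running averages) in off-chip
# Streaming Memory rather than in IPU memory.
opts.TensorLocations.setOptimizerLocation(
    poptorch.TensorLocationSettings().useOnChipStorage(False)
)
```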

Please let me know if there is anything else I can do to help.

With thanks, Callum

Lime-Cakes commented 1 year ago

Thanks for the update.