rhymes-ai / Allegro

Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.
https://rhymes.ai/
Apache License 2.0

Windows - RuntimeError: No available kernel. Aborting execution. #25

Open · SoftologyPro opened 1 month ago

SoftologyPro commented 1 month ago

Trying to get this working under Windows.

I clone the repository, create a new venv, and try to install requirements.txt. xformers fails with

Collecting xformers==0.0.28.post1
  Downloading xformers-0.0.28.post1.tar.gz (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 6.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\Jason\AppData\Local\Temp\pip-install-hg2meh3o\xformers_89fe3807baaa4f888830dbd3996a3b04\setup.py", line 24, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

If I install torch first, before the requirements, it still fails. So I remove xformers from requirements.txt and let the rest of the requirements finish. Once they are done, I install xformers and torch using...

pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip uninstall -y torch
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts torch==2.4.0+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
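The original failure happens because the xformers sdist's setup.py imports torch at metadata-generation time, so a source build in a fresh venv cannot succeed; pulling prebuilt wheels from the PyTorch index avoids the source build entirely. A quick sanity check after the installs (a sketch, not part of the repo):

# Verify that matching CUDA builds of torch and xformers ended up installed.
import torch
import xformers

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("xformers:", xformers.__version__)
print("cuda available:", torch.cuda.is_available())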

Then when I run single_inference I get

  0%|                                                                                                                                                                                                                                                | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Tests\Allegro\Allegro\single_inference.py", line 99, in <module>
    single_inference(args)
  File "D:\Tests\Allegro\Allegro\single_inference.py", line 65, in single_inference
    out_video = allegro_pipeline(
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\allegro\pipelines\pipeline_allegro.py", line 773, in __call__
    noise_pred = self.transformer(
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\allegro\models\transformers\transformer_3d_allegro.py", line 331, in forward
    hidden_states = block(
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\allegro\models\transformers\block.py", line 1093, in forward
    attn_output = self.attn1(
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\venv\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Tests\Allegro\Allegro\allegro\models\transformers\block.py", line 553, in forward
    return self.processor(
  File "D:\Tests\Allegro\Allegro\allegro\models\transformers\block.py", line 824, in __call__
    hidden_states = F.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

What version of xformers and torch do I need to get this to work under Windows?
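One way to see what is going on is to probe which scaled_dot_product_attention backends the installed torch build can actually execute on the GPU (a standalone sketch, not part of the repo); on Windows wheels the flash-attention backend is usually the one that is missing:

# Probe which SDPA backends this torch build supports on the current GPU.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # torch >= 2.3

q = k = v = torch.randn(1, 8, 16, 64, device="cuda", dtype=torch.float16)
for backend in (SDPBackend.FLASH_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.MATH):
    try:
        with sdpa_kernel(backend):  # restrict SDPA to this one backend
            F.scaled_dot_product_attention(q, k, v)
        print(backend, "-> OK")
    except RuntimeError as err:     # "No available kernel" lands here
        print(backend, "->", err)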

MinervaArgus commented 1 month ago

I successfully installed matching versions of torch (a CUDA 12.4 build, from the official torch index) and xformers 0.0.28.post1, and I still get this error.

Grownz commented 1 month ago

Check issue https://github.com/rhymes-ai/Allegro/issues/17

SoftologyPro commented 1 month ago

Changing line 824 in Allegro/allegro/models/transformers/block.py from with sdpa_kernel(SDPBackend.FLASH_ATTENTION): to with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True): gets past the "No available kernel" error. But it then showed an estimated 2 hours 40 minutes to finish on a 4090; in the end it took over 3 hours with the default 5-second settings.
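For reference, the edit looks roughly like this (a sketch; the actual tensor names passed to F.scaled_dot_product_attention inside block.py may differ). It stops forcing the flash-attention backend, which has no compiled kernel in Windows torch wheels, and lets SDPA fall back to the math or memory-efficient kernels:

# allegro/models/transformers/block.py, around line 824 (sketch)
# Before: forces flash attention, which the Windows torch build cannot run.
#   with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
#       hidden_states = F.scaled_dot_product_attention(query, key, value)
# After: disable flash, allow the math and memory-efficient backends.
# (torch.backends.cuda.sdp_kernel is the older context-manager API used in
# this thread; newer torch releases prefer torch.nn.attention.sdpa_kernel.)
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=True
):
    hidden_states = F.scaled_dot_product_attention(query, key, value)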

SoftologyPro commented 1 month ago

Changing line 13 in single_inference.py from dtype=torch.bfloat16 to dtype=torch.float16 (as also suggested in https://github.com/rhymes-ai/Allegro/issues/17) gives an estimated 19 hours(!!), so do not try that change.
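For anyone comparing, the two variants differ only in this one line (a sketch; the line number follows the comment above, and the timings are the ones reported in this thread):

# single_inference.py, line 13 (as referenced above)
dtype = torch.bfloat16   # default; the faster option per the timings above
# dtype = torch.float16  # suggested in issue #17, but reportedly ~19 h here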

SoftologyPro commented 1 month ago

Are there any other possible ways we can get this down to a reasonable time on a 24GB consumer GPU?

randaller commented 1 month ago

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):

this helped, but it took about 4 hours to finish on a 3090 :) with --enable_cpu_offload


SoftologyPro commented 1 month ago

Adding the --enable_cpu_offload argument to single_inference.py gets the estimated time down to 1 hour 40 minutes on a 24GB 4090.
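For context, CPU offload in diffusers-style pipelines keeps each submodule (text encoder, transformer, VAE) on the CPU and moves it to the GPU only while it runs, trading PCIe transfer time for VRAM headroom. A minimal sketch of how the flag is likely wired up, assuming the Allegro pipeline inherits diffusers' DiffusionPipeline (the flag name is taken from the comment above; the exact wiring in single_inference.py may differ):

# single_inference.py (sketch): wire the CLI flag to diffusers' offload hook.
# enable_model_cpu_offload() moves each submodule to the GPU only for its
# forward pass, cutting peak VRAM at some speed cost.
if args.enable_cpu_offload:
    allegro_pipeline.enable_model_cpu_offload()
else:
    allegro_pipeline.to("cuda")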

nightsnack commented 1 month ago

@SoftologyPro That seems to make sense. I tested on an H100 with enable-cpu-offload: a single 100-step video takes 1h10min. That's why I wrote that the inference time will increase significantly. Btw, do you have more than one 4090? I'm going to release the multi-card inference code. Context parallelism seems to help a lot with 4090s.

SoftologyPro commented 1 month ago

No, I only have a single 4090. This interest came from a request for me to support Allegro in Visions of Chaos. But if it takes 2 hours on the best consumer GPU it is too slow for local Windows. If some speed breakthrough is made I will be happy to include it.

nightsnack commented 1 month ago

@SoftologyPro Currently I have no idea. One option is distillation to reduce the number of inference steps, e.g. from 100 steps down to 4, but that harms the quality severely.