hotshotco / Hotshot-XL

✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
https://hotshot.co
Apache License 2.0

Using --video_length= and/or --video_duration= always yields "Warning - setting num_images_per_prompt = 1..." #9

Closed: CCpt5 closed this issue 8 months ago

CCpt5 commented 8 months ago

Edit: It's working either way, and I've joined the Discord - I'll get help there if needed. Thanks!

-- I just cloned the repo and started playing with it, so forgive me if this is user error, but I wanted to mention that when I use the length and/or duration arguments, a warning appears saying "setting num_images_per_prompt = 1". I initially thought this would produce a single-frame GIF, but on checking again it does seem to output the correct number of frames. When I don't pass those arguments, the warning is not shown.

(venv) D:\SDXL\Hotshot-XL>python inference.py --video_length=24 --video_duration=3000 --prompt="Will Smith eating spaghetti, hd, high quality" --output "hotshottest24-3000.gif"
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  2.38it/s]
Warning - setting num_images_per_prompt = 1 because video_length = 24
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:28<00:00,  1.05it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 29.20it/s]

(venv) D:\SDXL\Hotshot-XL>
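For what it's worth, the warning appears to come from a guard in the pipeline that collapses generation to a single output per prompt whenever a multi-frame clip is requested, so the frame count itself is unaffected. A rough sketch of that kind of check (assumed logic, not the actual Hotshot-XL source):

```python
def resolve_num_images_per_prompt(video_length: int, num_images_per_prompt: int) -> int:
    # Sketch only: when a multi-frame clip is requested, force one output per prompt.
    if video_length > 1:
        # Mirrors the message seen in the console output above.
        print(f"Warning - setting num_images_per_prompt = 1 because video_length = {video_length}")
        return 1
    return num_images_per_prompt
```

In other words, the setting only controls how many separate GIFs each prompt produces; all 24 requested frames still end up in the single output.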

I also get a runtime error if I try to generate a length over ~3 seconds (24 frames), but I assume that's due to VRAM/resource limits. (I'm running a 4090 with 24 GB VRAM and 64 GB of system RAM.)

(venv) D:\SDXL\Hotshot-XL>python inference.py --video_length=40 --video_duration=5000 --prompt="Will Smith eating spaghetti, hd, high quality" --output "hotshottest40-5000.gif"
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  2.34it/s]
Warning - setting num_images_per_prompt = 1 because video_length = 40
  0%|                                                                                                                                                                                                                | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\SDXL\Hotshot-XL\inference.py", line 223, in <module>
    main()
  File "D:\SDXL\Hotshot-XL\inference.py", line 203, in main
    images = pipe(args.prompt,
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\pipelines\hotshot_xl_pipeline.py", line 825, in __call__
    noise_pred = self.unet(
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\unet.py", line 849, in forward
    sample, res_samples = downsample_block(hidden_states=sample,
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\unet_blocks.py", line 475, in forward
    hidden_states = temporal_attention(hidden_states, encoder_hidden_states=encoder_hidden_states)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\transformer_temporal.py", line 123, in forward
    hidden_states = block(hidden_states, encoder_hidden_states=encoder_hidden_states, number_of_frames=f)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\transformer_temporal.py", line 181, in forward
    hidden_states = block(
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\transformer_temporal.py", line 59, in forward
    hidden_states = self.pos_encoder(hidden_states, length=number_of_frames)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\SDXL\Hotshot-XL\hotshot_xl\models\transformer_temporal.py", line 47, in forward
    hidden_states = hidden_states + self.positional_encoding[:, :length]
RuntimeError: The size of tensor a (40) must match the size of tensor b (24) at non-singleton dimension 1
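Judging from the traceback itself, this particular failure looks like a limit in the temporal positional encoding rather than a VRAM issue: the encoding buffer appears to cover 24 frame positions, so a 40-frame request cannot be broadcast against it. A minimal standalone reproduction of that kind of shape mismatch (assumed sizes, not the real model code):

```python
import torch

# Assumed: the temporal positional-encoding table is built for at most 24 frames.
max_frames, channels = 24, 320
positional_encoding = torch.zeros(1, max_frames, channels)

# --video_length=40 asks for 40 frames of hidden states.
frames = 40
hidden_states = torch.randn(1, frames, channels)

# Slicing a 24-position table with [:, :40] still returns only 24 positions,
# so the addition raises the same kind of RuntimeError as above.
hidden_states = hidden_states + positional_encoding[:, :frames]
```

So requests up to the encoding's frame limit succeed, and anything beyond it fails at this addition before memory ever becomes the bottleneck.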