TextToVideoZeroPipeline unusable after diffusers 0.24 update

Describe the bug

Hello diffusers team,

First of all, thanks for your awesome work!

Here's my problem: after upgrading to diffusers 0.24, I can no longer load a model into the TextToVideoZeroPipeline.

.from_pretrained fails with AttributeError: 'bool' object has no attribute '__module__'. Did you mean: '__mod__'?

Downgrading diffusers to 0.23.1 and peft from 0.7.0 to 0.6.2 makes the following snippet work again.

This looks similar to 5992, but with the TextToVideoZeroPipeline.

Am I the only one encountering this behavior?

Reproduction

import imageio  # needed for imageio.mimsave below
import torch
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float32).to("cpu")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)

Logs

(env) woolverine@testdeploy:~/test$ python3 ./test.py
Loading pipeline components...:   0%|                                                                                                                                                        | 0/7 [00:00<?, ?it/s]`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  5.95it/s]
Traceback (most recent call last):
  File "/home/woolverine/test/./test.py", line 5, in <module>
    pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float32).to("cpu")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woolverine/test/env/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 1357, in from_pretrained
    model = pipeline_class(**init_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woolverine/test/env/lib/python3.11/site-packages/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py", line 314, in __init__
    super().__init__(
  File "/home/woolverine/test/env/lib/python3.11/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 236, in __init__
    self.register_modules(
  File "/home/woolverine/test/env/lib/python3.11/site-packages/diffusers/pipelines/pipeline_utils.py", line 566, in register_modules
    library = not_compiled_module.__module__.split(".")[0]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'bool' object has no attribute '__module__'. Did you mean: '__mod__'?

System Info

OS: Debian 12 (VirtualBox virtual machine)
Python version: 3.11.2
Diffusers: 0.24
PyTorch: 2.1.0+cpu

Who can help?

No response

Following the advice of @TonyLianLong in PR 5993, I can confirm that the following modification seems to solve the initial problem.

diff --git a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py
index 0f9ffbeb..08a0136d 100644
--- a/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py
+++ b/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_zero.py
@@ -9,3 +9,3 @@ import torch.nn.functional as F
 from torch.nn.functional import grid_sample
-from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer
+from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModelWithProjection

@@ -311,2 +311,3 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
         feature_extractor: CLIPImageProcessor,
+        image_encoder: CLIPVisionModelWithProjection = None,
         requires_safety_checker: bool = True,
@@ -314,3 +315,11 @@ class TextToVideoZeroPipeline(StableDiffusionPipeline):
         super().__init__(
-            vae, text_encoder, tokenizer, unet, scheduler, safety_checker, feature_extractor, requires_safety_checker
+            vae,
+            text_encoder,
+            tokenizer,
+            unet,
+            scheduler,
+            safety_checker=safety_checker,
+            feature_extractor=feature_extractor,
+            image_encoder=image_encoder,
+            requires_safety_checker=requires_safety_checker
         )
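
To illustrate why the diff above helps, here is a minimal sketch of the failure mode, assuming the cause is that the parent __init__ gained an image_encoder parameter ahead of requires_safety_checker, so the positional call in the subclass binds the bool to image_encoder. The classes Component, BasePipeline, PositionalChild and KeywordChild are hypothetical stand-ins, not diffusers code.

```python
class Component:
    """Hypothetical stand-in for a model component (VAE, UNet, ...)."""


class BasePipeline:
    """Toy base class: image_encoder sits before the trailing bool flag."""

    def __init__(self, unet, safety_checker, feature_extractor,
                 image_encoder=None, requires_safety_checker=True):
        self.registered = {}
        components = {
            "unet": unet,
            "safety_checker": safety_checker,
            "feature_extractor": feature_extractor,
            "image_encoder": image_encoder,
        }
        for name, module in components.items():
            if module is None:
                continue  # optional components may be absent
            # Same kind of attribute access that fails in the traceback.
            self.registered[name] = module.__module__.split(".")[0]
        self.requires_safety_checker = requires_safety_checker


class PositionalChild(BasePipeline):
    def __init__(self, unet, safety_checker, feature_extractor,
                 requires_safety_checker=True):
        # The bool is the fourth positional argument, so it binds to
        # image_encoder instead of requires_safety_checker.
        super().__init__(unet, safety_checker, feature_extractor,
                         requires_safety_checker)


class KeywordChild(BasePipeline):
    def __init__(self, unet, safety_checker, feature_extractor,
                 image_encoder=None, requires_safety_checker=True):
        # Keyword arguments keep working if the parent signature grows.
        super().__init__(
            unet,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
            image_encoder=image_encoder,
            requires_safety_checker=requires_safety_checker,
        )


def build(cls):
    """Instantiate a toy pipeline with three dummy components."""
    return cls(Component(), Component(), Component())


try:
    build(PositionalChild)
except AttributeError as err:
    print("positional forwarding failed:", err)

pipe = build(KeywordChild)
print(sorted(pipe.registered))
```

The keyword-based subclass mirrors the approach taken in the diff: each argument is bound by name, so a parameter added to the parent signature cannot silently shift the ones that follow it.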
