openai / glide-text2im

GLIDE: a diffusion-based text-conditional image synthesis model
MIT License

While running the clip_guided notebook in CPU mode I get: "RuntimeError - Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead" #28

Closed illtellyoulater closed 2 years ago

illtellyoulater commented 2 years ago

When I run the clip_guided notebook in CPU mode, I get the following error at the "Sample from the base model" cell:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9272/4093479580.py in <module>
     20 # Sample from the base model.
     21 model.del_cache()
---> 22 samples = diffusion.p_sample_loop(
     23     model,
     24     (batch_size, 3, options["image_size"], options["image_size"]),

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
    387         """
    388         final = None
--> 389         for sample in self.p_sample_loop_progressive(
    390             model,
    391             shape,

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample_loop_progressive(self, model, shape, noise, clip_denoised, denoised_fn, cond_fn, model_kwargs, device, progress)
    439             t = th.tensor([i] * shape[0], device=device)
    440             with th.no_grad():
--> 441                 out = self.p_sample(
    442                     model,
    443                     img,

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in p_sample(self, model, x, t, clip_denoised, denoised_fn, cond_fn, model_kwargs)
    351         )  # no noise when t == 0
    352         if cond_fn is not None:
--> 353             out["mean"] = self.condition_mean(cond_fn, out, x, t, model_kwargs=model_kwargs)
    354         sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
    355         return {"sample": sample, "pred_xstart": out["pred_xstart"]}

c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in condition_mean(self, cond_fn, *args, **kwargs)
     95 
     96     def condition_mean(self, cond_fn, *args, **kwargs):
---> 97         return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
     98 
     99     def condition_score(self, cond_fn, *args, **kwargs):

c:\users\alf\downloads\glide-text2im\glide_text2im\gaussian_diffusion.py in condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs)
    287         This uses the conditioning strategy from Sohl-Dickstein et al. (2015).
    288         """
--> 289         gradient = cond_fn(x, t, **model_kwargs)
    290         new_mean = p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float()
    291         return new_mean

c:\users\alf\downloads\glide-text2im\glide_text2im\respace.py in __call__(self, x, ts, **kwargs)
    122         new_ts_2 = map_tensor[ts.ceil().long()]
    123         new_ts = th.lerp(new_ts_1, new_ts_2, frac)
--> 124         return self.model(x, new_ts, **kwargs)

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in cond_fn(x, t, grad_scale, **kwargs)
     57             with torch.enable_grad():
     58                 x_var = x.detach().requires_grad_(True)
---> 59                 z_i = self.image_embeddings(x_var, t)
     60                 loss = torch.exp(self.logit_scale) * (z_t * z_i).sum()
     61                 grad = torch.autograd.grad(loss, x_var)[0].detach()

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\model_creation.py in image_embeddings(self, images, t)
     47 
     48     def image_embeddings(self, images: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
---> 49         z_i = self.image_encoder((images + 1) * 127.5, t)
     50         return z_i / (torch.linalg.norm(z_i, dim=-1, keepdim=True) + 1e-12)
     51 

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, image, timesteps, return_probe_features)
    483     ) -> torch.Tensor:
    484         n_batch = image.shape[0]
--> 485         h = self.blocks["input"](image, t=timesteps)
    486 
    487         for i in range(self.n_xf_blocks):

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

c:\users\alf\downloads\glide-text2im\glide_text2im\clip\encoders.py in forward(self, x, t)
    124             self.pred_state[None, None].expand(x.shape[0], -1, -1)
    125             if self.n_timestep == 0
--> 126             else F.embedding(cast(torch.Tensor, t), self.w_t)[:, None]
    127         )
    128         x = torch.cat((sot, x), dim=1) + self.w_pos[None]

~\.conda\envs\glide-text2im\lib\site-packages\torch\nn\functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1850         # remove once script supports set_grad_enabled
   1851         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1852     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1853 
   1854 

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.FloatTensor instead (while checking arguments for embedding)

Can anyone help? Thanks!

woctezuma commented 2 years ago

Not sure what is happening here, but you should use a GPU if possible.

See the comment in the notebook:

# This notebook supports both CPU and GPU.
# On CPU, generating one sample may take on the order of 20 minutes.
# On a GPU, it should be under a minute.

By that estimate, CPU mode takes at least about 20 times longer than GPU mode.

illtellyoulater commented 2 years ago

I know, but my current GPU doesn't have enough VRAM... that's why I was running in CPU mode. I'm getting a new GPU soon, but I think it would still be cool if this worked on CPU too...

woctezuma commented 2 years ago

Yes, sure. In the meantime, try to use a free GPU on Google Colab.

illtellyoulater commented 2 years ago

@woctezuma I finally got hold of a new GPU with 6 GB VRAM... so I am now running the clip_guided notebook again in GPU mode, but I am seeing exactly the same error I documented above...

woctezuma commented 2 years ago

Try:

illtellyoulater commented 2 years ago

Thanks! I saw them already, but I don't have the necessary knowledge of ML and the related libraries to make proper use of them... I also already tried blindly playing with those types and their conversion, but without success... Honestly, I doubt I can come up with something useful by myself... 🤷‍♂️

woctezuma commented 2 years ago

It could be just a simple change of this line:

https://github.com/openai/glide-text2im/blob/9cc8e563851bd38f5ddb3e305127192cb0f02f5c/glide_text2im/clip/encoders.py#L123-L127

You could try to replace:

F.embedding(cast(torch.Tensor, t), self.w_t)

with either:

F.embedding(cast(torch.Tensor, t.long()), self.w_t)

or:

F.embedding(cast(torch.Tensor, t).long(), self.w_t)
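To see why the cast helps, here is a minimal standalone sketch of the same failure outside GLIDE. The traceback shows the timesteps passing through `th.lerp` in respace.py, so they presumably reach `F.embedding` as floats; that part is an inference from the traceback, not something checked against the repository. The table `w_t` below is a hypothetical stand-in for `self.w_t`, with made-up sizes.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for self.w_t: 10 timesteps, 4-dimensional embeddings.
w_t = torch.randn(10, 4)

# Timesteps as floats, which is how they appear to arrive in the traceback
# (assumption: they were interpolated upstream and never cast back).
t = torch.tensor([3.0, 7.0])

# F.embedding only accepts integer index tensors, so this lookup is rejected.
try:
    F.embedding(t, w_t)
    float_lookup_failed = False
except RuntimeError as err:
    float_lookup_failed = True
    print("float indices rejected:", err)

# Casting the indices to int64 makes the lookup valid.
emb = F.embedding(t.long(), w_t)
print(emb.shape)
```

Note that `.long()` truncates toward zero, so a fractional timestep such as 3.5 would select row 3 rather than an interpolated embedding.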

illtellyoulater commented 2 years ago

Ok, thanks! Now it works, at least in CPU mode! In GPU mode a completely black image is generated (at some point the tensors become NaN), but I'll open another thread for that, as it must be caused by a different problem.