thorinf / simple-diffusion-lm

A simple Diffusion LM approach.
MIT License

Runtime error #2

Closed ShyFoo closed 1 year ago

ShyFoo commented 1 year ago

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [512, 50]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

thorinf commented 1 year ago

Have any modifications to the code been made? I'm not sure where a Tensor of width 50 is. All the Tensors that need to be backpropagated should be floats.

ShyFoo commented 1 year ago

> Have any modifications to the code been made? I'm not sure where a Tensor of width 50 is. All the Tensors that need to be backpropagated should be floats.

Yes. I changed `input_ids.masked_fill` to the in-place `input_ids.masked_fill_`, and found that this operation caused the above RuntimeError. Here are two lines of code that work:

    input_ids = input_ids.data.clone()
    input_ids.masked_fill_(torch.logical_not(diff_mask), -100)
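
Alternatively, I guess the out-of-place masked_fill would also avoid the in-place autograd issue (a sketch of the same idea; targets is just a placeholder name):

    targets = input_ids.masked_fill(torch.logical_not(diff_mask), -100)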

What is the meaning of "w"? I replaced it with input_ids, so its size is [512, 50]. If I want to add conditional features to the diffusion model, what should I do? Replace self_condition with them?

thorinf commented 1 year ago

I'd used w for word, but I agree that input_ids is a much better variable name. I think I'll update to that, thanks.

It might depend on the conditioning you want to do. What is the task you want to do, roughly?

What's currently implemented is a simpler version of CDCD, but only for training. There are 3 inputs concatenated together: the noised sequence, the conditional sequence, and the self-conditioning sequence. For indexes where we are conditioning, the noised sequence and self-conditioning sequence are zero vectors; for indexes where we are not conditioning, the conditional sequence is zero vectors. In the reverse diffusion method the conditional sequence is set to a zero tensor, so it's currently hard-coded not to do conditional inference. I plan to change this, but I'm currently working on getting better training runs.
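
Roughly, the concatenation looks something like this (just a sketch; the tensor and argument names are mine, not the exact ones in the repo):

    import torch

    def build_model_input(x_t, x_cond, x_self_cond, cond_mask):
        # cond_mask: (batch, seq_len) bool, True where we condition on the ground truth
        mask = cond_mask.unsqueeze(-1)
        noised = x_t.masked_fill(mask, 0.0)             # zeroed where conditioning
        self_cond = x_self_cond.masked_fill(mask, 0.0)  # zeroed where conditioning
        conditional = x_cond.masked_fill(~mask, 0.0)    # zeroed where not conditioning
        return torch.cat([noised, conditional, self_cond], dim=-1)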

If you are looking to condition in another way, like for NMT, then the method is a bit different.

In other papers, the conditioning method is a little different and may be better. Some implementations just re-inject the conditional vectors at each step of reverse diffusion.
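
Re-injection at inference would look roughly like this (again just a sketch; reverse_step and the conditioning tensors here are hypothetical):

    import torch

    def sample_with_reinjection(model, x_T, x_cond, cond_mask, timesteps):
        x_t = x_T
        for t_now, t_next in timesteps:
            # overwrite the conditioned positions with the known embeddings before every step
            x_t = torch.where(cond_mask.unsqueeze(-1), x_cond, x_t)
            x_0_hat = model(x_t, t_now)
            x_t = reverse_step(x_0_hat, t_now, t_next)  # hypothetical reverse update
        return x_t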

ShyFoo commented 1 year ago

I want to do a captioning task with it. Which do you think is better: replacing "x_cond" with the conditional vectors I want, or concatenating them directly?

ShyFoo commented 1 year ago

Another question: "x_cond" (i.e. the x_0 embedding) is not available in the reverse process, so is it still necessary to use it in training? Thanks for your helpful advice.

ShyFoo commented 1 year ago

> Another question: "x_cond" (i.e. the x_0 embedding) is not available in the reverse process, so is it still necessary to use it in training? Thanks for your helpful advice.

I noticed that you use a low-probability, fully random conditional mask to compute the loss, so most of the positions of "x_cond" are filled with zeros. Is this to approximate the inference process?

thorinf commented 1 year ago
  1. Captioning an image?

  2. It's not necessary for training, but the idea is that in the reverse process it will be used to supply conditional inputs.

  3. It's more of a demo/placeholder until conditioning is properly implemented. In CDCD they use different types of masking so that they can do conditioning at inference. This is somewhere I'm looking to move to, but currently the model I have trained isn't performing as well as I want it to.

Generally I would recommend training the model without modifications first; if you have done so, I would really appreciate knowing the results you got.

ShyFoo commented 1 year ago

Yes. I'm trying to inject images as conditions into the diffusion model, but my results are not good in the Diffusion-LM setting. I have no idea how much improvement self-conditioning will bring. I will try replacing "x_cond" with the conditional vectors and see what the results are.

thorinf commented 1 year ago

Do you have results for generating text, without image conditioning?

ShyFoo commented 1 year ago

Not yet. I'm conducting an experiment now.

ShyFoo commented 1 year ago

In terms of diff_loss and recon_loss, both are converging.

thorinf commented 1 year ago

Ok, let me know how the run goes.

WRT image embedding conditioning, it's really up to you. I'm not sure what the best techniques are for this; maybe adding the embedding, concatenating it, or mixing it in via attention. The architecture of the Diffusion LM shouldn't restrict how you do this.
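
The simplest variant is probably just adding a projected image embedding at every position, something like this (a sketch; the layer and dimension names are assumptions on my side):

    import torch.nn as nn

    # image_embed_dim, model_dim, image_embedding and x are placeholders
    image_proj = nn.Linear(image_embed_dim, model_dim)
    cond = image_proj(image_embedding).unsqueeze(1)  # (batch, 1, model_dim)
    x = x + cond                                     # broadcast over the sequence length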

ShyFoo commented 1 year ago

RuntimeError: Function 'AddmmBackward0' returned nan values in its 2th output. Sometimes it reports such an error.

ShyFoo commented 1 year ago

I think your mask strategy has room for improvement, because I often find some samples in a batch with masks that are entirely False, which may reduce training efficiency.

thorinf commented 1 year ago

I found that training is more stable with weight decay set. The default of 0.01 seems to work, but other values may be better. With it set to 0.0 I did find that training would NaN eventually.
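
i.e. something like this (the learning rate here is just an example value):

    import torch

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)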

The mask is inverted, so where it's True, conditioning is used. A fully False mask means that no conditioning is used, so all of the tokens in that sequence are used in the loss.
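
For reference, the mask generation is roughly this (a sketch, not the exact code; p is the conditioning probability):

    import torch

    def make_conditional_mask(batch, seq_len, p=0.1, device='cpu'):
        # True = position is conditioned on (excluded from the loss),
        # False = position is noised and contributes to the loss
        return torch.rand(batch, seq_len, device=device) < p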

ShyFoo commented 1 year ago

OK. I will try the steps you suggested. But my model has not converged so far; the reconstruction loss is still high even after changing the learning rate.

ShyFoo commented 1 year ago

Hello. I have a question about the time embedding part of the Transformer layer. What are the roles of the gammas and betas? I noticed that previous methods add time embeddings to x_0 directly. Are they used to scale the weights of the various layers?

thorinf commented 1 year ago

There are some notes in the readme about the gamma/beta params. It's taken from the CDCD framework, which is inspired by FiLM. Concatenating/adding the time embeddings instead is also an option. Concatenating and adding are fairly equivalent, since proj(cat(X, embT)) is similar to proj_A(X) + proj_B(embT).
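
In code the FiLM-style modulation is roughly this (a sketch; the module and parameter names are illustrative, not the repo's exact layers):

    import torch.nn as nn

    class FiLMTime(nn.Module):
        def __init__(self, dim, time_dim):
            super().__init__()
            self.to_gamma_beta = nn.Linear(time_dim, dim * 2)

        def forward(self, x, t_emb):
            # scale (gamma) and shift (beta) each channel based on the time embedding
            gamma, beta = self.to_gamma_beta(t_emb).chunk(2, dim=-1)
            return x * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)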

ShyFoo commented 1 year ago

I got it. An interesting thing I noticed is that you don't use q(x_{t-1} | x_0, x_t) in inference. In standard Gaussian diffusion models we always use the posterior estimate to reverse the diffusion process, but you don't do that. I thought q(x_{t-1} | x_0, x_t) might be equivalent to self-conditioning. Wouldn't that lead to more estimation errors?

ShyFoo commented 1 year ago

In my preliminary experiments this model does not seem to perform well on the image captioning task; the generation results are poor. I have tried adjusting some hyperparameters, such as the learning rate and embedding dimension, but I still get a low BLEU-4 score.

thorinf commented 1 year ago

I'm not sure what you mean by inference not using the posterior. Which line(s) does this refer to? Do you have a suggested change? The algorithm I've used is meant to be the same as Analog Bits. The main input to the estimator is 3 tensors concatenated: x_t, conditioning, and self-conditioning. x_t is the output from the previous sampling step. I have not implemented conditioning at inference, so this tensor is set to zero.

Does the model at least generate captions that have relevant words, even if the sequence doesn't make sense? This model takes a lot longer to train than an auto-regressive language model.

ShyFoo commented 1 year ago

It might be a good idea to read the Diffusion-LM paper. In that paper, q(x_{t-1} | x_0, x_t) is used to estimate the next step. I know that x_t is the output from the previous sampling step. And self-conditioning? Doesn't it also come from the previous sampling step at inference?

ShyFoo commented 1 year ago

COCO_val2014_000000249623 generated text: a teddy bear going up see the readingers real shelf. It seems to have some relevant words, like bear, but there isn't actually a teddy bear.

thorinf commented 1 year ago

I have read the Diffusion-LM paper. The q is not used to estimate the next step; q is the forward-diffusion process. The algorithm is outlined in the final paragraph of section 4.2: the model predicts the embeddings directly, they are clamped to the nearest word embedding, and then the next step is computed as the weighted sum of the clamped prediction and some added noise. The formula they use is the same as the forward-diffusion process. The sampling procedure is different to what I've implemented, so I'll try what they suggest - it's quite similar to the ddim_step.

It should look a bit like this:

def diff_lm_step(self, x_t, x_0_estimation, t_now, t_next):
    # re-apply the forward-diffusion formula to the predicted x_0:
    # x_next = sqrt(gamma_next) * x_0_hat + sqrt(1 - gamma_next) * noise
    # (x_t and t_now are unused in this sketch)
    gamma_next = self.gamma(t_next).unsqueeze(1).unsqueeze(1)
    mean_weight = torch.sqrt(gamma_next)
    std = torch.sqrt(1 - gamma_next)
    z = torch.randn_like(x_0_estimation)
    return (mean_weight * x_0_estimation) + (z * std)
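
Used in a sampling loop it would be roughly this (a sketch, ignoring the self argument and using hypothetical model/interpolate calls):

    x_t = torch.randn(batch_size, seq_len, dim)
    for t_now, t_next in zip(timesteps[:-1], timesteps[1:]):
        x_0_hat = model(x_t, t_now)      # the model predicts the embeddings directly
        x_0_hat = interpolate(x_0_hat)   # optional clamping / interpolation of the prediction
        x_t = diff_lm_step(x_t, x_0_hat, t_now, t_next)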

Sampling with this method sort of works, but currently it heavily overpredicts [i, dont, know, if, you, going] on my undertrained model:

"i'm going to do you've been going home to me i just don't know what you do, but i don't know any of you, but i don't want to know you would take me, but i don't know you do i want you 
i'm not going to know when you saying, i want to know if i'm going to i don't tell me, but it's not okay, i don't know i know that, just it on you i don't know what if i say that is here

If you wish to use the clamping trick, this can actually be done by lowering the interpolation temperature. This makes the softmax distribution more like a one-hot, which effectively clamps the interpolation to a single embedding. It will also make the predictions even more likely to overpredict common words.
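
Concretely the effect is roughly this (a sketch of the idea; logits, temperature and embedding_table are placeholders):

    # temperature -> 0 sharpens the softmax towards one-hot, i.e. clamping to a single embedding
    weights = (logits / temperature).softmax(dim=-1)
    clamped = torch.einsum('nlv,vd->nld', weights, embedding_table)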

Self-conditioning isn't used in that paper, but it is becoming more common in many models; Analog Bits and CDCD use this method.

Yes, it seems like the model has made some progress, but it isn't really accurate enough. It seems like it's learnt the dependency between bear and teddy too strongly.

ShyFoo commented 1 year ago

Thanks for your explanations. I just don't understand the purpose of self-conditioning: is it for accelerating inference or for improving denoising accuracy?

thorinf commented 1 year ago

Improving performance. But this could also lead to accelerating inference: if your model becomes better, then you may not need to run as many inference steps.
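
For context, training-time self-conditioning is usually something like this (a sketch in the Analog Bits style, not this repo's exact code):

    import torch
    import torch.nn.functional as F

    # half the time, make a first prediction with zeroed self-conditioning,
    # detach it, and feed it back in as the self-conditioning input
    self_cond = torch.zeros_like(x_t)
    if torch.rand(()) < 0.5:
        with torch.no_grad():
            self_cond = model(x_t, t, self_cond)
    x_0_hat = model(x_t, t, self_cond)
    loss = F.mse_loss(x_0_hat, x_0)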

ShyFoo commented 1 year ago

Fine. That's what I was thinking.

thorinf commented 1 year ago

Take a look at the newest changes. Some changes are small, but there was an error in the interpolation step, for example: L2 normalisation should be applied to the embeddings before the weighted sum, not after - you may notice better results.

ShyFoo commented 1 year ago

    def get_embeddings(self, ids):
        x = self.embedding(ids)
        x = F.normalize(x, dim=-1) * math.sqrt(self.embedding_dim)
        return x

Is it here?

thorinf commented 1 year ago

  def interpolate(self, x):
      # softmax over the vocabulary logits gives interpolation weights;
      # a lower interpolate_temperature sharpens these towards a one-hot (clamping)
      logits = self.get_logits(x) / self.interpolate_temperature
      weights = logits.softmax(dim=-1)
      # L2-normalise the embedding table before the weighted sum
      e = self.embedding.weight
      e = F.normalize(e, dim=-1) * math.sqrt(self.embedding_dim)
      interpolated = torch.einsum('nle,ed->nld', weights, e)
      return interpolated

ShyFoo commented 1 year ago

Thanks. But I have a question about the embedding normalization in the Difformer paper. The authors claimed that they put an LN layer on top of the embedding layer, but you didn't do that:

    def get_embeddings(self, ids):
        x = self.embedding(ids)
        x = self.norm_latent(x)
        x = F.normalize(x, dim=-1) * math.sqrt(self.embedding_dim)
        return x

    def get_logits(self, x):
        x = self.dropout(x)
        x = self.lm_head(x)
        return x

I would recommend giving it a try. Please let me know the results, thanks!

ShyFoo commented 1 year ago

What is the effect of self.normalize? Why does it seem less effective when I use it? The clamp trick does work!

ShyFoo commented 1 year ago

How high is your BLEU now?

thorinf commented 1 year ago

Yeah, Difformer does use LayerNorm; I've been training a bit with it recently. CDCD uses L2 normalisation. It could be the case that L2 normalisation (and rescaling) is not the best solution.

I'm not testing BLEU currently; I can see the sentences and I know they aren't really making sense:

conniel mulled her she would, like regret look to us when she flughed "i know why the children would give me been coming with yourself for till them and find, so you should have from wedledine s

Because of the hardware I have available, I can't tell if this is because the model is undertrained or whether there is a deeper problem. There seems to be an issue with the model capturing longer-range temporal information, but perhaps it's the sampling stage that is not good enough.

ShyFoo commented 1 year ago

To be honest, I don't think that applying diffusion models to natural language makes sense. I have been working on it for a few months, but my experiments have failed. I have therefore re-examined the possibility of using diffusion models for natural language, and I am now ready to switch to working on large language models. If you are interested, we can get in touch with each other via email.

thorinf commented 1 year ago

LLMs are a totally different scale compared to this model; LLMs are 1B+ parameters, so maybe you mean an AR-LM? Diffusion isn't currently better at NLP, in fact it's not really competitive at all. But there are some reasons why Diffusion LMs are interesting:

ShyFoo commented 1 year ago

Yes, I take your point. But with the appearance of ChatGPT and other extremely large models, over 100B parameters or even bigger, some of the problems you mentioned could be solved well.

thorinf commented 1 year ago

It's hard to say what future models will be. Diffusion is very new; it only beat GANs on image generation in 2021. There's still a lot of progress that may be made; in fact, there has recently been a paper on Consistency Models where you can generate with single-step or multi-step sampling. I think the control issue could become more important, especially in multi-modal cases. Some very high-profile researchers also say that AR-LLMs have these issues.

Your use case is labelling images? What could be done for Diffusion is to just train on a very large text corpus. You then have a core Diffusion model that is hopefully very good at generating text. It's then possible to train 2 classifiers, one for your image, and one for your labels, and train them to generate the same embedding for paired data. Then finally at inference, you can generate an embedding for an image, and use classifier guidance with your text classifier and embedding on the Diffusion generation. In this way you have something very modular, and you haven't needed to fine-tune your Diffusion model at all. This is just an idea though, we may never use models like this extensively.
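
The guidance step in that setup would be roughly like this (everything here is hypothetical, just to illustrate the idea):

    import torch

    def guided_step(x_t, t, image_embedding, text_classifier, guidance_scale=1.0):
        # nudge x_t so that the text classifier's embedding of the partial generation
        # moves towards the image embedding (classifier guidance)
        x_t = x_t.detach().requires_grad_(True)
        text_embedding = text_classifier(x_t, t)
        similarity = torch.cosine_similarity(text_embedding, image_embedding, dim=-1).sum()
        grad = torch.autograd.grad(similarity, x_t)[0]
        return x_t.detach() + guidance_scale * grad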

ShyFoo commented 1 year ago

Yeah, I also read the Consistency Models paper. It is awesome and may have important implications for future research.

thorinf commented 1 year ago

Take a look at my latest repo 👍