What is text_encoding, and how do we condition on it?

Hey! If we want to use pre-computed text/image embeddings and also condition on the text of the original caption, how do we go about that?

Currently in the code I see:

https://github.com/lucidrains/DALLE2-pytorch/blob/2db0c9794c33e98df25b84f557a683a8900dfc61/dalle2_pytorch/dalle2_pytorch.py#L1035-L1037

This implies that you can pass either text or text_embeddings, but not both, to work from tokenized text directly or from embeddings only. However, I see there is also an option to pass text_encodings, and it isn't quite clear to me what the text_encodings are.
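To make my reading of that check concrete, here is a minimal sketch in plain Python. The names `exists` and `check_prior_inputs` are hypothetical stand-ins I made up for illustration; this is not the code at the linked lines, only the exclusive-or behaviour I think it implies.

```python
def exists(val):
    # Hypothetical helper: True when an argument was actually supplied.
    return val is not None


def check_prior_inputs(text=None, text_embeddings=None, text_encodings=None):
    # Exactly one of `text` / `text_embeddings` may be given (exclusive or).
    if not (exists(text) ^ exists(text_embeddings)):
        raise ValueError("pass either text or text_embeddings, but not both")
    # `text_encodings` is optional and is not part of the xor check.
    return {
        "uses_raw_text": exists(text),
        "has_text_encodings": exists(text_encodings),
    }
```

Under this reading, passing text_embeddings together with text_encodings is allowed, which is the combination I am asking about.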

Inside the OpenAIClipAdapter I see:

https://github.com/lucidrains/DALLE2-pytorch/blob/2db0c9794c33e98df25b84f557a683a8900dfc61/dalle2_pytorch/dalle2_pytorch.py#L267-L275

where text_encodings is set to self.text_encodings. I checked the official OpenAI CLIP repo and found no reference to a self.text_encodings, which leads me to believe that self.text_encodings will be None and will be handled by the following logic inside DiffusionPrior:

https://github.com/lucidrains/DALLE2-pytorch/blob/2db0c9794c33e98df25b84f557a683a8900dfc61/dalle2_pytorch/dalle2_pytorch.py#L808

This is in contrast to the XClipAdapter, which sets text_encodings to an actual value:

https://github.com/lucidrains/DALLE2-pytorch/blob/2db0c9794c33e98df25b84f557a683a8900dfc61/dalle2_pytorch/dalle2_pytorch.py#L178

So... 😄 all of this to say: am I correct in believing that text_encodings is supposed to be the hidden state of the CLIP text encoder? And if we want to condition on text_encodings while using pre-computed embeddings, do we need to extract that hidden state from a CLIP model and pass it in to DiffusionPrior.forward() to condition on text?
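Here is a dependency-free toy sketch of the distinction I have in mind, assuming text_encodings means the per-token hidden states and the text embedding is the single pooled vector. `toy_text_encoder` is a hypothetical stand-in for the CLIP text transformer, not code from either repo. The pooling step mirrors how OpenAI CLIP takes the hidden state at the end-of-text token, with the final projection left out for brevity.

```python
EOT_ID = 49407  # end-of-text token id in OpenAI CLIP's tokenizer


def toy_text_encoder(token_ids, dim=4):
    """Return (pooled_embedding, per_token_encodings) for one token sequence."""
    # Stand-in for the transformer: one deterministic vector per position.
    encodings = [
        [float((tok % 7) + pos + d) for d in range(dim)]
        for pos, tok in enumerate(token_ids)
    ]
    # CLIP pools at the EOT position, which holds the largest token id.
    eot_pos = max(range(len(token_ids)), key=lambda i: token_ids[i])
    pooled = encodings[eot_pos]
    return pooled, encodings


# start-of-text, two word tokens, end-of-text, then padding
tokens = [49406, 320, 1929, EOT_ID, 0, 0]
text_embed, text_encodings = toy_text_encoder(tokens)
```

If that assumption holds, pre-computing only the pooled vector throws away the per-token sequence, so the sequence would have to be saved separately and passed in alongside the embedding.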
