@NotNANtoN Hello! I've addressed your performance concern about the text being embedded repeatedly - thanks for catching that :) I'll think about your other suggestions once the weekend is over!
Well, that was quick! Great!
I might try the image feature approach this evening, if I get to it.
One part that I don't really get is that the perceptor (CLIP) is initialized as soon as the Imagine class is imported - did you want to avoid re-loading it for every new instantiation of the Imagine class? In my use cases that does not happen often, so I would prefer to load CLIP within the Imagine or DeepDaze class and store it in a field.
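To make this concrete, here is roughly what I have in mind, assuming the OpenAI clip package (where clip.load returns both the model and its preprocessing transform); the real constructor of course takes more arguments than this:

```python
import clip
import torch

class Imagine:
    def __init__(self, text, device=None):
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        # load CLIP once per instance and keep it in a field,
        # instead of loading it at module import time
        self.perceptor, self.clip_preprocess = clip.load("ViT-B/32", device=self.device)
        self.text = text
```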
I opened a pull request for this feature: #48
Hey, thanks a lot for this repo! I've been playing around with this a lot, and by reducing the image size to 256 pixels I can generate some amazing images using 8 GB of VRAM.
I was thinking of a project in which I combine image and text features into a single feature vector, so that the SIREN network generates an image representing both at the same time. For this, we would need to extract the features from a given image instead of a text prompt.
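For reference, extracting image features with CLIP would look roughly like this (assuming the standard clip API and PIL; the file name is just an example):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    # image features live in the same embedding space as the text features,
    # so they could be fed to the SIREN loss in place of the text encoding
    img_features = perceptor.encode_image(image)
```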
In general, there is some inefficiency in the current code: the text encoding is recalculated during every DeepDaze train_step, even though the text does not change.
I would recommend that DeepDaze (or the Imagine class) accept a CLIP feature vector as an input. This feature vector could simply be stored and reused in the SIREN loss calculation, and a setter such as set_feature_vector could override it later. There should of course also be a backwards-compatible mode that takes a text prompt as input and stores the corresponding feature vector, plus the new option to pass an image as input.
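A rough sketch of what I have in mind, purely as an illustration - the class layout and the names set_clip_encoding and train_step are placeholders, not the repo's actual API:

```python
import clip
import torch

class DeepDaze(torch.nn.Module):
    def __init__(self, perceptor, text=None, clip_encoding=None):
        super().__init__()
        self.perceptor = perceptor
        # accept either a raw text prompt or a pre-computed CLIP feature vector
        if clip_encoding is None and text is not None:
            tokens = clip.tokenize(text).to(next(perceptor.parameters()).device)
            with torch.no_grad():
                clip_encoding = perceptor.encode_text(tokens)
        self.register_buffer("clip_encoding", clip_encoding)

    def set_clip_encoding(self, clip_encoding):
        # override the target features, e.g. with image features or a text/image mix
        self.clip_encoding = clip_encoding.detach()

    def train_step(self, generated_image_features):
        # the cached encoding is reused here instead of re-encoding the text each step
        return -torch.cosine_similarity(generated_image_features, self.clip_encoding, dim=-1).mean()
```

With something like this, an encoding obtained via encode_image could be passed in exactly the same way, or averaged with a text encoding before being set.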
If there is interest from @lucidrains or other people, I could submit a pull request for this once I'm done implementing it. If, on the other hand, you have already tested using image CLIP features instead of text features to generate SIREN images, please let me know!