mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License
136 stars 18 forks

clarifying differences between available models #18

Open zeke opened 2 years ago

zeke commented 2 years ago

Hi @mehdidc 👋🏼 I'm a new team member at @replicate.

I was trying out your model on replicate.ai and noticed that the names of the models are a bit cryptic, so it's hard to know what differences to expect when using each:

[Screenshot from 2021-09-23: the model dropdown on replicate.ai, listing the raw checkpoint filenames as options]

Here's where those are declared:

https://github.com/mehdidc/feed_forward_vqgan_clip/blob/dd640c0ee5f023ddf83379e6b3906529511ce025/predict.py#L10-L14

Looking at the source for cog's Input class, it looks like options can be a list of anything:

options: Optional[List[Any]] = None

I'm not sure if this is right, but maybe this means that each model could be declared as a tuple with an accompanying label:

MODELS = [
    ("cc12m_32x1024_vitgan_v0.1.th", "This model does x"),
    ("cc12m_32x1024_vitgan_v0.2.th" "This model does y"),,
    ("cc12m_32x1024_mlp_mixer_v0.2.th", "This model does z"),
]
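
If options really does accept any list, the tuples might work as-is; otherwise, a minimal sketch (not tied to cog's actual API, which may differ) would be to keep the filenames as the option values and look the labels up separately:

# Sketch only: split the (filename, label) tuples so the raw filenames can be
# passed to cog's `options` while the labels feed the UI. The labels are
# placeholders, not real descriptions of the checkpoints.
MODELS = [
    ("cc12m_32x1024_vitgan_v0.1.th", "This model does x"),
    ("cc12m_32x1024_vitgan_v0.2.th", "This model does y"),
    ("cc12m_32x1024_mlp_mixer_v0.2.th", "This model does z"),
]

OPTION_VALUES = [filename for filename, _ in MODELS]  # what cog would receive
LABELS = dict(MODELS)  # filename -> label, for display on the form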

We could then display those labels on the model form on replicate.ai to make the available options more clear to users.

Curious to hear your thoughts!

cc @CJWBW @bfirsh @andreasjansson

mehdidc commented 2 years ago

Hi @zeke, sorry for the late answer, and thanks for the suggestion. You are absolutely right, the model names are not very informative. The thing is that the models all do the same task, in a sense, and are trained on the same prompt dataset; they differ only in architecture (vitgan vs. mlp_mixer), and between v0.1 and v0.2 I used a different set of data augmentations. The reason they are all provided is that a user might prefer one option over another for a specific prompt. One way to avoid the naming problem would be to not expose the model choice explicitly, but instead display a grid of images as the output, like in ICGAN (https://replicate.ai/arantxacasanova/ic_gan), where each cell of the grid would be the image generated by one model.
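
For what it's worth, a rough sketch of that grid idea (assuming a hypothetical generate(model_path, prompt) function that returns a PIL image; nothing here is the repo's actual API):

from PIL import Image

# Hypothetical sketch: run every checkpoint on the same prompt and tile the
# results into one row, so the user compares models visually instead of
# picking a cryptic filename. `generate` is assumed, not part of the repo.
MODEL_PATHS = [
    "cc12m_32x1024_vitgan_v0.1.th",
    "cc12m_32x1024_vitgan_v0.2.th",
    "cc12m_32x1024_mlp_mixer_v0.2.th",
]

def make_grid(prompt, generate, size=256):
    # One cell per model, laid out left to right.
    images = [generate(path, prompt).resize((size, size)) for path in MODEL_PATHS]
    grid = Image.new("RGB", (size * len(images), size))
    for i, img in enumerate(images):
        grid.paste(img, (i * size, 0))
    return grid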

So I am not totally sure; I will think about it. If you or anyone else has suggestions, I would be glad to hear them.

afiaka87 commented 2 years ago

@mehdidc @zeke

The distinguishing information is:

- modelType: ["mlp_mixer", "vitgan"] -> basically "experimental (mlp_mixer) versus established (vitgan)"
- version: ["v0.1", "v0.2"] -> not sure what the precise differences are here, @mehdidc?
- dimension: [128, 256, 512, 1024] -> correlates directly with the accuracy of the model; bigger is better, but slower.
- depth: [8, 16, 32] -> number of hidden layers; also correlates directly with accuracy. Bigger is better, but slower.

This info is contained in the filename (albeit cryptically). The format is {dataset}_{depth}x{dimension}_{type}_{version} if you remove the curly braces. So cc12m_32x1024_vitgan_v0 gives you: dataset: cc12m, depth: 32, dimension: 1024, type: vitgan, version: v0.
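
To illustrate, a small hypothetical helper (not in the repo) that decodes a filename following that format:

import re

# Hypothetical helper: decode {dataset}_{depth}x{dimension}_{type}_{version}.th
# into its parts, e.g. "cc12m_32x1024_vitgan_v0.1.th".
def parse_model_name(filename):
    pattern = (
        r"^(?P<dataset>\w+?)_(?P<depth>\d+)x(?P<dimension>\d+)"
        r"_(?P<type>\w+?)_(?P<version>v[\d.]+)\.th$"
    )
    match = re.match(pattern, filename)
    if match is None:
        raise ValueError(f"unrecognized model name: {filename}")
    return match.groupdict()

print(parse_model_name("cc12m_32x1024_vitgan_v0.1.th"))
# {'dataset': 'cc12m', 'depth': '32', 'dimension': '1024', 'type': 'vitgan', 'version': 'v0.1'}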

From skimming your post, @zeke, am I correct in assuming you have a somewhat limited API to work with on replicate? There are a few ways this information could be presented; perhaps the easiest would be to summarize it and make it easy to get to from replicate.