In the model description for the stable diffusion benchmark (https://github.com/mlcommons/training/tree/master/stable_diffusion#the-model) we clearly state that the latent output of the autoencoder is 64x64x4, but we don't state the output embedding size of the OpenCLIP-ViT/H text encoder that is also fed into the UNet backbone.
I am not sure, but the relevant config might be https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json, in which case the embedding size would be 1024?
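A minimal sketch to sanity-check that number, assuming the benchmark's text encoder matches open_clip's ViT-H-14 config (which the README doesn't confirm). No pretrained weights are needed for a pure shape check:

```python
# Shape check only: builds ViT-H-14 with random weights from the open_clip package.
import open_clip

model = open_clip.create_model("ViT-H-14", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a photo of a cat"])      # (1, 77) token ids
per_token = model.token_embedding(tokens)     # (1, 77, width) -- transformer width of the text tower
pooled = model.encode_text(tokens)            # (1, embed_dim) pooled/projected text embedding

print(per_token.shape[-1], pooled.shape[-1])  # both 1024 if the ViT-H-14 config above is the right one
```

If that's correct, the conditioning fed to the UNet would be a 77x1024 sequence of token embeddings, and it would be worth stating that alongside the 64x64x4 latent shape in the model description.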