In the model description for the stable diffusion benchmark (https://github.com/mlcommons/training/tree/master/stable_diffusion#the-model) we clearly state that the latent output of the autoencoder is 64x64x4, but we don't state the output embedding size of the OpenCLIP-ViT/H text encoder that is also fed into the UNet backbone.
I am not sure, but the relevant config might be https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json, in which case the embedding size would be 1024?
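A minimal sketch to sanity-check that number, assuming the benchmark's text encoder matches open_clip's ViT-H-14 config (which the README doesn't confirm). No pretrained weights are needed for a pure shape check:

```python
# Shape check only: builds ViT-H-14 with random weights from the open_clip package.
import open_clip

model = open_clip.create_model("ViT-H-14", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a photo of a cat"])      # (1, 77) token ids
per_token = model.token_embedding(tokens)     # (1, 77, width) -- transformer width of the text tower
pooled = model.encode_text(tokens)            # (1, embed_dim) pooled/projected text embedding

print(per_token.shape[-1], pooled.shape[-1])  # both 1024 if the ViT-H-14 config above is the right one
```

If that's correct, the conditioning fed to the UNet would be a 77x1024 sequence of token embeddings, and it would be worth stating that alongside the 64x64x4 latent shape in the model description.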