mlcommons / mobile_app_open

Mobile App Open
https://mlcommons.org/en/groups/inference-mobile/
Apache License 2.0

Stable Diffusion for 4.1 or later #877

Open freedomtan opened 2 months ago

freedomtan commented 2 months ago
  1. mainly follow what the main Inference group has, and replace SDXL with SD 1.5
  2. we don't have 1.5 Keras or TFLite models, so we can start with the 1.4 TFLite files (1.5 and 1.4 have the same architecture, just different weights)
  3. we can try to convert the Inference scripts to C/C++ (with some images selected from COCO as the benchmark dataset)
  4. there are 2 metrics (FID and ??); @freedomtan to update this.
freedomtan commented 2 months ago

from Inference group task summary slide: https://docs.google.com/presentation/d/1jHuhzyo_4zR1gjIsAxMywpDDN_D7H0mG0CoPqkPi3PU/edit?usp=drive_link

freedomtan commented 2 months ago

@mohitmundhragithub please check if you need more information

For mobile devices, we should start with

  1. Stable Diffusion 1.5 (or 1.4 if 1.5 TFLite files are not ready)
  2. output image size 512x512
freedomtan commented 2 months ago

more on stable diffusion

freedomtan commented 1 month ago

@Mostelk With AI Edge Torch, a tool that can convert PyTorch models directly to TFLite, I managed to convert the HuggingFace SD 1.5 UNet to saved_model and TFLite. The tool has some rough edges, but it mostly works.
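A minimal sketch of such a conversion, assuming the diffusers UNet2DConditionModel and the ai_edge_torch convert/export API; the model id, wrapper, and input shapes are illustrative assumptions, not the exact notebook code:

```python
# Sketch: convert the HuggingFace SD 1.5 UNet to TFLite with AI Edge Torch.
# The model id, wrapper, and input shapes below are illustrative assumptions.
import torch
import ai_edge_torch
from diffusers import UNet2DConditionModel

class UnetWrapper(torch.nn.Module):
    """Thin wrapper so the exported module is plain tensors in, tensor out."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states,
                         return_dict=False)[0]

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
wrapped = UnetWrapper(unet).eval()

# SD 1.5 uses 1x4x64x64 latents for 512x512 images; text context is 1x77x768.
sample_args = (
    torch.randn(1, 4, 64, 64),    # noisy latent
    torch.tensor(1),              # denoising timestep
    torch.randn(1, 77, 768),      # CLIP text-encoder hidden states
)

edge_model = ai_edge_torch.convert(wrapped, sample_args)
edge_model.export("sd15_unet.tflite")
```

In practice the UNet may need a small wrapper like the one above (return_dict=False) so the export sees plain tensor outputs.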

freedomtan commented 1 month ago

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

mohitmundhragithub commented 1 month ago

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

Freedom, unable to access the link. Possible to share it in some Google Drive location?

freedomtan commented 1 month ago

Converting Stable Diffusion 1.5 Text Encoder and Unet to TFLite, https://colab.research.google.com/drive/11S1LM9S0eHO22VF3-AgOJJXJ9jNXr-ZN?usp=sharing.

Freedom, unable to access the link. Possible to share it in some Google Drive location?

Access permission updated. Please try again. It's in my personal Google Drive.

aswib commented 1 month ago

Hi @freedomtan, thanks. We were able to convert the text encoder and UNet to TFLite. What about the VAE decoder from the SD pipeline? Are there any limitations in converting it, or have we just not tried yet?

freedomtan commented 1 month ago

Hi @freedomtan, thanks. We were able to convert the text encoder and UNet to TFLite. What about the VAE decoder from the SD pipeline? Are there any limitations in converting it, or have we just not tried yet?

I had trouble converting the VAE decoder. It seems the VAE decoder doesn't follow whatever rules torch.export.export() requires, so the export doesn't work. More work is needed; that is, I guess we need to modify the VAE decoder.
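For illustration, the attempted export path looks roughly like this (the wrapper and model id are assumptions; the point is that torch.export.export() on the stock decoder is the step that fails):

```python
# Sketch of the failing export path for the SD 1.5 VAE decoder (illustrative).
import torch
from diffusers import AutoencoderKL

class VaeDecoder(torch.nn.Module):
    """Wrap decode() so the export sees a plain tensor-in / tensor-out module."""
    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, latents):
        return self.vae.decode(latents, return_dict=False)[0]

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae")
decoder = VaeDecoder(vae).eval()
latents = torch.randn(1, 4, 64, 64)  # 64x64 latents for a 512x512 output image

# This is the step that currently fails for the stock VAE decoder; it may need
# model changes before torch.export.export() (and the TFLite conversion) works.
exported = torch.export.export(decoder, (latents,))
```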

freedomtan commented 2 weeks ago

Let's try to implement FID and CLIP (with the reference images / output-tensor distributions pre-generated on a host machine; that is, not trying to run Inception V3 and the CLIP image part on Android devices).

@RSMNYS to check feasibility.

freedomtan commented 2 weeks ago

For the CLIP score: it is the cosine similarity between text features and image features, where the text features are obtained by sending the captions to the CLIP text encoder. So yes, it's surely possible to pre-compute the text features, but they still have to be generated somewhere (by sending the prompts to the text encoder). For the image features, we need to send an image to the CLIP image encoder and take its output. Using the COCO images is not enough: we have to send the generated images to the CLIP image encoder to get the image features.

So: I guess we need the CLIP image encoder on Android.
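A host-side sketch of the CLIP score computation, assuming the HuggingFace openai/clip-vit-large-patch14 checkpoint (an assumption; on device the image encoder would be the converted TFLite model instead):

```python
# Host-side reference for the CLIP score (cosine similarity between the text
# features of the caption and the image features of the generated image).
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(caption: str, image: Image.Image) -> float:
    # Text features: these can be pre-computed on a host machine from the prompts.
    text_inputs = processor(text=[caption], return_tensors="pt",
                            padding="max_length")
    text_feat = model.get_text_features(**text_inputs).detach().numpy()[0]
    # Image features: these depend on the generated image, so the CLIP image
    # encoder has to run wherever the image is produced (i.e. on the device).
    image_inputs = processor(images=image, return_tensors="pt")
    image_feat = model.get_image_features(**image_inputs).detach().numpy()[0]
    # The CLIP score is the cosine similarity of the two feature vectors.
    return float(np.dot(text_feat, image_feat) /
                 (np.linalg.norm(text_feat) * np.linalg.norm(image_feat)))
```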

@mohitmundhragithub and @AhmedTElthakeb

freedomtan commented 1 week ago

For the FID score, we need to compare two feature distributions, obtained by sending ground-truth images and generated images through Inception V3. The former can be computed offline; the latter is supposed to be computed on devices. So we also need the Inception V3 related tooling.
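For reference, a sketch of the FID computation once the Inception V3 pool features are available (the feature dimension and API usage follow common convention, not code from this repo):

```python
# Sketch of the FID computation from Inception V3 pool features.
# ref_feats can be pre-computed offline from the ground-truth images;
# gen_feats come from the images generated on the device.
import numpy as np
from scipy import linalg

def fid(ref_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """ref_feats, gen_feats: (N, 2048) Inception V3 pool features."""
    mu_r, sigma_r = ref_feats.mean(axis=0), np.cov(ref_feats, rowvar=False)
    mu_g, sigma_g = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)
    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```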

freedomtan commented 1 week ago

Let's discuss this in the mobile working group meeting.

@freedomtan to do the CLIP score in C/C++; @aswib to do the FID score.

freedomtan commented 1 week ago

For the CLIP score: it turns out to be quite straightforward. Convert an OpenAI CLIP model to TFLite and run it with the TFLite interpreter, then we can get CLIP scores.

See my reference code at https://github.com/freedomtan/clip_score_on_android/.

For our accuracy use, we need to

  1. use the tokenizer output and pad the attention mask
  2. resize the image (512x512) to 224x224 and convert NHWC to NCHW (we need resizing anyway, since Inception V3 expects 1x299x299x3); see the sketch below
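A rough sketch of these two preprocessing steps (pad token id, value scaling, and normalization are assumptions for illustration; the fixed sizes follow the list above):

```python
# Rough sketch of the two preprocessing steps above.
import numpy as np
from PIL import Image

def pad_tokenizer_output(ids, mask, max_len=77, pad_id=0):
    """Pad the tokenizer's input_ids and attention_mask to the fixed CLIP length."""
    pad = max_len - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad

def to_nchw(image_512: Image.Image, size=224):
    """Resize a 512x512 generated image and convert NHWC -> NCHW for CLIP."""
    img = np.asarray(image_512.resize((size, size)), dtype=np.float32) / 255.0
    return np.transpose(img[np.newaxis, ...], (0, 3, 1, 2))  # 1xHxWxC -> 1xCxHxW
```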
freedomtan commented 4 days ago

For the output to LoadGen: ::mlperf::QuerySamplesComplete() is called to return the processed outputs

https://github.com/mlcommons/mobile_app_open/blob/09e4b41f72714c84fb0cf844433da47ff726f62f/flutter/cpp/mlperf_driver.cc#L83

For the non-offline case: https://github.com/mlcommons/mobile_app_open/blob/09e4b41f72714c84fb0cf844433da47ff726f62f/flutter/cpp/mlperf_driver.cc#L66-L80

What is returned is a QuerySampleResponse, which uses a uintptr_t data field

https://github.com/mlcommons/inference/blob/9e2c9f642e6e12b74e7c08d2e099c8af0e542873/loadgen/query_sample.h#L49-L76

My understanding is that LoadGen actually treats the output data as opaque blobs, so it's not necessary to return accuracy metrics.

freedomtan commented 4 days ago

We may need to extend ComputeAccuracy() if we use 2 scores (FID and CLIP). However, this has nothing to do with the LoadGen interface.

https://github.com/mlcommons/mobile_app_open/blob/09e4b41f72714c84fb0cf844433da47ff726f62f/flutter/cpp/dataset.h#L68