MLC Stable Diffusion for RK3588's Mali GPU
Run Stable Diffusion on RK3588's Mali GPU with MLC/TVM.

Generating a 512x512 image currently takes about 500 seconds, including model loading and GPU kernel compilation; the actual inference time is less. The U-Net runs at 21 seconds per iteration.
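As a rough budget, the U-Net time scales linearly with the number of denoising steps. The sketch below multiplies the 21 seconds per iteration quoted above by a few step counts. The step counts are illustrative assumptions (the text does not say how many steps the ~500 second run used), and `unet_seconds` is a hypothetical helper.

```python
# Rough time budget for the U-Net part of one image generation.
# SECONDS_PER_ITERATION is the measurement quoted above; the step counts
# used below are illustrative assumptions, not settings from this repo.
SECONDS_PER_ITERATION = 21


def unet_seconds(num_steps: int, seconds_per_iteration: float = SECONDS_PER_ITERATION) -> float:
    """Return the estimated U-Net time in seconds for `num_steps` denoising steps."""
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return num_steps * seconds_per_iteration


if __name__ == "__main__":
    for steps in (10, 20, 30):
        print(f"{steps} steps: about {unet_seconds(steps) / 60:.1f} min of U-Net time")
```

Model loading, kernel compilation, CLIP and the VAE come on top of this figure.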


## Usage

  1. Get a Stable Diffusion 1.5 model from Hugging Face, CivitAI, or any other model site. Any fine-tuned model works, since they are all based on the same architecture.

  2. The model is likely in .safetensors/.pth format. Convert it to the Hugging Face Diffusers format with the `convert_model_from_pth_safetensors.py` script, for example:

    python ./convert_model_from_pth_safetensors.py --checkpoint_path ./anythingv5.safetensors --dump_path ./anythingv5/ --from_safetensors --original_config_file ./v1-inference.yaml

(This can be done on any machine, not necessarily on the RK3588.)
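Before moving on, it can be worth checking that the conversion produced a complete Diffusers directory. The sketch below is a sanity check under stated assumptions, not part of this repo: `check_diffusers_dir` is a hypothetical helper, and the expected entries follow the usual Stable Diffusion 1.5 Diffusers layout (`model_index.json` plus the `unet`, `vae`, `text_encoder`, `tokenizer` and `scheduler` subfolders).

```python
import os

# Entries normally present in a Stable Diffusion 1.5 Diffusers directory.
# This list is an assumption based on the usual layout; adjust it if your
# converter version writes something different.
EXPECTED_FILES = ["model_index.json"]
EXPECTED_DIRS = ["unet", "vae", "text_encoder", "tokenizer", "scheduler"]


def check_diffusers_dir(path: str) -> list:
    """Return the expected entries missing under `path` (empty list = looks complete)."""
    missing = [f for f in EXPECTED_FILES if not os.path.isfile(os.path.join(path, f))]
    missing += [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(path, d))]
    return missing


if __name__ == "__main__":
    missing = check_diffusers_dir("./anythingv5")
    print("missing entries:", missing or "none")
```

An empty result only means the directory structure looks right; it does not validate the weights themselves.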

  3. Install TVM. Follow TVM's documentation to build it from the latest source, with the OpenCL backend enabled.

  4. You may need to check OpenCL support on your RK3588. Follow https://llm.mlc.ai/docs/install/gpu.html#orange-pi-5-rk3588-based-sbc .

  5. Edit the `build.py` script to match the model you want to build:

    def trace_models(
        device_str: str,
    ) -> Tuple[tvm.IRModule, Dict[str, List[tvm.nd.NDArray]]]:
        from diffusers import StableDiffusionPipeline

        #pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
        pipe = StableDiffusionPipeline.from_pretrained("./anythingv5")

  6. Build the model using the following command:

    python ./build.py

The warning about missing tuned operators is expected. The build script will generate a dist directory with the built model.

  7. Run the model using the following command:

    python ./deploy.py --device-name opencl
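If `deploy.py` cannot find a device, a quick way to confirm that TVM sees the Mali GPU through OpenCL is to query the device object. This is a minimal sketch: `opencl_status` is a hypothetical helper, and it assumes TVM was built with the OpenCL backend enabled as described in the TVM installation step.

```python
def opencl_status(device_id: int = 0) -> str:
    """Describe whether TVM can see an OpenCL device with the given index."""
    try:
        import tvm  # imported lazily so a missing TVM install is reported, not raised
    except ImportError as err:
        return f"TVM is not importable in this Python environment: {err}"

    dev = tvm.opencl(device_id)
    if not dev.exist:
        return f"OpenCL device {device_id} not found (TVM built without OpenCL, or driver missing)"
    return f"OpenCL device {device_id} found: {dev.device_name}"


if __name__ == "__main__":
    print(opencl_status())
```

If the device is reported as missing, revisit the OpenCL setup guide linked above before debugging the model itself.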


## Tuning

  1. It is strongly recommended to tune the model on the board itself (not over RPC), because the local measurer automatically detects a failed trial and moves on to the next one. Over RPC, the measurer does not detect the failed trial and waits for the timeout, which is set to about 3 minutes, making the tuning even slower.

  2. Apply workarounds 3 and 4 from the Pitfalls section.

  3. Uncomment the code in build.py to enable tuning:

    # # tuning part
    # # delete the VAE part of the model when tuning u-net. It will interfere with the tuning. Also it can run on NPU? https://clehaxze.tw/gemlog/2023/07-15-inexhaustive-list-of-models-that-works-on-rk3588.gmi
    # entry_funcs = ['clip', 'unet', 'dpm_solver_multistep_scheduler_convert_model_output', 'dpm_solver_multistep_scheduler_step', 'pndm_scheduler_step_0', 'pndm_scheduler_step_1', 'pndm_scheduler_step_2', 'pndm_scheduler_step_3', 'pndm_scheduler_step_4', 'image_to_rgba', 'concat_embeddings']

    # new_mod = tvm.IRModule()
    # for gv, func in mod.functions.items():
    #     try:
    #         if func.attrs["global_symbol"] == "main" and func.attrs["num_input"] == 1: # vae
    #             continue
    #     except:
    #         pass
    #     new_mod[gv] = func
    # mod = new_mod
    # mod = relax.transform.DeadCodeElimination(entry_funcs)(mod)
    # debug_dump_script(mod, "mod_tune.py", args)

    # # run tuning
        runner=ms.runner.LocalRunner(timeout_sec=180,  # need to be that long!
                                     maximum_process_uses=1, # avoids the buggy Mali OpenCL behaviour where subsequent runs fail after the first failure (this code change is not committed yet)
                                            number=1,    # avoid timeout 2
                                            min_repeat_ms=0,  # https://github.com/apache/tvm/issues/16276
        # runner=runner,
  4. Execute `build.py` and wait for a long time (more than 20 hours, have a good day!). You can interrupt the tuning at any time once you notice that the latency is no longer improving; the best result so far is saved in the log_db_my directory. However, tuning cannot resume from the last checkpoint after an interruption. (Please open a feature request in TVM if you want this feature. I want it too. See: https://discuss.tvm.apache.org/t/metaschedule-how-to-resume-tuning/15298) It is best to run the tuning inside a screen session.

You can follow the tuning progress with a script like the one below, which parses the task scheduler log, saves the values to a CSV file, and plots total latency against total trials:

    import csv
    import os
    import re

    import matplotlib.pyplot as plt

    # Tuning log directory; change it to match the one used by build.py
    log_dir = 'log_db_fp16_clip_unet'
    log_file_path = os.path.join(log_dir, 'logs/tvm.meta_schedule.logging.task_scheduler.log')

    # Patterns for the summary lines; the parentheses in "(us)" must be escaped
    pattern_trials = r'Total trials: (\d+)'
    pattern_latency = r'Total latency \(us\): ([\d.e+-]+)'

    trials_list = []
    latency_list = []

    with open(log_file_path, 'r') as file:
        log_data = file.read()

    # Match "Total trials" and "Total latency"
    trials_matches = re.findall(pattern_trials, log_data)
    latency_matches = re.findall(pattern_latency, log_data)

    for trial, latency in zip(trials_matches, latency_matches):
        trials_list.append(int(trial))
        latency_list.append(float(latency))

    # Save the parsed values to a CSV file
    with open('trials_latency.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Total trials', 'Total latency (us)'])
        for trial, latency in zip(trials_list, latency_list):
            writer.writerow([trial, latency])

    # Plot latency against the number of trials
    plt.figure(figsize=(10, 6))
    plt.plot(trials_list, latency_list, marker='')
    plt.xlabel('Total trials')
    plt.ylabel('Total latency (us)')
    plt.title('Latency vs Trials')
    plt.show()
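To make the "latency is not improving anymore" judgement less subjective, a simple plateau check over the "Total latency (us)" values from the task scheduler log could be used. The sketch below is only one possible heuristic: `has_plateaued`, the window of 5 records and the 1% threshold are assumptions chosen for illustration, not values from TVM or this repo.

```python
from typing import Sequence


def has_plateaued(latencies: Sequence[float], window: int = 5, min_gain: float = 0.01) -> bool:
    """Return True if the best latency improved by less than `min_gain`
    (as a fraction) over the last `window` records.

    `latencies` holds the "Total latency (us)" values in log order.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(latencies) <= window:
        return False  # not enough history to judge
    best_before = min(latencies[:-window])  # best value before the window
    best_now = min(latencies)               # best value including the window
    if best_before <= 0:
        return True  # degenerate input; nothing left to improve
    gain = (best_before - best_now) / best_before
    return gain < min_gain


if __name__ == "__main__":
    history = [1000.0, 800.0, 700.0, 699.0, 698.5, 698.2, 698.1, 698.0]
    print("plateaued:", has_plateaued(history))
```

A larger window makes the check more conservative, which may be preferable given that interrupted tuning cannot be resumed.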

5. Comment out the tuning code in `build.py` and add a `MetaScheduleApplyDatabase` line to apply the result. A result applied first is not replaced by one applied later, so apply the best result first and then the second best.

6. Test your result and see whether the model is faster, or how many hours you have wasted waiting for tuning.

7. Happy hacking!

8. Remember to star this repo if you find it useful!
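For step 5, applying several tuning databases in order might look roughly like the sketch below. It assumes the `relax.transform.MetaScheduleApplyDatabase(work_dir)` pass from TVM's Relax API and that the pass is run under a target context. `apply_tuning_databases` is a hypothetical helper and the directory names in the usage comment are placeholders. Check the call against the TVM version you built.

```python
from typing import Any, Sequence


def apply_tuning_databases(mod: Any, target: Any, work_dirs: Sequence[str]) -> Any:
    """Apply MetaSchedule tuning results to `mod`, best database first.

    Functions scheduled by an earlier database are left untouched by later
    ones, so `work_dirs` should be ordered from best to worst result.
    """
    if not work_dirs:
        return mod  # nothing to apply

    from tvm import relax  # lazy import: needs a TVM build that includes Relax

    with target:  # assumption: the pass looks up the current target
        for work_dir in work_dirs:
            mod = relax.transform.MetaScheduleApplyDatabase(work_dir)(mod)
    return mod


# Usage inside build.py (all names here are placeholders):
#   mod = apply_tuning_databases(mod, target, ["log_db_best", "log_db_second_best"])
```

Keeping the order explicit in a list makes it easy to swap databases when a later tuning run turns out better.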

## Limitations

- The model runs in FP32. FP16 would be faster and use less memory, but I don't know how to convert the model to FP16.
- The model is under-tuned because tuning is so slow (the current result comes from 48 hours of tuning). It can be optimized further by tuning for longer. But try FP16 first!

## Why not NPU? More FLOPS!

- The RKNPU2 SDK is crappy and buggy
- ~~RKNPU2 does not support MatMul >= 256x256 in its model conversion, while U-Net has large MatMul operations.~~ Update: since RKNPU2 SDK 2.0.0b0 this limitation is removed, so you can try to run the model on the NPU. (I currently have no interest in doing this. It may be better to wait for SD3, since its DiT architecture should make NPU and dynamic shape support easier to add?)

## Pitfalls

1. `torch._dynamo.exc.BackendCompilerFailed: backend='_capture' raised: AssertionError: Unsupported function type position_ids`
  Downgrade torch to a version from 2.0.0 to 2.1.1 (2.1.1 is tested and works).

2. `expect a Tuple with 1 elements,  but get a Tuple with 196 elements.`
  Wrap `params` in a list (`[]`) in `utils.py`:
    def transform_params(
        #new_params[name] = vm[name + "_transform_params"](params)
        new_params[name] = vm[name + "_transform_params"]([params])

Honestly I don't know why this is happening. Version mismatch?

3. dmesg shows an `Iterator PROGRESS_TIMER timeout` error: the Mali GPU timeout is too short. Increase the timeout in the Mali GPU driver:

    echo 99999999999 > /sys/class/misc/mali0/device/progress_timeout

4. dmesg shows a `CS_INHERIT_FAULT` error: a GPU fault sometimes causes subsequent GPU operations in the same process to fail, so when tuning the model it is better to run each trial in a separate process (in local_runners.py, add the `maximum_process_uses=1` parameter to `PopenPoolExecutor`).

5. `CL_OUT_OF_HOST_MEMORY` error: see https://github.com/apache/tvm/issues/16276
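The idea behind the `CS_INHERIT_FAULT` workaround (one process per trial, so that a GPU fault cannot poison later trials) can be illustrated without TVM. The sketch below uses only the Python standard library. `run_trial_isolated` is a hypothetical helper and is not how TVM's measurer is implemented; it only shows the isolation pattern together with the timeout handling.

```python
import subprocess
import sys


def run_trial_isolated(code: str, timeout_sec: float = 180.0) -> str:
    """Run one trial (a Python snippet) in a fresh interpreter process.

    Returns "ok", "failed" (non-zero exit, e.g. after a GPU fault) or "timeout".
    Because every trial gets its own process, a fault in one trial cannot
    affect the next one.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_sec,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-ins for real measurement trials.
    trials = ["print('fast kernel')", "import sys; sys.exit(1)", "print('next kernel')"]
    for snippet in trials:
        print(run_trial_isolated(snippet, timeout_sec=30))
```

The cost of this pattern is process start-up time for every trial, which is small compared with the multi-minute timeouts mentioned above.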

