siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.
https://github.com/siliconflow/onediff/wiki
Apache License 2.0

Dev nodes nexfort booster #911

Closed · ccssu closed this 5 months ago

ccssu commented 5 months ago

Nexfort

```bash
cd ComfyUI

# Enable CUDA Graphs
export NEXFORT_FX_CUDAGRAPHS=1

# For best performance
export TORCHINDUCTOR_MAX_AUTOTUNE=1
# Enable cuDNN benchmark
export NEXFORT_FX_CONV_BENCHMARK=1
# Allow TF32 for faster float32 matmul
export NEXFORT_FX_MATMUL_ALLOW_TF32=1

# Enable the FX graph cache to speed up recompilation
export TORCHINDUCTOR_FX_GRAPH_CACHE=1

# Use a persistent cache dir
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor

# Debugging (uncomment as needed)
# export TORCH_LOGS="+dynamo"
# export TORCHDYNAMO_VERBOSE=1
# export NEXFORT_DEBUG=1 NEXFORT_FX_DUMP_GRAPH=1 TORCH_COMPILE_DEBUG=1

python main.py --gpu-only --disable-cuda-malloc --port 8188 --cuda-device 6
```
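
For reference, the two flags commented as "Enable cuDNN benchmark" and "Allow TF32" appear to map to PyTorch's standard backend switches. A minimal sketch of what they presumably toggle (an assumption based on the comments above; the environment variables remain the supported way to configure nexfort):

```python
# Sketch only: presumed PyTorch equivalents of NEXFORT_FX_CONV_BENCHMARK
# and NEXFORT_FX_MATMUL_ALLOW_TF32 (assumption, not the nexfort implementation).
import torch

torch.backends.cudnn.benchmark = True         # let cuDNN autotune conv algorithms per input shape
torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 tensor cores for float32 matmul
```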

How to use Nexfort

Case 1

```python
# Compile arbitrary models (torch.nn.Module)
import torch
import onediff.infer_compiler as infer_compiler

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule().to("cuda").half()
with torch.inference_mode():
    compiled_mod = infer_compiler.compile(mod,
        backend="nexfort",
        options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
    )
    print(compiled_mod(torch.randn(10, 100, device="cuda").half()))
```
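
The `options` dict mirrors the arguments of `torch.compile`; the `TORCHINDUCTOR_*` variables above suggest nexfort builds on TorchDynamo/Inductor. For intuition only, a plain-PyTorch analogue of the call above (note that stock `torch.compile` has no `"max-autotune:cudagraphs"` mode string, so `"max-autotune"` is used here):

```python
# Illustrative torch.compile analogue of the nexfort call above (not the onediff API).
import torch

lin = torch.nn.Linear(100, 10).to("cuda").half()
ref = torch.compile(lin, mode="max-autotune", dynamic=True, fullgraph=True)
with torch.inference_mode():
    print(ref(torch.randn(10, 100, device="cuda").half()))
```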

Case 2

```python
import torch
import onediff.infer_compiler as infer_compiler

@infer_compiler.compile(
    backend="nexfort",
    options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
)
def foo(x):
    return torch.sin(x) + torch.cos(x)

print(foo(torch.randn(10, 10, device="cuda").half()))
```
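
Since `dynamic: True` is set, the compiled `foo` should accept different input shapes without a full recompile. A quick check (shapes are arbitrary, behavior assumed from the `dynamic` option):

```python
# Same decorated foo as above; with dynamic=True a second call with a new shape
# should not force a full recompilation.
print(foo(torch.randn(10, 10, device="cuda").half()))
print(foo(torch.randn(32, 10, device="cuda").half()))
```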

VAE

ComfyUI Workflow

speedup_vae

Result

{ model: sdxl, batch_size: 1, image: 1024x1024, speedup: vae }

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 3.02 s | 2.95 s | 2.31% |

First compilation time: 321.92 seconds
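
In ComfyUI the VAE is compiled through the nexfort booster node in the workflow above. For readers outside ComfyUI, a hypothetical diffusers-style sketch of the same idea; only `infer_compiler.compile(...)` is the onediff API shown earlier, the rest (model ID, pipeline setup, prompt) is assumed for illustration:

```python
# Hypothetical sketch: compile only the VAE with nexfort in a diffusers pipeline.
# The model ID and prompt are placeholders, not part of this PR.
import torch
from diffusers import StableDiffusionXLPipeline
import onediff.infer_compiler as infer_compiler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.vae = infer_compiler.compile(
    pipe.vae,
    backend="nexfort",
    options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
)

image = pipe("a photo of a cat", height=1024, width=1024).images[0]
image.save("sdxl_vae_nexfort.png")
```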


LoRA

ComfyUI Workflow

speedup_vae_unet

Result

{ model: sdxl, batch_size: 1, image: 1024x1024, speedup: vae + unet }

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 3.02 s | 1.85 s | 38.07% |

First compilation time: 878.19 seconds
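
The "vae + unet" numbers come from compiling both modules via the booster node. A rough diffusers-style equivalent, loading a LoRA on top of the compiled UNet (repo ID and LoRA path are placeholders; whether LoRA switching avoids recompilation with nexfort is not claimed here):

```python
# Hypothetical sketch: compile both UNet and VAE with nexfort, then load a LoRA.
# Repo ID and LoRA path are placeholders, not part of this PR.
import torch
from diffusers import StableDiffusionXLPipeline
import onediff.infer_compiler as infer_compiler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

opts = {"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True}
pipe.unet = infer_compiler.compile(pipe.unet, backend="nexfort", options=opts)
pipe.vae = infer_compiler.compile(pipe.vae, backend="nexfort", options=opts)

# Standard diffusers LoRA loading (placeholder path).
pipe.load_lora_weights("path/to/lora.safetensors")

image = pipe("a photo of a cat", height=1024, width=1024).images[0]
```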


ControlNet

ComfyUI Workflow

cnet_speedup

Result

{ model: sdxl, batch_size: 1, image: 1024x1024, speedup: controlnet }

| Accelerator | Baseline (non-optimized) | OneDiff (Nexfort) | Percentage improvement |
| --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 4.93 s | 4.07 s | 17.44% |

First compilation time: 437.84 seconds
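
For the ControlNet case, the booster node compiles the ControlNet module in the workflow above; a corresponding hypothetical diffusers sketch (model IDs and the control image are placeholders, not from this PR):

```python
# Hypothetical sketch: compile UNet and ControlNet with nexfort.
# Model IDs and the control image URL are placeholders, not part of this PR.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image
import onediff.infer_compiler as infer_compiler

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

opts = {"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True}
pipe.unet = infer_compiler.compile(pipe.unet, backend="nexfort", options=opts)
pipe.controlnet = infer_compiler.compile(pipe.controlnet, backend="nexfort", options=opts)

canny = load_image("https://example.com/canny_edges.png")  # placeholder control image
image = pipe("a photo of a cat", image=canny, height=1024, width=1024).images[0]
```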


IPAdapter