huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Not able to get good performance for diffusion models when doing single image inference with batch size 1 #1195

Open basantaxpatra opened 2 months ago

basantaxpatra commented 2 months ago

System Info

System Configuration: Single node Habana Gaudi setup
Firmware Version: hl-1.15.0-fw-48.2.1.1
Software Stack: Synapse AI 1.15

Reproduction

$ docker pull vault.habana.ai/gaudi-docker/1.15.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
$ docker run --rm -it vault.habana.ai/gaudi-docker/1.15.0/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest bash
$ git clone git@github.com:huggingface/optimum-habana.git
$ cd optimum-habana
$ pip install .
$ cd examples/stable-diffusion
$ pip install -r requirements.txt
$ python text_to_image_generation.py \
    --model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
    --prompts "Sailing ship painting by Van Gogh" "A shiny flying horse taking off" \
    --num_images_per_prompt 20 \
    --batch_size 8 \
    --image_save_dir /tmp/stable_diffusion_xl_images \
    --scheduler euler_discrete \
    --use_habana \
    --use_hpu_graphs \
    --gaudi_config Habana/stable-diffusion \
    --bf16

Logs for reference:
2 prompt(s) received, 20 generation(s) per prompt, 8 sample(s) per batch, 5 total batch(es).
{'generation_runtime': 470.2324, 'generation_samples_per_second': 0.219, 'generation_steps_per_second': 0.068}

Initial compilation took about 170 seconds. Disregarding that leaves roughly 300 seconds for 32 images, or roughly 9 seconds per image on SDXL (H100s take around 2-3 seconds depending on sampling params).

[
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:17:58.694850", "statistics": {"TotalNumber": 1, "TotalTime": 2406683, "AvgTime": 2406683.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:19:04.525621", "statistics": {"TotalNumber": 2, "TotalTime": 66733949, "AvgTime": 33366974.5}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:19:05.394485", "statistics": {"TotalNumber": 3, "TotalTime": 66871477, "AvgTime": 22290492.333333332}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:20:08.701577", "statistics": {"TotalNumber": 4, "TotalTime": 130001484, "AvgTime": 32500371.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:20:09.602500", "statistics": {"TotalNumber": 5, "TotalTime": 130138275, "AvgTime": 26027655.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:20:58.849669", "statistics": {"TotalNumber": 6, "TotalTime": 144735532, "AvgTime": 24122588.666666668}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:22:02.477322", "statistics": {"TotalNumber": 7, "TotalTime": 207751639, "AvgTime": 29678805.57142857}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:22:03.371944", "statistics": {"TotalNumber": 8, "TotalTime": 207892568, "AvgTime": 25986571.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:23:06.978577", "statistics": {"TotalNumber": 9, "TotalTime": 271316124, "AvgTime": 30146236.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:23:56.499370", "statistics": {"TotalNumber": 10, "TotalTime": 285510855, "AvgTime": 28551085.5}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:23:57.930979", "statistics": {"TotalNumber": 11, "TotalTime": 285652606, "AvgTime": 25968418.727272727}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:23:58.791526", "statistics": {"TotalNumber": 12, "TotalTime": 285788064, "AvgTime": 23815672.0}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:26:00.652013", "statistics": {"TotalNumber": 13, "TotalTime": 299983406, "AvgTime": 23075646.615384616}},
{"metric_name": "graph_compilation", "triggered_by": "metric_change", "generated_on": "2024-06-10T19:26:01.511422", "statistics": {"TotalNumber": 14, "TotalTime": 300058888, "AvgTime": 21432777.714285713}},
{"metric_name": "graph_compilation", "triggered_by": "process_exit", "generated_on": "2024-06-10T19:26:15.341419", "statistics": {"TotalNumber": 14, "TotalTime": 300058888, "AvgTime": 21432777.714285713}},
{"metric_name": "cpu_fallback", "triggered_by": "process_exit", "generated_on": "2024-06-10T19:26:15.341498", "statistics": {"TotalNumber": 0, "FallbackOps": {}}},
{"metric_name": "memory_defragmentation", "triggered_by": "process_exit", "generated_on": "2024-06-10T19:26:15.341520", "statistics": {"TotalNumber": 0, "TotalSuccessful": 0, "AvgTime": 0, "MaxTime": 0}}
]
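In the metrics dump above, TotalTime only ever grows across the graph_compilation records, so it looks like a running total. A minimal sketch of how one might turn it into a per-compilation cost is below. It embeds only the first four records from the log; the helper name compilation_deltas is made up for illustration, and treating TotalTime as microseconds is an assumption, not something the log states.

```python
import json

# First four graph_compilation records, copied from the log above
# (AvgTime omitted because it is not needed here).
RAW = """[
 {"metric_name": "graph_compilation", "triggered_by": "metric_change",
  "generated_on": "2024-06-10T19:17:58.694850",
  "statistics": {"TotalNumber": 1, "TotalTime": 2406683}},
 {"metric_name": "graph_compilation", "triggered_by": "metric_change",
  "generated_on": "2024-06-10T19:19:04.525621",
  "statistics": {"TotalNumber": 2, "TotalTime": 66733949}},
 {"metric_name": "graph_compilation", "triggered_by": "metric_change",
  "generated_on": "2024-06-10T19:19:05.394485",
  "statistics": {"TotalNumber": 3, "TotalTime": 66871477}},
 {"metric_name": "graph_compilation", "triggered_by": "metric_change",
  "generated_on": "2024-06-10T19:20:08.701577",
  "statistics": {"TotalNumber": 4, "TotalTime": 130001484}}
]"""


def compilation_deltas(records):
    """Return (timestamp, delta) pairs, where delta is the increase in
    the cumulative TotalTime since the previous compilation record."""
    deltas = []
    previous_total = 0
    for record in records:
        if record["metric_name"] != "graph_compilation":
            continue  # ignore cpu_fallback, memory_defragmentation, ...
        if record.get("triggered_by") == "process_exit":
            continue  # the exit record repeats the final total
        total = record["statistics"]["TotalTime"]
        deltas.append((record["generated_on"], total - previous_total))
        previous_total = total
    return deltas


deltas = compilation_deltas(json.loads(RAW))
for timestamp, delta in deltas:
    # ASSUMPTION: TotalTime is in microseconds.
    print(f"{timestamp}  {delta / 1e6:8.2f} s")
```

Run over the full dump, the same loop would show which of the 14 compilations were expensive and when they happened, which helps judge whether compilation cost is confined to startup or keeps recurring during generation.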

Expected behavior

Initial compilation took about 170 seconds, so disregarding that, it is roughly 300 seconds for 32 images, or roughly 9 seconds per image on SDXL. Expected performance is around 2-3 seconds per image.
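For anyone checking the numbers, the estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the issue: the 170 second compilation time and the count of 32 images are the reporter's own estimates (the log itself reports 2 x 20 = 40 generations, so 32 may exclude the first batch of 8, which is a guess), and the 2-3 second H100 range is the reporter's reference point rather than a measurement from this thread.

```python
# Figures taken from the issue text and log.
total_runtime_s = 470.2324   # 'generation_runtime' from the log
compile_time_s = 170.0       # reporter's estimate of initial compilation
images_counted = 32          # reporter's image count (log reports 40 generations)

# Time left once the estimated compilation overhead is removed.
steady_state_s = total_runtime_s - compile_time_s
seconds_per_image = steady_state_s / images_counted

# Reporter's reference range for H100, in seconds per image.
h100_fast_s, h100_slow_s = 2.0, 3.0
slowdown_vs_fast = seconds_per_image / h100_fast_s
slowdown_vs_slow = seconds_per_image / h100_slow_s

print(f"steady-state time: {steady_state_s:.1f} s")
print(f"per image:         {seconds_per_image:.2f} s")
print(f"slowdown vs H100:  {slowdown_vs_slow:.1f}x to {slowdown_vs_fast:.1f}x")
```

With these inputs the per-image time comes out a little above 9 seconds, so the gap to the expected 2-3 seconds is roughly a factor of three to five.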

regisss commented 6 days ago

@basantaxpatra Are you still seeing this issue on newer versions of the lib?