Open wonkyoc opened 5 days ago
Okay. I found that device_map
actually only offloads the model weights, not the execution as well. If there is a GPU, then the GPU is the main priority for executing the model.
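A minimal sketch of that distinction, using toy stand-in classes (FakeTensor, Layer, and the string devices here are illustrative, not the real accelerate implementation): the device map controls only where a weight is *stored*, while a pre-forward hook copies both the weight and the input to the execution device before computing.

```python
# Conceptual sketch of big-model inference with a device_map.
# FakeTensor and Layer are simplified stand-ins, NOT accelerate's real code.

class FakeTensor:
    def __init__(self, data, device):
        self.data, self.device = data, device

    def to(self, device):
        # Copying between devices is what shows up as memcpy time in a trace.
        return FakeTensor(self.data, device)

class Layer:
    def __init__(self, weight, storage_device):
        # device_map controls only where the weight is *stored*...
        self.weight = FakeTensor(weight, storage_device)

    def forward(self, x, execution_device):
        # ...but a pre-forward hook moves weight and input to the
        # execution device (the GPU, if one exists) before computing.
        w = self.weight.to(execution_device)
        x = x.to(execution_device)
        return FakeTensor(x.data * w.data, execution_device)

layer = Layer(weight=3, storage_device="cpu")  # weight offloaded to CPU
out = layer.forward(FakeTensor(2, "cpu"), execution_device="cuda:0")
print(out.device, out.data)  # execution still happened on cuda:0
```

So even a CPU-offloaded layer produces its output on the GPU, which matches the behavior described in the issue.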
Correct, that's how our big model inference works.
cc @SunMarc
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Expected behavior
What I want to see is my device map working correctly. I put down_blocks.0 on cuda and down_blocks.1 on cpu, but it does not seem to work the way I intend.
If you look at the screenshot, down_blocks.1 (CrossAttnDownBlock 1) still calls cudaMalloc in Attention, which it shouldn't if it were executing on CPU. I do see a long copy time, which I believe is a cuda -> CPU copy based on the device map, so a hook seems to be trying to use the CPU, but I don't understand why a new hook still uses cuda. The same thing happens in the other layers I assigned to CPU execution; because of this, every pre/post forward copies data.
Is this a bug?
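For reference, a device map like the one described would look roughly like this (the key names follow the UNet block naming mentioned above and are assumptions; the exact keys depend on the model). As discussed, this only pins where each submodule's weights live, not where its forward pass runs:

```python
# Hypothetical device_map splitting a UNet across devices.
# Keys follow the down_blocks naming from the issue; adjust to your model.
device_map = {
    "down_blocks.0": "cuda:0",  # weights stored on GPU
    "down_blocks.1": "cpu",     # weights offloaded to CPU
}
```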
Another side question is how much data is moved. Although I figured out that
set_module_tensor_to_device()
and send_to_device()
are responsible for the data copies, it is not clear to me whether these functions copy only the output of the previous child layer, or the entire layers within a block (e.g., CrossAttnDownBlock).
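To illustrate the question, here is a simplified sketch of what a send_to_device-style helper does (a pure-Python stand-in, not accelerate's actual implementation): it recursively walks a nested output structure and moves only the tensors it finds, leaving non-tensor values alone. Under that assumption, each call copies just one layer's output, not a whole block's parameters.

```python
# Simplified stand-in for send_to_device: recursively move only the
# tensors inside a nested output (tuple/dict/tensor), not module weights.

class T:  # toy tensor carrying just a device attribute
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return T(device)

def send_to_device(obj, device):
    if isinstance(obj, T):
        return obj.to(device)
    if isinstance(obj, (list, tuple)):
        return type(obj)(send_to_device(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: send_to_device(v, device) for k, v in obj.items()}
    return obj  # non-tensor leaves are left untouched

# A layer output shaped like (hidden_states, extras-dict):
out = send_to_device((T("cuda"), {"hidden": T("cuda"), "step": 3}), "cpu")
print(out[0].device, out[1]["hidden"].device, out[1]["step"])
```

Moving a module's weights themselves is a separate operation (what set_module_tensor_to_device is used for, one tensor at a time), which is why the two functions show up in different places in a trace.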