huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers

global, eager model weight GPU unloading #8605

Open · doctorpangloss opened this issue 3 months ago

doctorpangloss commented 3 months ago

What API design would you like to have changed or added to the library? Why?

Most people expect diffusers and transformers models to be unloaded automatically, so that a big pipeline "just runs" and fits in whatever VRAM they have.

In other words, author a mixin that tracks the weights of every Hugging Face hierarchy object loaded onto the GPU; when forward is called on any tracked object, it moves the weights of all the other tracked objects to ordinary RAM. Essentially, this is sequential CPU offload at a scope larger than a single Hugging Face hierarchy object.
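
A minimal sketch of that idea, assuming plain PyTorch forward pre-hooks and no new diffusers API (the names track and _TRACKED are hypothetical):

import torch

_TRACKED: list[torch.nn.Module] = []

def track(module: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Register `module` so that calling it evicts every other tracked module
    # to CPU before its own forward runs, then ensures its weights are on `device`.
    # Inputs are assumed to already be on `device`.
    def _pre_hook(mod, args):
        for other in _TRACKED:
            if other is not mod:
                other.to("cpu")
        mod.to(device)
        return None

    module.register_forward_pre_hook(_pre_hook)
    _TRACKED.append(module)
    return module

A mixin version would attach the same hook in from_pretrained instead of requiring an explicit track call.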

What use case would this enable or better enable? Can you give us a code example?

The number of issues about GPU RAM usage scales linearly with adoption. You shouldn't have to deal with the headache of your issue tracker being flooded by it.

Separately, it would eliminate the main source of toil for people who integrate diffusers into other products like ComfyUI.

sayakpaul commented 3 months ago

We cannot control the level of hierarchy project maintainers want to have in their projects.

For the things that are within our scope of control, we try to document them explicitly (such as enable_model_cpu_offload or enable_sequential_cpu_offload). When combined with other techniques such as prompt pre-computation and 8-bit inference of the text encoders, they reduce VRAM consumption considerably.
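
For example, a minimal snippet showing how these documented options combine today (the checkpoint name is just a placeholder):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder checkpoint
    torch_dtype=torch.float16,
)
# Moves each component to the GPU only while it is needed, then back to CPU.
pipe.enable_model_cpu_offload()
# Or, for an even lower VRAM footprint at the cost of speed:
# pipe.enable_sequential_cpu_offload()

image = pipe("an astronaut riding a horse").images[0]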

We cannot do everything on behalf of the users as requirements vary from project to project, but we can provide simple and easy-to-use APIs for the users so that they cover the most common use cases.

On a related note, @Wauplin has started to include a couple of modeling utilities within huggingface_hub (such as a model sharding utility). So I wonder if this is something for him to consider. Or maybe it lies more within the scope of accelerate (maybe something already exists that I am unable to recollect). So, ccing @muellerzr @SunMarc.

doctorpangloss commented 3 months ago

I observe:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

# the two models do not fit on the GPU at the same time
assert memory_usage(x) + memory_usage(y) > gpu_memory_available()

# so callers hand-roll load/unload bookkeeping around every forward pass
# (load, unload, memory_usage and gpu_memory_available are pseudocode)
load(x)
x()
unload(x)
load(y)
y()
unload(y)

reinvented over and over again by downstream users of diffusers and PyTorch.

load and unload are kind of hacky ideas. They only really make sense in a one-GPU, personal-computer context that runs a single task at a time.

doctorpangloss commented 3 months ago

Example implementation:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0')):
  x(...)
  y(...)

Here, forward in the mixin would check a contextvar for loaded and unloaded models. Alternatively:

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0'), models=[x, y]):
  x(...)
  y(...)

Either form could be implemented now with little issue.
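
A hedged sketch of the second form, assuming plain PyTorch hooks rather than any existing diffusers API (sequential_offload here is hypothetical):

import contextlib
import torch

@contextlib.contextmanager
def sequential_offload(offload_device, load_device, models):
    # Keep at most one of `models` on `load_device` at a time; everything else
    # is parked on `offload_device` until the context exits.
    # Inputs are assumed to already be on `load_device`.
    handles = []

    def _make_hook(active):
        def _pre_hook(mod, args):
            for m in models:
                if m is not active:
                    m.to(offload_device)
            active.to(load_device)
            return None
        return _pre_hook

    for m in models:
        m.to(offload_device)
        handles.append(m.register_forward_pre_hook(_make_hook(m)))
    try:
        yield
    finally:
        for handle in handles:
            handle.remove()

Inside the with block, calling x(...) or y(...) transparently swaps weights between the two devices; the contextvar-based variant would do the same bookkeeping without an explicit models argument.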

More broadly:

# for example, prefer weights and inferencing in bfloat16, then gptq 8 bit if supported, then bitsandbytes 8 bit, etc.
llm_strategy = BinPacking(
  devices=["cuda:0", "cuda:1", ...],
  levels=[
    BitsAndBytesConfig(...),
    GPTQConfig(...),
    torch.bfloat16
  ]
)

# some models do not perform well at 8-bit, for example, so 8-bit should not be used for them at all
unet_strategy = BinPacking(
  levels=[torch.float16, torch.bfloat16]
)

# maybe there is a separate inference and weights strategy
t5_strategy = BinPacking(load_in=[torch.float8, torch.float16], compute_in=[torch.float16])

with model_management(strategy=llm_strategy):
  x(...)
  with model_management(strategy=unet_strategy):
    y(...)
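
For what it's worth, the "levels" above map onto quantization configs that already exist in transformers, so the fallback ladder a BinPacking strategy would automate is exactly what users hand-roll today. A rough sketch, with a placeholder checkpoint:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

try:
    # preferred level: full bfloat16 weights on a single GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
    )
except torch.cuda.OutOfMemoryError:
    # fallback level: 8-bit weights via bitsandbytes
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="cuda:0",
    )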

Ultimately downstream projects keep writing this, and it fuels a misconception that "Hugging Face" doesn't "run on my machine."

For example: "Automatically loading/unloading models from memory [is something that Ollama supports that Hugging Face does not]"

You guys are fighting pretty pervasive misconceptions too: "Huggingface isn't local"

Regarding "We cannot control the level of hierarchy project maintainers want to have in their projects":

Perhaps the contextlib approach is best.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.