huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

hasattr(tensor, "to") creates an error when using torch.compile #3000

Open SunMarc opened 1 month ago

SunMarc commented 1 month ago

System Info

None

Reproduction

Issue when compiling a model that has been dispatched to multiple GPUs.

Error encountered:

  File "/home/marc/.venv/lib/python3.8/site-packages/torch/_dynamo/variables/builtin.py", line 1460, in call_hasattr
    return obj.call_hasattr(tx, name)
  File "/home/marc/.venv/lib/python3.8/site-packages/torch/_dynamo/variables/base.py", line 296, in call_hasattr
    unimplemented(f"hasattr {self.__class__.__name__} {name}")
  File "/home/marc/.venv/lib/python3.8/site-packages/torch/_dynamo/exc.py", line 221, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: hasattr ConstDictVariable to

from user code:
   File "/home/marc/.venv/lib/python3.8/site-packages/torch/_dynamo/external_utils.py", line 38, in inner
    return fn(*args, **kwargs)
  File "/home/marc/accelerate/src/accelerate/hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/home/marc/accelerate/src/accelerate/hooks.py", line 364, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/home/marc/accelerate/src/accelerate/utils/operations.py", line 148, in send_to_device
    if is_torch_tensor(tensor) or hasattr(tensor, "to"):

This check was something we added to enable accelerate to work with tensordict. cc @vmoens Is there a way to make it work without having to use hasattr? One solution could be to make the tensordict library an optional dependency and check explicitly whether we indeed have a TensorDict (see the sketch below). Related issue: https://github.com/huggingface/accelerate/issues/2405

cc @muellerzr
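
A minimal sketch of that option, assuming tensordict exposes a TensorDictBase base class and using importlib.util.find_spec for the optional-dependency check (the helper names below are hypothetical, not accelerate's current API):

import importlib.util

import torch

# Check for tensordict once, without importing it unconditionally.
_TENSORDICT_AVAILABLE = importlib.util.find_spec("tensordict") is not None

def _is_tensordict(obj) -> bool:
    # Import lazily, only when tensordict is actually installed.
    if not _TENSORDICT_AVAILABLE:
        return False
    from tensordict import TensorDictBase
    return isinstance(obj, TensorDictBase)

def _can_send_to_device(obj) -> bool:
    # Explicit isinstance checks instead of hasattr(obj, "to"),
    # so dynamo never has to trace hasattr on arbitrary objects.
    return isinstance(obj, torch.Tensor) or _is_tensordict(obj)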

To reproduce:

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "false" # silence warnings when compiling

device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")
# model.to(device)

tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt = "Why dogs are so cute?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Specify the max length (including both the prompt and the response)
# When calling `generate` with `cache_implementation="static" later, this is also used to create a `StaticCache` object
# with sequence length = `max_length`. The longer the more you will re-use it
model.generation_config.max_length = 128

# without `torch.compile`: each call takes ~ 5.0 seconds (on A100 80G + torch 2.3)
outputs = model.generate(**inputs, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
print(response)

# `torch.compile(model, ...)` is not recommended as you compile callbacks
# and full generate. We recommend compiling only the forward for now. 
# "reduce-overhead" will use cudagraphs. 
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

# with `torch.compile` (on A100 80G + torch 2.3)
# 1st call: ~ 90 seconds
outputs = model.generate(**inputs, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
# 2nd call: ~ 60 seconds
outputs = model.generate(**inputs, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
# 3rd call: ~ 1.5 seconds
outputs = model.generate(**inputs, do_sample=False)
response = tokenizer.batch_decode(outputs)[0]
print(response)

Expected behavior

The model should compile and run as expected.

vmoens commented 1 month ago

I think the purpose here was to allow any class that has a to method, tensordict being just one example of that.

For some reason, this works on pytorch nightlies on my machine:

import torch

@torch.compile(fullgraph=True)
def func(x):
    # On nightlies, dynamo resolves hasattr on a tensor without graph breaking.
    if hasattr(x, "to"):
        return x.to("cpu")
    return x

func(torch.randn(3))

EDIT: I can reproduce with 2.3, so I think this will be solved in the next major release of PyTorch. Would that work for you @SunMarc?

    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: hasattr TensorVariable to

cc @anijain2305: dynamo doesn't support hasattr

SunMarc commented 1 month ago

In your case, you are testing with a tensor; I'm not sure this will work with other data types (in my example, it was failing with ConstDictVariable). Thanks for the ping!

I think the purpose here was to allow any class that has a to method, tensordict being just one example of that.

Yes, I understand. If there is no good general solution, I was just thinking of making sure that it at least works with tensordict!
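
(For reference, a small eager-mode sketch of the behaviour to keep working: accelerate's send_to_device moving a TensorDict through its to method. This assumes tensordict is installed and is not run under torch.compile.)

import torch
from tensordict import TensorDict
from accelerate.utils import send_to_device

# A TensorDict is not a torch.Tensor, but it has a .to method, which is what
# the hasattr(tensor, "to") fallback in send_to_device was meant to catch.
td = TensorDict({"a": torch.randn(3)}, batch_size=[])
moved = send_to_device(td, "cpu")
print(type(moved), moved["a"].device)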

vmoens commented 1 month ago

Oh yeah, interesting, this fails:

import torch
from tensordict import TensorDict

@torch.compile(fullgraph=True)
def func(x):
    if hasattr(x, "to"):
        return x.to("cpu")
    return x
func(dict(a=torch.randn(3)))

but this runs

func(TensorDict(a=torch.randn(3)))

so it seems that dynamo only likes hasattr when it's True, lol. Say this is patched in 2.5, for instance; would that solve the issue?

SunMarc commented 3 weeks ago

Say this is patched in 2.5 for instance, would that solve the issue?

Yeah! This is the only issue I have right now to make big model inference (multi-GPU) + torch.compile work together. I tried removing that check and it worked fine on torch nightly. The goal for us would be to enable this starting from torch 2.5, since with torch 2.4 there were other errors that I didn't manage to debug. LMK if this is something that will be fixed in torch 2.5!
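
A rough, hypothetical sketch of that gating idea (not an actual accelerate API): only allow torch.compile together with multi-GPU dispatch once torch >= 2.5, where dynamo is expected to handle the hasattr check.

import torch
from packaging import version

def check_compile_with_dispatch() -> None:
    # Hypothetical guard: big model inference (multi-GPU dispatch) + torch.compile
    # would only be enabled from torch 2.5 on; earlier versions graph-break on the
    # hasattr(tensor, "to") check in accelerate's hooks.
    if version.parse(torch.__version__) < version.parse("2.5.0"):
        raise RuntimeError(
            "torch.compile with a model dispatched across multiple GPUs requires torch >= 2.5."
        )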