huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

llama-2 device_map (2,3) & `model.generate` #30115

Closed. YooSungHyun closed this issue 6 months ago.

YooSungHyun commented 6 months ago

System Info

transformers==4.39.3, torch==2.2.2, CUDA 12.1 (4x RTX 3090), Python 3.10

Who can help?

@ArthurZucker @younesbelkada @gante

Reproduction

import os
import torch
from setproctitle import setproctitle
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from transformers.utils.logging import set_verbosity_error
import time
import json

set_verbosity_error()

os.environ["TORCHDYNAMO_DISABLE"] = "1"

PROMPT = """### User:
hello?

### Assistant:
"""

if __name__ == "__main__":
    torch.multiprocessing.set_start_method("spawn")
    setproctitle("infer")
    model_path = "{model_path}"

    # Build a parameter-level device_map from the checkpoint's weight index:
    # the first half of the listed weights goes to GPU 2, the second half to GPU 3.
    with open(
        "{pytorch_model.bin.index.json's path}",
        "r",
    ) as file:
        data = json.load(file)
    data = data["weight_map"]
    keys = list(data.keys())
    half_index = len(keys) // 2

    for key in keys[:half_index]:
        data[key] = 2

    for key in keys[half_index:]:
        data[key] = 3

    print(data)

    # Load the model with the hand-built device_map.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, low_cpu_mem_usage=True, device_map=data, torch_dtype=torch.bfloat16, use_cache=True
    )
    print("in")
    model = torch.compile(model)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    start = time.time()
    generation_config = GenerationConfig.from_pretrained(model_path)
    input_ids = tokenizer(PROMPT, return_tensors="pt", truncation=True).input_ids
    model_output = model.generate(input_ids=input_ids.to("cuda"), generation_config=generation_config)[0]
    output = tokenizer.decode(model_output, skip_special_tokens=False)
    print(output)
    end = time.time()
    print(f"{end - start:.5f} sec")

Run this code with any Llama-2 model.

I want to load a model with its parameters split across GPUs 2 and 3. So I read the parameter index and assign the first half of the weights to GPU 2 and the second half to GPU 3. When I call `model.generate`, this error is raised:

Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:2! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 141, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 644, in forward
    cos, sin = self.rotary_emb(value_states, position_ids)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 739, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1016, in forward
    layer_outputs = decoder_layer(
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1196, in forward
    outputs = self.model(
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2411, in _greedy_search
    outputs = self(
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
    result = self._greedy_search(
  File "/home/bart/LLM42/train/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bart/LLM42/train/infer.py", line 60, in <module>
    model_output = model.generate(input_ids=input_ids.to("cuda"), generation_config=generation_config)[0]
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:2! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

So I investigated why this error is raised, and I found the cause here [1]:

`LlamaRotaryEmbedding` is constructed with `device=None`, so its `inv_freq` buffer is allocated on the CPU [2].
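
A quick way to confirm this (a hypothetical diagnostic, not part of the original script) is to list every buffer's device after loading and look for anything left on the CPU:

# Hypothetical diagnostic: any buffer still on the CPU after dispatch,
# such as the rotary inv_freq here, will trigger the mixed-device error.
for name, buf in model.named_buffers():
    if buf.device.type == "cpu":
        print(name, buf.device)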

Am I using device_map incorrectly?

Expected behavior

`generate` works correctly.

younesbelkada commented 6 months ago

Hi @YooSungHyun, thanks for the issue. Yes, indeed: when computing the device map, make sure to include `inv_freq` as well. Since it is a non-persistent buffer, it is not saved in the checkpoint (and therefore does not appear in the weight index), so your `device_map` never assigns it to a device.
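
As a sketch of that suggestion (the `self_attn.rotary_emb.inv_freq` names and the 32-layer count are assumptions based on the standard Llama-2-7B layout), the index-derived map from the reproduction script could be extended like this:

# Extend the index-derived map with the rotary buffers, which are
# non-persistent and therefore missing from pytorch_model.bin.index.json.
num_layers = 32  # assumption: Llama-2-7B; use config.num_hidden_layers in practice
for i in range(num_layers):
    data[f"model.layers.{i}.self_attn.rotary_emb.inv_freq"] = 2 if i < num_layers // 2 else 3

A coarser alternative is a module-level map such as {"model.embed_tokens": 2, "model.layers.0": 2, ..., "model.norm": 3, "lm_head": 3}: dispatching whole submodules moves their buffers along with their parameters, so no buffer names have to be tracked by hand.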

YooSungHyun commented 6 months ago

@younesbelkada Yes, I understand, but this is very difficult for me. I found this guide: https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#the-devicemap

When I assign the decoder layers to GPU 2 and GPU 3 respectively, I get the "not on the same device" error. I think it is because decoder layer 20 runs on GPU 2 while layer 21 runs on GPU 3, so the output produced on GPU 2 and the weights sitting on GPU 3 end up on different devices. In the end I used max_memory with device_map="auto" and let the problem resolve itself. Is there a way or a trick to split the device_map smartly per layer?
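
For reference, a minimal sketch of that workaround, restricting automatic placement to GPUs 2 and 3 via max_memory (the memory caps are illustrative, and model_path is assumed defined as in the reproduction script):

# Let accelerate split the model across GPUs 2 and 3 only; devices not
# listed in max_memory are not used for weights. The caps are illustrative.
max_memory = {2: "20GiB", 3: "20GiB"}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
print(model.hf_device_map)  # inspect the resulting per-module placement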

younesbelkada commented 6 months ago

Thanks for getting back, @YooSungHyun! Yes, I think the proposed solution sounds good. Depending on your use case, you might for instance want to use device_map="balanced_low_0" to make sure the first GPU stays free.
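
A sketch of that option (per the accelerate documentation, "balanced_low_0" balances the weights across GPUs while keeping GPU 0 as free as possible, which helps when outputs are post-processed there, e.g. with `generate`):

model = AutoModelForCausalLM.from_pretrained(
    model_path,  # assumed defined as before
    device_map="balanced_low_0",
    torch_dtype=torch.bfloat16,
)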

YooSungHyun commented 6 months ago

I don't think there's a better way to do it for now. Okay, thanks.