huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Detr Models cannot be loaded with `device_map="auto"` #23145

Open chiragjn opened 1 year ago

chiragjn commented 1 year ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

from transformers import pipeline

p = pipeline(
    "object-detection", 
    model="facebook/detr-resnet-50", 
    image_processor="facebook/detr-resnet-50", 
    device_map="auto"
)

Expected behavior

This does not work because the transformers.models.detr.modeling_detr.DetrConvEncoder init involves copying weights from nn.BatchNorm2d into DetrFrozenBatchNorm2d, which is not allowed when the parameters are on a meta device.

 File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 779, in pipeline
    framework, model = infer_framework_load_model(
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/pipelines/base.py", line 262, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2629, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py", line 1373, in __init__
    self.model = DetrModel(config)
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py", line 1205, in __init__
    backbone = DetrConvEncoder(config)
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py", line 354, in __init__
    replace_batch_norm(backbone)
  File "/Users/chiragjn/venv39/lib/python3.9/site-packages/transformers/models/detr/modeling_detr.py", line 314, in replace_batch_norm
    frozen.weight.data.copy_(bn.weight)
NotImplementedError: Cannot copy out of meta tensor; no data!

The model loads fine when a specific device is passed via the device argument instead.
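The failure can be reproduced without DETR at all. A minimal sketch (not from the thread, plain PyTorch) showing that copying out of a meta tensor raises the same NotImplementedError as in replace_batch_norm:

```python
import torch

# Parameters created on the "meta" device carry shape and dtype but no
# storage, so copying data out of them fails, just as in replace_batch_norm.
bn = torch.nn.BatchNorm2d(4, device="meta")
frozen_weight = torch.empty(4)
try:
    frozen_weight.data.copy_(bn.weight)
except NotImplementedError as err:
    print(err)  # raises: cannot copy out of a meta tensor
```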

sgugger commented 1 year ago

cc @alaradirik and @amyeroberts

alaradirik commented 1 year ago

Hi @chiragjn, I was able to replicate the error on my local machine (also macOS-13.1-x86_64-i386-64bit) and I'm looking into the issue.

alaradirik commented 1 year ago

A quick update: I tracked the issue down to the accelerate library. Setting device_map="auto" sets low_cpu_mem_usage to True, which causes the model parameters to be initialized as meta tensors; these cannot be copied to CPU or GPU without first being materialized.

This issue also affects DETA, Conditional DETR, Deformable DETR and Table Transformer, as they have identical frozen modules that are initialized by copying the parameters of their respective backbone models. We will be opening a fix PR shortly!
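As a rough illustration of the mechanism described above (plain PyTorch 2.x, not the accelerate code path itself): constructing a module under the meta device skips weight allocation entirely, which is what the low_cpu_mem_usage loading path does under the hood before real weights are dispatched.

```python
import torch

# Building a module under the meta device allocates no storage for its
# parameters, roughly what accelerate's empty-weights initialization does.
with torch.device("meta"):
    bn = torch.nn.BatchNorm2d(4)

print(bn.weight.device)   # meta
print(bn.weight.is_meta)  # True
```

Any code that then tries to read these parameters (as the DETR frozen-batch-norm replacement does) hits the "no data" error.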

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

AlonZolfi commented 1 year ago

Hey, is there any progress with this issue?

amyeroberts commented 1 year ago

Hi @AlonZolfi, @alaradirik has now left Hugging Face, so I'm picking this up.

As @alaradirik mentions, this arises as a consequence of the replacement of the batch norm layers in the backbone of these models. I'll be digging into it properly next week when I have a bit more time.

Re-opening the issue as it's not yet solved and will keep you posted!

amitbaras commented 1 year ago

It was closed again; has there been any progress on the issue?

ranchlai commented 1 year ago

@amyeroberts the problem is indeed annoying; I have a similar problem fine-tuning some models like LLaMA. Is anyone working on solving it?

AlonZolfi commented 1 year ago

Hey @amyeroberts, was this issue solved already?

amyeroberts commented 12 months ago

@AlonZolfi @ranchlai No, I unfortunately haven't had bandwidth to address this yet. I'm marking it as a difficult issue that anyone in the community can try and tackle if they wish.
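For anyone in the community picking this up, one possible direction (a sketch under assumptions, not the eventual fix): a module built on the meta device can be materialized with torch.nn.Module.to_empty() before any weight copying takes place, after which its parameters hold real storage that can be written into.

```python
import torch

with torch.device("meta"):
    bn = torch.nn.BatchNorm2d(4)

# to_empty() allocates real (uninitialized) storage on the target device;
# only after this can data be copied or written into the parameters.
bn = bn.to_empty(device="cpu")
bn.weight.data.fill_(1.0)
print(bn.weight.device)  # cpu
```

A fix along these lines would need to hook into the frozen-batch-norm replacement so the copy happens only once the weights are materialized.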

ArthurZucker commented 12 months ago

I got the same problem when using accelerate; doing model.cuda() worked as expected. The related PR is #26150, where:

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")

so pinging @muellerzr as this is probably related to our hf hooks. Now I might be creating the buffers and tensors in a wrong way, but I can't get it to load, so help is appreciated! (See the UMTRelativePositionalBias class)

(using accelerate 0.22.0)

muellerzr commented 12 months ago

cc @SunMarc

SunMarc commented 12 months ago

Hi @ArthurZucker , I left a few comments on the PR to explain the issue. Hope that you have enough context to fix the problem ;)