Closed: yongjer closed this issue 1 year ago
cc @SunMarc
Hello @yongjer @LysandreJik @SunMarc
This seems like a tricky bug. I would like to try to fix it, but I may need some help on how to approach it.
The issue is: when you use `device_map="auto"`, transformers internally creates a context manager from accelerate (https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/src/transformers/modeling_utils.py#L3081 and https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/src/transformers/modeling_utils.py#L3086). You can see that this context manager basically sets the default device to "meta" (https://github.com/huggingface/accelerate/blob/dab62832de44c84e80045e4db53e087b71d0fd85/src/accelerate/big_modeling.py#L51-L81).
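To illustrate what that context manager does, here is a minimal sketch (using PyTorch's own `torch.device` context manager rather than accelerate's wrapper, which behaves analogously): modules constructed under it land on the "meta" device, so their parameters have shapes and dtypes but no actual storage.

```python
import torch
import torch.nn as nn

# Sketch of the meta-device behavior: any module created inside this
# context manager gets "meta" parameters with no backing data.
# (Requires PyTorch >= 2.0 for torch.device as a context manager.)
with torch.device("meta"):
    layer = nn.BatchNorm2d(64)

print(layer.weight.device)   # meta
print(layer.weight.is_meta)  # True
```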
During the instantiation of the DETR model, there is a step where we want to freeze the batch norm (https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/src/transformers/models/detr/modeling_detr.py#L307-L327), but the backbone, which was created with timm, is on the meta device, i.e., its weights are not materialized, so we can't copy them.
As a workaround we could guarantee that the backbone is created on a physical device, but that breaks the idea of `device_map` a bit.
Any thoughts on how to solve this issue?
If I'm not wrong (I usually am), we could solve it by not trying to copy weights into the `DetrFrozenBatchNorm2d` when the device is meta, something like:
```python
def replace_batch_norm(model):
    r"""
    Recursively replace all `torch.nn.BatchNorm2d` with `DetrFrozenBatchNorm2d`.

    Args:
        model (torch.nn.Module):
            input model
    """
    for name, module in model.named_children():
        if isinstance(module, nn.BatchNorm2d):
            new_module = DetrFrozenBatchNorm2d(module.num_features)

            # Skip the copy when the source module lives on the meta
            # device: meta tensors have no data to copy from.
            if not module.weight.device == torch.device("meta"):
                new_module.weight.data.copy_(module.weight)
                new_module.bias.data.copy_(module.bias)
                new_module.running_mean.data.copy_(module.running_mean)
                new_module.running_var.data.copy_(module.running_var)

            model._modules[name] = new_module

        if len(list(module.children())) > 0:
            replace_batch_norm(module)
```
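To see why the guard is needed, here is a small self-contained demonstration (standalone `nn.BatchNorm2d` modules stand in for the timm backbone and `DetrFrozenBatchNorm2d`): copying *out of* a meta tensor fails because there is no data behind it.

```python
import torch
import torch.nn as nn

# Source module created on the meta device, as the timm backbone is
# when device_map="auto" is active.
with torch.device("meta"):
    source = nn.BatchNorm2d(8)

target = nn.BatchNorm2d(8)  # materialized normally on CPU

try:
    target.weight.data.copy_(source.weight)
except (RuntimeError, NotImplementedError) as e:
    # PyTorch refuses to copy out of a meta tensor; this is the error
    # the unguarded replace_batch_norm runs into.
    print("copy from meta failed:", type(e).__name__)
```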
And then add something like

```python
self._no_split_modules = ["DetrModel", "DetrMLPPredictionHead", "nn.Linear"]
```

to the `DetrForObjectDetection` constructor method.
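For reference, in transformers `_no_split_modules` is usually declared as a class attribute on the `PreTrainedModel` subclass rather than set in the constructor; accelerate reads it when inferring the device map so the listed modules are never split across devices. A minimal sketch (the base class is stubbed here so the snippet is self-contained, and the module names are the suggestion from this thread, not the merged fix):

```python
class DetrPreTrainedModel:  # stand-in for transformers' real base class
    _no_split_modules = None


class DetrForObjectDetection(DetrPreTrainedModel):
    # Class-level declaration, as done for other transformers models.
    _no_split_modules = ["DetrModel", "DetrMLPPredictionHead", "nn.Linear"]


print(DetrForObjectDetection._no_split_modules)
```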
This should be solved once the PR is merged!
System Info

transformers version: 4.34.0

Who can help?

@Narsil

Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
Here is my code below:

Setting `pipeline(device_map="auto")` raises an error:
Expected behavior

When `device=0` is set rather than `device_map="auto"`, it works.