huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

NotImplementedError: Cannot copy out of meta tensor; no data when embedding to meta #31560

Open DonggeunYu opened 1 week ago

DonggeunYu commented 1 week ago

System Info

Who can help?

@amyeroberts

Information

Tasks

Reproduction

  1. To force query_position_embeddings onto the meta device, modify the offload test as follows (see the sketch after this list): max_memory = {0: max_size // 2, "cpu": max_size * 2} https://github.com/huggingface/transformers/blob/main/tests/test_modeling_common.py#L30852
  2. Run python3 -m pytest tests/models/deformable_detr/test_modeling_deformable_detr.py::DeformableDetrModelTest::test_disk_offload_safetensors
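For illustration, here is a minimal standalone sketch of the same idea outside the test suite: shrink the max_memory budget passed to from_pretrained so that accelerate has to offload part of the model (including query_position_embeddings) away from the GPU. The checkpoint name and the exact budget values are illustrative assumptions and may need tuning to reproduce the failure.

```python
# Hedged sketch, not the actual test: force offloading by shrinking the memory budget.
import tempfile

from accelerate.utils import compute_module_sizes
from transformers import DeformableDetrModel

model = DeformableDetrModel.from_pretrained("SenseTime/deformable-detr")
model_size = compute_module_sizes(model)[""]  # total model size in bytes

with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir, safe_serialization=True)

    # Halve the GPU budget, as in step 1 above; offloaded modules end up with
    # their weights on the meta device plus an accelerate hook to reload them.
    max_memory = {0: model_size // 2, "cpu": model_size * 2}
    offloaded_model = DeformableDetrModel.from_pretrained(
        tmp_dir,
        device_map="auto",
        max_memory=max_memory,
        offload_folder=tmp_dir,
    )
```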

Expected behavior

FAILED tests/models/deformable_detr/test_modeling_deformable_detr.py::DeformableDetrModelTest::test_disk_offload_safetensors - NotImplementedError: Cannot copy out of meta tensor; no data!
amyeroberts commented 1 week ago

Hi @DonggeunYu, thanks for reporting!

We'll look into it. Out of interest, how did you discover this? Was it by modifying the tests, or are the tests just an easy way to demonstrate this behaviour?

DonggeunYu commented 1 week ago

The tests are just an easy way to demonstrate this behavior. While using a private model, I discovered that there was a problem with nn.Embedding.

DonggeunYu commented 1 week ago

I may be wrong, as I still need to fully understand the transformers and accelerate code. When offload is used, weights are moved to the meta device during the init process, so the weight of the nn.Embedding created in __init__ ends up on the meta device. If the nn.Embedding is called as a module, accelerate's pre_forward hook will align the devices of the args, kwargs, and the embedding weight.

However, because the embedding weight itself is passed into the forward of another module, it arrives at that module's pre_forward hook as a meta tensor. To show this, I logged the module device and the device of the args inside pre_forward in accelerate's hook.py. Up to the point where the problematic nn.Embedding weight is used, the module device is meta and the args device is cuda. Once that weight enters another module's args, the module device is meta and the args device is also meta (the embedding weight). The error then occurs when send_to_device(args, self.execution_device) tries to move the meta tensor to cuda.
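For reference, a hedged sketch of the kind of instrumentation that can produce the log below, by wrapping accelerate's AlignDevicesHook.pre_forward; this is illustrative and not the exact patch I used.

```python
import torch
from accelerate.hooks import AlignDevicesHook

_orig_pre_forward = AlignDevicesHook.pre_forward

def logging_pre_forward(self, module, *args, **kwargs):
    # Print the module class, the devices of its own parameters, and the devices of its tensor args.
    param_devices = sorted({str(p.device) for p in module.parameters(recurse=False)})
    arg_devices = sorted({str(a.device) for a in args if isinstance(a, torch.Tensor)})
    print(module.__class__.__name__, param_devices, arg_devices)
    return _orig_pre_forward(self, module, *args, **kwargs)

AlignDevicesHook.pre_forward = logging_pre_forward
```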

module.__class__.__name__, device of module, device of args
Linear [device(type='meta')] [device(type='cuda', index=0)]
LayerNorm [device(type='meta')] [device(type='cuda', index=0)]
Linear [device(type='meta')] [device(type='cuda', index=0)]
Linear [device(type='meta')] [device(type='cuda', index=0)]
LayerNorm [device(type='meta')] [device(type='cuda', index=0)]
Linear [device(type='meta')] [device(type='meta')]

(References: def pre_forward of accelerate; nn.Embedding of transformers.)
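Taking that last step in isolation, a minimal sketch (shapes are arbitrary): copying a meta tensor to a real device is exactly what raises this error, because a meta tensor carries no data.

```python
import torch

# Stand-in for an offloaded embedding weight: allocated on the meta device, so it has no data.
meta_weight = torch.empty(300, 512, device="meta")

try:
    meta_weight.to("cpu")  # same failure for "cuda:0"; this is what send_to_device ends up doing
except NotImplementedError as err:
    print(err)  # -> Cannot copy out of meta tensor; no data!
```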

amyeroberts commented 1 week ago

@DonggeunYu Thanks for the update. Indeed, using the embedding weights directly rather than the layer in the forward pass is quite odd. cc @muellerzr, who knows more about accelerate's pre_forward hook.
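To make the pattern concrete, here is a hedged, simplified sketch (not the actual DeformableDetr code) of what "using the embedding weights rather than the layer" looks like: the nn.Embedding is never called, only its .weight is read and handed to another module, so a pre_forward hook attached to the embedding never fires and its weight can still be sitting on the meta device.

```python
import torch
from torch import nn

class Decoder(nn.Module):
    def forward(self, query_embeds: torch.Tensor) -> torch.Tensor:
        return query_embeds  # placeholder for the real decoder logic

class ToyModel(nn.Module):
    def __init__(self, num_queries: int = 300, d_model: int = 256):
        super().__init__()
        self.query_position_embeddings = nn.Embedding(num_queries, d_model)
        self.decoder = Decoder()

    def forward(self) -> torch.Tensor:
        # The weight tensor is passed directly; self.query_position_embeddings(...)
        # is never invoked, so any hook attached to the embedding module never runs.
        query_embeds = self.query_position_embeddings.weight
        return self.decoder(query_embeds)
```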

DonggeunYu commented 1 day ago

@amyeroberts @muellerzr Any progress on this?

muellerzr commented 1 day ago

cc @SunMarc