larekrow opened this issue 1 year ago
I am also encountering this issue. I ran inference_test.py to load OPT-IML-30B downloaded from Hugging Face.
Hi, any updates?
I could successfully run the script. I first saved the sharded checkpoints to a custom directory and then loaded the sharded ones for inference (#2379 helped me a lot!).

Maybe you could try setting the arg `replace_method` to `auto`.
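In case it helps, the two-step flow I describe looks roughly like this (a sketch with placeholders; `model` and the path are stand-ins, and `save_mp_checkpoint_path` / `replace_method` are the `init_inference` options I mean):

```python
import torch
import deepspeed

# Sketch only: the first run writes sharded checkpoints to a custom
# directory; later runs load the sharded ones directly.
engine = deepspeed.init_inference(
    model,                                # the loaded HF model (placeholder)
    mp_size=2,                            # number of tensor-parallel shards
    dtype=torch.float16,
    replace_method="auto",                # the suggestion above
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="./sharded",  # where the sharded checkpoints go
)
```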
Hi @qtli, appreciate the suggestion, but I did not use `'replace_method': 'auto'` following PR #2831. I did try to run it again upon your suggestion for good measure though -- same error. I also did not use the method to obtain tensor parallels from HuggingFace weights, as avoiding HF is the goal (since 175B is not on HF). I want to use the metaseq OPT-IML TPs directly.
This error is encountered when `'checkpoint': checkpoint_json` is used, `'replace_with_kernel_inject': True`, and `isinstance(self.module, torch.nn.Module) == True`. Not sure if `tp_size > 1` contributes to the condition that encounters this error.
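For context, the `checkpoint_json` I pass is a small descriptor file along these lines (the key layout follows the DeepSpeed examples, but the `"type"` value and the shard filenames here are assumptions/placeholders for the two metaseq TP files):

```python
import json

# Sketch of the checkpoint descriptor ("type" value and shard names are
# placeholders, not confirmed values for the metaseq setup).
checkpoint_files = ["reshard-model_part-0.pt", "reshard-model_part-1.pt"]
with open("checkpoints.json", "w") as f:
    json.dump({"type": "ds_model", "checkpoints": checkpoint_files, "version": 1.0}, f)
```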
This could be related to #2616 but I am not sure. I circumvented the `state_dict` issues by adding custom code in `_load_checkpoint()` in `engine.py`:
```python
import torch

def _metaseq_opt_to_pt(sd):
    # Drop metaseq-only bookkeeping keys.
    keys_to_delete = [
        "decoder.version",
    ]
    for key in keys_to_delete:
        if key in sd:
            sd.pop(key)
    # Rename the final layer norm to the name the injected module expects.
    keys_to_rename = {
        "decoder.layer_norm.weight": "decoder.final_layer_norm.weight",
        "decoder.layer_norm.bias": "decoder.final_layer_norm.bias",
    }
    for old_key, new_key in keys_to_rename.items():
        if old_key in sd:
            sd[new_key] = sd.pop(old_key)
    # Split the fused QKV projection into separate Q/K/V tensors.
    for key in list(sd.keys()):
        if ".qkv_proj." in key:
            q_name = key.replace(".qkv_proj.", ".q_proj.")
            k_name = key.replace(".qkv_proj.", ".k_proj.")
            v_name = key.replace(".qkv_proj.", ".v_proj.")
            value = sd[key]
            depth = value.shape[0]
            assert depth % 3 == 0
            # `SequeuceParallelTransformerBlock` stores the fused weight in
            # K, V, Q order despite the "qkv" naming:
            # https://cs.github.com/facebookresearch/metaseq/blob/51871bd73cd04c038f239ea2a26db1d7f6b37927/metaseq/modules/sequence_parallel_transformer_layer.py#L97
            k, v, q = torch.split(value, depth // 3, dim=0)
            sd[q_name] = q
            sd[k_name] = k
            sd[v_name] = v
            del sd[key]
    return sd

...

checkpoint[self._choose_module_key(checkpoint)] = _metaseq_opt_to_pt(
    checkpoint[self._choose_module_key(checkpoint)])
self.module.load_state_dict(
    state_dict=checkpoint[self._choose_module_key(checkpoint)],
    strict=load_module_strict)
```
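For completeness, this is roughly how I sanity-check the conversion on a single shard before wiring it into `_load_checkpoint()` (the shard filename and the `"model"` key are assumptions about the metaseq checkpoint layout):

```python
import torch

# Hypothetical quick check on one metaseq TP shard; assumes
# _metaseq_opt_to_pt from above is defined in the session.
shard = torch.load("reshard-model_part-0.pt", map_location="cpu")
sd = _metaseq_opt_to_pt(shard["model"])  # "model" key is an assumption

# After conversion, no fused QKV keys should remain, and the renamed
# final layer norm should be present.
assert not any(".qkv_proj." in k for k in sd)
assert "decoder.final_layer_norm.weight" in sd
```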
Would appreciate any help or suggestion.
I am seeing this error too. DeepSpeed version 0.9.2.

```python
config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = AutoModelForCausalLM.from_config(config)
model = deepspeed.init_inference(
    model,
    tensor_parallel=tp_config,
    base_dir=repo_root,
    replace_with_kernel_inject=args.kernel_injection,
    **kwargs,
)
```
With `replace_with_kernel_inject = False`, I get this error:

```
model = deepspeed.init_inference(
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 204, in __init__
    self._apply_injection_policy(config, client_module)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 396, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 494, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 727, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 752, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 752, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 752, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 744, in _replace_module
    replaced_module = policies[child.__class__][0](child, policies[child.__class__][-1], layer_id)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 490, in replace_fn
    new_module = replace_wo_policy(child, _policy)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 473, in replace_wo_policy
    return _replace_module(module)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 470, in _replace_module
    _replace_module(child, name)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 466, in _replace_module
    setattr(r_module, name, linear_policies[child.__class__](child, prev_name + '.' + name,
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 392, in _replace
    data = mp_replace.copy(new_weight, child.weight.data)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 89, in copy
    assert not dst.data.is_meta # the torch.Tensor.copy_ method used below will silently fail on meta tensors
```
With `replace_with_kernel_inject = True`:

```
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 333, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 207, in __init__
    self.module.to(device)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
```
Does this depend on what weights are being loaded? I am running OPT from Hugging Face.
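For what it's worth, the `.to()` failure reproduces in isolation with any meta tensor, so it does not seem specific to the weights themselves:

```python
import torch

# A meta tensor carries shape/dtype only, with no storage behind it,
# so there is nothing for .to() to copy.
t = torch.empty(2, 2, device="meta")
t.to("cpu")  # NotImplementedError: Cannot copy out of meta tensor; no data!
```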
Hi @molohov, have you solved this problem?
The same issue here.
The same issue
I had some success loading the model this way:

```python
with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
model = deepspeed.init_inference(
    model,
    tensor_parallel=tp_config,
    base_dir=repo_root,
    replace_with_kernel_inject=args.kernel_injection,
    **kwargs,
)
```

I think this is because `low_cpu_mem_usage=True` initializes the HF model with meta tensors for you, allowing DS to copy it correctly.
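If it helps to confirm, you can check the meta initialization before calling `init_inference` (illustrative, with `model` as above):

```python
# Under OnDevice(device="meta") the parameters should be meta tensors
# (shape/dtype only, no storage) until DeepSpeed materializes the weights.
print(all(p.is_meta for p in model.parameters()))    # expected: True
print(sum(p.numel() for p in model.parameters()))    # shapes are still known
```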
> I had some success loading the model this way: […]
Does this really work for anyone? With OPT this fails for me.
**The bug**

In `deepspeed/module_inject/replace_module.py`, `replace_module()` is called on meta tensors before the actual weights are loaded just a few lines below, resulting in a `NotImplementedError: Cannot copy out of meta tensor; no data!` error.
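To illustrate the ordering problem in a self-contained way (plain PyTorch, hypothetical names, not the actual DeepSpeed code): copying out of a meta-initialized module fails until real weights exist, which is what happens when the replacement runs before the checkpoint load.

```python
import torch
import torch.nn as nn

# Module built on the meta device: weights have shape/dtype but no data.
layer = nn.Linear(4, 4, device="meta")
dst = torch.empty(4, 4)

try:
    dst.copy_(layer.weight.data)  # "replace" before "load" -> crash
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!

# Once real weights exist (the "load" step), the same copy succeeds.
layer = nn.Linear(4, 4)
dst.copy_(layer.weight.data)
```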
**The code excerpt**

`checkpoint_json`
**The error**

Some line numbers in the traceback may be inaccurate due to incorporating changes from GH-2940 and my own code.
**ds_report**

**System info**
**Additional context**

I am trying to load OPT-IML-30B, downloaded as 2 TPs from Metaseq, before I move on to OPT-IML-175B, which has 16 TPs.
Please advise on how to proceed, thank you!