microsoft/DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ZeRO++ quantizer does not support BFloat16 #3992

Open vv12kant opened 1 year ago

vv12kant commented 1 year ago

Describe the bug
When using the ZeRO++ and BFloat16 features simultaneously, the gathered parameter is sometimes Float16 (Half) dtype, but the intermediate results are still BFloat16 dtype.

To Reproduce
Steps to reproduce the behavior:

  1. Set the DeepSpeed configuration, enabling ZeRO++ and bf16.

    ds_config = {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": "cpu"
            },
            "offload_optimizer": {
                "device": "cpu"
            },
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 3e7,
            "stage3_prefetch_bucket_size": 3e7,
            "memory_efficient_linear": False,

            # ZeRO++ features: quantized weights (qwZ), hierarchical
            # partitioning (hpZ), and quantized gradients (qgZ)
            "zero_quantized_weights": True,
            "zero_hpz_partition_size": 8,
            "zero_quantized_gradients": True,
        },
        # fp16 disabled, bf16 enabled
        "fp16": {
            "enabled": False
        },
        "bf16": {
            "enabled": True
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False,
        "hybrid_engine": {
            "enabled": True,
            "inference_tp_size": 8,
            "release_inference_cache": release_inference_cache,
            "pin_parameters": pin_parameters,
            "tp_gather_partition_size": 8,
            "max_out_tokens": 512,
        }
    }
  2. Create a Bloom HF model and initialize the model engine with ds_config.

    from transformers import AutoModelForCausalLM
    import deepspeed

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
    actor_engine, *_ = deepspeed.initialize(model=model, optimizer=optim, config=ds_config)
  3. Run inference (a sketch of the call is shown after the traceback below).

  4. See error.

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 245, in generate
        generate_ret_vals = self._generate(*inputs, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1568, in generate
        return self.sample(
      File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2615, in sample
        outputs = self(
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
        result = forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 927, in forward
        lm_logits = self.lm_head(hidden_states)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
        result = forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/layers.py", line 52, in forward
        output = torch.matmul(input, self.weight.transpose(-1, -2))
    RuntimeError: expected scalar type BFloat16 but found Half
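
For reference, a minimal sketch of the inference call in step 3. The prompt and `tokenizer` here are placeholders rather than the original script; the call goes through the hybrid engine's `generate`, which is where the traceback above starts.

    import torch

    # Placeholder prompt/tokenizer; `actor_engine` is the engine returned by deepspeed.initialize above.
    batch = tokenizer("test prompt", return_tensors="pt").to(actor_engine.device)
    actor_engine.eval()
    with torch.no_grad():
        output_ids = actor_engine.generate(batch["input_ids"], max_new_tokens=64)
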
Desein-Yang commented 1 year ago

I guess it comes from transformers/modeling_bloom, which is missing a type adaptation.
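
A rough, untested sketch of what such a cast could look like, based on `LinearLayer.forward` in `deepspeed/module_inject/layers.py` from the traceback above (the bias handling is an assumption, not copied from the source):

    import torch

    def forward(self, input):
        # Work around the mismatch: align the activation dtype (BFloat16) with the
        # dtype of the gathered/quantized weight (Half) before the failing matmul.
        if input.dtype != self.weight.dtype:
            input = input.to(self.weight.dtype)
        output = torch.matmul(input, self.weight.transpose(-1, -2))
        if self.bias is not None:
            output += self.bias
        return output
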

azahed98 commented 1 year ago

I am seeing the same issue with Llama 2. Were you able to get this working?

SeunghyunSEO commented 8 months ago

> I guess it comes from transformers/modeling_bloom, which is missing a type adaptation.

Hi guys, I encountered the same issue, and I guess it is not caused by a specific modeling_xxx.py. In my case, when I cast the model to bfloat16 and use a single GPU, the activations of all layers are bf16, but when I use multiple GPUs with ZeRO++, the activations are fp16. So I think it is a DeepSpeed issue.
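
A rough diagnostic sketch of how this can be checked (my own helper, not part of DeepSpeed): register forward hooks on the Linear-like modules, print the dtypes they see, and compare a single-GPU run against a multi-GPU ZeRO++ run.

    import torch

    def register_dtype_hooks(model):
        """Print input/output dtypes of Linear-like modules for dtype debugging."""
        def make_hook(name):
            def hook(module, inputs, output):
                if inputs and isinstance(inputs[0], torch.Tensor) and isinstance(output, torch.Tensor):
                    print(f"{name}: input={inputs[0].dtype}, output={output.dtype}")
            return hook
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear) or "lm_head" in name:
                module.register_forward_hook(make_hook(name))
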

jacklanda commented 5 months ago

Does DeepSpeed support bf16 with ZeRO++ now?

YooSungHyun commented 2 months ago

I want to use this feature but cannot 😥