microsoft/DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ZeRO++ quantizer does not support BFloat16 #3992

Open vv12kant opened 1 year ago

vv12kant commented 1 year ago

Describe the bug
When using the ZeRO++ and BFloat16 features simultaneously, the gathered parameter is sometimes Float16 (Half) dtype, but the intermediate results are still BFloat16 dtype.

To Reproduce
Steps to reproduce the behavior:

  1. Set the DeepSpeed configuration, enabling ZeRO++ and bf16.

    ds_config = {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": "cpu"
            },
            "offload_optimizer": {
                "device": "cpu"
            },
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 3e7,
            "stage3_prefetch_bucket_size": 3e7,
            "memory_efficient_linear": False,

            # ZeRO++ features: quantized weights (qwZ), hierarchical
            # partitioning (hpZ), and quantized gradients (qgZ)
            "zero_quantized_weights": True,
            "zero_hpz_partition_size": 8,
            "zero_quantized_gradients": True,
        },
        # fp16 disabled, bf16 enabled
        "fp16": {
            "enabled": False
        },
        "bf16": {
            "enabled": True
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False,
        "hybrid_engine": {
            "enabled": True,
            "inference_tp_size": 8,
            "release_inference_cache": release_inference_cache,
            "pin_parameters": pin_parameters,
            "tp_gather_partition_size": 8,
            "max_out_tokens": 512,
        }
    }
  2. Create a Bloom HF model and initialize the model engine with ds_config.

    from transformers import AutoModelForCausalLM
    import deepspeed

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
    actor_engine, *_ = deepspeed.initialize(model=model, optimizer=optim, config=ds_config)
  3. Run inference (a sketch of the call is shown after the traceback below).

  4. See error.

    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 245, in generate
        generate_ret_vals = self._generate(*inputs, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1568, in generate
        return self.sample(
      File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2615, in sample
        outputs = self(
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
        result = forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 927, in forward
        lm_logits = self.lm_head(hidden_states)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
        result = forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/layers.py", line 52, in forward
        output = torch.matmul(input, self.weight.transpose(-1, -2))
    RuntimeError: expected scalar type BFloat16 but found Half
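
For reference, a minimal sketch of the inference call in step 3. The prompt and `tokenizer` here are placeholders rather than the original script; the call goes through the hybrid engine's `generate`, which is where the traceback above starts.

    import torch

    # Placeholder prompt/tokenizer; `actor_engine` is the engine returned by deepspeed.initialize above.
    batch = tokenizer("test prompt", return_tensors="pt").to(actor_engine.device)
    actor_engine.eval()
    with torch.no_grad():
        output_ids = actor_engine.generate(batch["input_ids"], max_new_tokens=64)
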
Desein-Yang commented 1 year ago

I guess it comes from transformers/modeling_bloom, which is missing a type adaptation.
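
A rough, untested sketch of what such a cast could look like, based on `LinearLayer.forward` in `deepspeed/module_inject/layers.py` from the traceback above (the bias handling is an assumption, not copied from the source):

    import torch

    def forward(self, input):
        # Work around the mismatch: align the activation dtype (BFloat16) with the
        # dtype of the gathered/quantized weight (Half) before the failing matmul.
        if input.dtype != self.weight.dtype:
            input = input.to(self.weight.dtype)
        output = torch.matmul(input, self.weight.transpose(-1, -2))
        if self.bias is not None:
            output += self.bias
        return output
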

azahed98 commented 1 year ago

I am seeing the same issue with Llama 2. Were you able to get this working?

SeunghyunSEO commented 8 months ago

> I guess it comes from transformers/modeling_bloom, which is missing a type adaptation.

Hi guys, I encountered the same issue, and I guess it is not caused by a specific modeling_xxx.py. In my case, when I cast the model to bfloat16 and use a single GPU, the activations of all layers are bf16, but when I use multiple GPUs with ZeRO++, the activations are fp16. So I think it is a DeepSpeed issue.
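
A rough diagnostic sketch of how this can be checked (my own helper, not part of DeepSpeed): register forward hooks on the Linear-like modules, print the dtypes they see, and compare a single-GPU run against a multi-GPU ZeRO++ run.

    import torch

    def register_dtype_hooks(model):
        """Print input/output dtypes of Linear-like modules for dtype debugging."""
        def make_hook(name):
            def hook(module, inputs, output):
                if inputs and isinstance(inputs[0], torch.Tensor) and isinstance(output, torch.Tensor):
                    print(f"{name}: input={inputs[0].dtype}, output={output.dtype}")
            return hook
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear) or "lm_head" in name:
                module.register_forward_hook(make_hook(name))
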

jacklanda commented 5 months ago

Does DeepSpeed support bf16 with ZeRO++ now?

YooSungHyun commented 2 months ago

I want to use this feature but cannot 😥