microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Updating transformers issue with bloom models #541

Open loadams opened 3 weeks ago

loadams commented 3 weeks ago

Updating transformers beyond v4.43.4 breaks the legacy-mode CI tests. The bloom tests fail with:

FAILED test_non_persistent_deployment.py::test_single_GPU[None-50050-False-28080-fp16-1-False-False-1-True-False-ds_config0-text-generation-bigscience/bloom-560m-query3-non-persistent] - ValueError: not enough values to unpack (expected 2, got 0)
FAILED test_local_deployment.py::test_session[None-local-50050-False-28080-fp16-1-False-False-1-True-False-ds_config0-text-generation-bigscience/bloom-560m-query0] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_local_deployment.py::test_multi_GPU[None-local-50050-False-28080-fp16-1-False-False-1-True-False-ds_config0-text-generation-bigscience/bloom-560m-query0] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_local_deployment.py::test_single_GPU[None-local-50050-False-28080-fp16-1-False-False-1-True-False-ds_config0-text-generation-bigscience/bloom-560m-query3] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_deployment_options.py::test_meta_tensor[query0-None-bigscience/bloom-560m-local-50050-False-28080-text-generation-fp16-False-1-True-False-ds_config0-2-True] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_deployment_options.py::test_load_to_sys_mem[query0-None-bigscience/bloom-560m-local-50050-False-28080-text-generation-fp16-1-False-1-True-False-ds_config0-True] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_deployment_options.py::test_restful_api[query0-28080-None-bigscience/bloom-560m-local-50050-text-generation-fp16-1-False-False-1-True-False-ds_config0-True] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
FAILED test_deployment_options.py::test_replicas[query0-None-bigscience/bloom-560m-local-50050-False-28080-text-generation-fp16-1-False-False-True-False-ds_config0-2] - grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
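
The first failure is an ordinary Python unpacking error: somewhere in the pipeline a call that is expected to yield two values returns an empty result. A minimal illustration of that exact error message (the names here are purely illustrative, not from the MII code):

```python
# Unpacking an empty iterable into two targets raises the same ValueError
# reported by test_non_persistent_deployment.py.
try:
    first, second = ()  # zero values, two targets
except ValueError as err:
    print(err)  # -> not enough values to unpack (expected 2, got 0)
```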

We have bisected the failure to this transformers change: https://github.com/huggingface/transformers/pull/31445

../../mii/legacy/client.py:144: in query
    return task_methods.run_inference(inference_pipeline, args, query_kwargs)
../../mii/legacy/method_table.py:101: in run_inference
    response = inference_pipeline(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/text_generation.py:262: in __call__
    return super().__call__(text_inputs, **kwargs)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/base.py:1238: in __call__
    outputs = list(final_iterator)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py:124: in __next__
    item = next(self.iterator)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/pt_utils.py:125: in __next__
    processed = self.infer(item, **self.params)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/base.py:1164: in forward
    model_outputs = self._forward(model_inputs, **forward_params)
../../../venv/lib/python3.12/site-packages/transformers/pipelines/text_generation.py:351: in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
../../../venv/lib/python3.12/site-packages/deepspeed/inference/engine.py:631: in _generate
    return self.module.generate(*inputs, **kwargs)
../../../venv/lib/python3.12/site-packages/torch/utils/_contextlib.py:116: in decorate_context
    return func(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/transformers/generation/utils.py:2024: in generate
    result = self._sample(
../../../venv/lib/python3.12/site-packages/transformers/generation/utils.py:2982: in _sample
    outputs = self(**model_inputs, return_dict=True)
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1747: in _call_impl
    return forward_call(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/transformers/models/bloom/modeling_bloom.py:955: in forward
    transformer_outputs = self.transformer(
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1747: in _call_impl
    return forward_call(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/transformers/models/bloom/modeling_bloom.py:744: in forward
    outputs = block(
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1747: in _call_impl
    return forward_call(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py:162: in forward
    self.attention(input,
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1747: in _call_impl
    return forward_call(*args, **kwargs)
../../../venv/lib/python3.12/site-packages/deepspeed/ops/transformer/inference/ds_attention.py:168: in forward
    context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
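
Until the incompatibility is resolved, one stopgap (a sketch only, not an official fix; the function name and warning text here are hypothetical) is to guard against known-bad transformers versions at startup, using `packaging`, which transformers already depends on:

```python
# Hypothetical guard: warn when the installed transformers version is newer
# than v4.43.4, the last release known to work with the legacy bloom path.
import warnings

from packaging.version import parse

LAST_KNOWN_GOOD = parse("4.43.4")


def check_transformers_version(installed: str) -> bool:
    """Return True if `installed` is at or below the last known-good release."""
    ok = parse(installed) <= LAST_KNOWN_GOOD
    if not ok:
        warnings.warn(
            f"transformers {installed} is newer than {LAST_KNOWN_GOOD}; "
            "legacy bloom inference may fail (see transformers PR 31445)."
        )
    return ok
```

For example, `check_transformers_version("4.43.4")` returns True, while `check_transformers_version("4.44.0")` returns False and emits a warning. In CI, pinning `transformers<=4.43.4` in the test requirements achieves the same effect without code changes.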