triton-inference-server / fastertransformer_backend


How much VRAM does BLOOM consume? #76

Open pai4451 opened 1 year ago

pai4451 commented 1 year ago

Hi, thanks for supporting the BLOOM model in the latest release of the FasterTransformer backend.

I tried the latest code on my 8x A6000 server with 48 GB of memory per GPU (384 GB in total). After converting the BLOOM model with FasterTransformer/examples/pytorch/gpt/utils/huggingface_bloom_convert.py with tp=8 and dt=fp16, the resulting checkpoint is about 330 GB.

But after I ran tritonserver, the following out-of-memory error occurred: what(): [FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:32.

I wonder how much memory BLOOM-176B consumes with the FasterTransformer backend. I can run BLOOM inference on my 8x A6000 server with the HuggingFace package, so it seems that the FasterTransformer library allocates more memory than the model itself requires.
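For reference, here is the back-of-envelope arithmetic behind that expectation (a rough sketch; the parameter count and byte sizes below are assumed round numbers, not measured values):

# Back-of-envelope estimate of the fp16 weight shards alone (assumed values).
params = 176e9           # approximate BLOOM-176B parameter count
bytes_per_param = 2      # fp16
tp = 8                   # tensor-parallel ranks

total_gb = params * bytes_per_param / 1e9   # ~352 GB of raw weights (~328 GiB,
                                            # consistent with the ~330G checkpoint)
per_gpu_gb = total_gb / tp                  # ~44 GB per GPU if split perfectly evenly
print(f"total ~{total_gb:.0f} GB, per-GPU shard ~{per_gpu_gb:.0f} GB")
# That already leaves very little of a 48 GB A6000 for activations and the KV cache.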

byshiue commented 1 year ago

Can you post your Triton config.pbtxt and the model's config.ini?

pai4451 commented 1 year ago

Hi @byshiue,

My config.pbtxt is taken from the sample config.pbtxt, and I only added some dynamic batching settings. I also changed input_ids to allow ragged batching. Are these settings correct if I want to enable dynamic batching?

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "bloom"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

dynamic_batching {
  max_queue_delay_microseconds: 50000
}

batch_input [
  {
    kind: BATCH_ITEM_SHAPE
    target_name: "input_ids_item_shape"
    data_type: TYPE_INT32
    source_input: "input_ids",
  }
]

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ],
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_context_embeddings"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "request_prompt_embedding"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
  },
  {
    name: "request_prompt_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "request_prompt_type"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_reset_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "response_input_lengths"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_embeddings"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "8"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "GPT"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/data/hf/tmp/8-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
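
For completeness, here is a minimal client sketch that exercises this config (a sketch only; the token ids and lengths are hypothetical, and it assumes tritonclient[http] is installed and the server listens on localhost:8000). As I understand ragged batching, each request still sends its own variable-length input_ids and Triton concatenates requests for the backend rather than padding them:

# Minimal Triton HTTP client sketch (hypothetical request values).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 4865, 398, 287]], dtype=np.uint32)       # [batch, seq_len], placeholder ids
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)  # [batch, 1]
request_output_len = np.array([[32]], dtype=np.uint32)             # generate up to 32 tokens

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    t = httpclient.InferInput(name, list(data.shape), "UINT32")
    t.set_data_from_numpy(data)
    inputs.append(t)

outputs = [httpclient.InferRequestedOutput("output_ids"),
           httpclient.InferRequestedOutput("sequence_length")]

result = client.infer("fastertransformer", inputs, outputs=outputs)
print(result.as_numpy("output_ids"), result.as_numpy("sequence_length"))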

The parameters in config.ini are taken from this reference file:

[gpt]
model_name=bloom
num_layer=70
head_num=112
inter_size=57344
size_per_head=128
vocab_size=250880
tensor_para_size=8
weight_data_type=fp16
model_variant=bloom-pre
layernorm_eps=1e-05
layernorm_type=pre_layernorm
activation_type=Gelu
has_positional_encoding=False
has_pre_decoder_layernorm=True
has_post_decoder_layernorm=True
use_attention_linear_bias=True
start_id=1
end_id=2
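
As a quick sanity check on these values (a small sketch; the relationships below only restate the BLOOM hyperparameters implied by this config):

# Consistency check for the [gpt] section above.
head_num = 112
size_per_head = 128
inter_size = 57344

hidden_size = head_num * size_per_head   # 112 * 128 = 14336, BLOOM's hidden size
assert inter_size == 4 * hidden_size     # BLOOM uses a 4x FFN expansion
print("hidden_size =", hidden_size)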

I can run BLOOM-560m or BLOOM-7b without issue on a single A6000, but so far I cannot run BLOOM-176B with the FasterTransformer backend on 8x A6000. I converted the BLOOM checkpoint as described in the docs: python3 {FT_DIR}/examples/pytorch/gpt/utils/huggingface_bloom_convert.py -o /data/hf/tmp/ -i ./bloom/ -tp 8 -dt fp16.

byshiue commented 1 year ago

The reason is that we don't split the vocab embedding table, and we assume the table used for the embedding lookup and the table used for logit computation can be different. These two tables take about 13 GB on each GPU, so running BLOOM on FT requires at least 55 GB per GPU.
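
To make that concrete, here is a rough per-GPU breakdown under those assumptions (a sketch only; the ~330 GB checkpoint size is taken from the report above and simply treated as split evenly across ranks):

# Rough per-GPU memory breakdown for BLOOM-176B on FT (illustrative, not measured).
vocab_size = 250880
hidden_size = 14336      # 112 heads * 128 per head
bytes_fp16 = 2
tp = 8

table_gib = vocab_size * hidden_size * bytes_fp16 / 2**30  # ~6.7 GiB per table
both_tables_gib = 2 * table_gib                            # lookup + logit tables, unsplit: ~13.4 GiB

shard_gib = 330 / tp                                       # ~41 GiB per GPU from the tp=8 checkpoint
total_gib = both_tables_gib + shard_gib                    # ~55 GiB before activations and KV cache

print(f"~{both_tables_gib:.1f} GiB of embedding tables + ~{shard_gib:.1f} GiB weight shard "
      f"= ~{total_gib:.1f} GiB per GPU")
# Roughly the "at least 55 GB" figure above, already beyond a 48 GB A6000.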

pai4451 commented 1 year ago

Is there any reason (e.g. performance) for not splitting the vocab embedding? If a single GPU needs to allocate 13 GB for it, then with tensor parallelism of 8 that is 104 GB of duplicated data.

byshiue commented 1 year ago

If we split the table, we need to introduce a communication step after the embedding lookup. We will consider whether or not to split it in the future.
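
For context on that trade-off, a vocab-parallel embedding in the Megatron-LM style (which is not what FT currently does) shards the table by vocabulary rows, at the cost of an all-reduce after the lookup. A rough sketch of the idea, assuming a torch.distributed process group is already initialized:

# Illustrative sketch of a vocab-parallel embedding lookup (Megatron-LM style);
# not FT's implementation. Assumes torch.distributed has been initialized.
import torch
import torch.distributed as dist

def vocab_parallel_embedding(input_ids: torch.LongTensor,
                             weight_shard: torch.Tensor,
                             vocab_start: int, vocab_end: int) -> torch.Tensor:
    """weight_shard holds rows [vocab_start, vocab_end) of the full embedding table."""
    # Token ids owned by other ranks are masked out locally.
    mask = (input_ids < vocab_start) | (input_ids >= vocab_end)
    local_ids = (input_ids - vocab_start).clamp(min=0)
    local_ids[mask] = 0

    out = torch.nn.functional.embedding(local_ids, weight_shard)
    out[mask] = 0.0

    # This is the extra communication that splitting the table introduces:
    # each rank contributes its partial rows, summed into the full embeddings.
    dist.all_reduce(out, op=dist.ReduceOp.SUM)
    return out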

pai4451 commented 1 year ago

@byshiue Thanks for considering it.

Currently, I use the HuggingFace text-generation-inference server, which HuggingFace runs in production, and I can serve BLOOM-176B on 8x A6000 with that framework. Here is how they load and handle the tensor-parallel sharded model, for your reference.