triton-inference-server / fastertransformer_backend


Question on how to set --shape when using perf_analyzer #117

Open YJHMITWEB opened 1 year ago

YJHMITWEB commented 1 year ago

Hi, I am trying to use perf_analyzer on the predefined models in FasterTransformer, such as gpt, gptj, etc.

I am very confused about how to properly set --shape for the different inputs when using perf_analyzer.

For example, given the config.ini of the model:

[gpt]
model_name=gpt
max_pos_seq_len=2048 ;for position embedding tables
head_num=12
size_per_head=64
inter_size=3072
num_layer=12
vocab_size=50257
start_id=50256
end_id=50256
weight_data_type=fp32
prompt_learning_start_id=50257
prompt_learning_type=3
num_tasks=1

[task_0]
task_name=sentiment
prompt_length=10

[task_1]
task_name=intent_and_slot
prompt_length=10

[task_2]
task_name=squad
prompt_length=16

And given the gpt config.pbtxt under all_models/gpt/fastertransformer:

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt3_345M"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

dynamic_batching {
   max_queue_delay_microseconds: 50000
}

batch_input [
  {
    kind: BATCH_ITEM_SHAPE
    target_name: "input_ids_item_shape"
    data_type: TYPE_INT32
    source_input: "input_ids"
  }
]

# this is not needed when not using request prompt embedding
#batch_input [
#  {
#    kind: BATCH_ITEM_SHAPE
#    target_name: "request_prompt_embedding_item_shape"
#    data_type: TYPE_INT32
#    source_input: "request_prompt_embedding"
#  }
#]

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_context_embeddings"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "request_prompt_embedding"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: false
  },
  {
    name: "request_prompt_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "request_prompt_type"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_reset_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "response_input_lengths"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_embeddings"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "GPT"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/workspace/all_models/gpt/fastertransformer/1/c-model/345m/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}

When I use perf_analyzer, it asks me to specify the --shape of the following inputs: bad_words_list, input_ids, request_output_len, request_prompt_embedding, and stop_words_list, so I set them as follows:

perf_analyzer -m fastertransformer --shape bad_words_list:2,2 --shape input_ids:10 --shape request_output_len:30 --shape request_prompt_embedding:10,768 --shape stop_words_list:2,3

But this gives errors like the following:

[FT][ERROR] The session_len_ (1936311921) of request is longer than max_seq_len (2048) of embedding table. This is a invalid input. Setting the session_len_ to 2048.
[FT][WARNING] beam_width = 1936311911 is invalid. Set it to 1 to use sampling by default.

It seems there is some illegal memory access. What I expect is a batch size of 10, with each output having a length of 10; as for the rest of the params, I am confused about why and how I should set them.

byshiue commented 1 year ago

It looks like you are using random numbers for some arguments, like request_output_len, and the random numbers are invalid.
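
Note: by default, perf_analyzer fills every input tensor with random bytes, which is why the log shows garbage values such as session_len_ (1936311921) and beam_width = 1936311911. Real values can be supplied through a JSON file passed with perf_analyzer's --input-data option. Below is a minimal sketch of such a file; the file name data.json and the token IDs are illustrative placeholders, not values from this issue, and the shapes omit the batch dimension because the model has max_batch_size > 0.

{
  "data": [
    {
      "input_ids": {
        "content": [9915, 27221, 59, 77, 383, 1853, 1462, 281, 318, 198],
        "shape": [10]
      },
      "input_lengths": [10],
      "request_output_len": {
        "content": [30],
        "shape": [1]
      }
    }
  ]
}

Optional inputs such as stop_words_list, bad_words_list, and request_prompt_embedding can simply be left out of the file; perf_analyzer only sends optional inputs for which the data file provides values, so the --shape flags for them should no longer be needed.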

YJHMITWEB commented 1 year ago

It looks like you are using random numbers for some arguments, like request_output_len, and the random numbers are invalid.

Hi @byshiue, I see what you mean: basically --shape only specifies the shape, not the actual values. So we only need the shape of request_output_len to be 1, and then to specify its actual value as 30. I am wondering whether it is possible to pass the values to perf_analyzer, or whether it has to be done in a Python script, assuming there is such a perf_analyzer API to call?

byshiue commented 1 year ago

You can pass values to perf_analyzer. For more details, you can ask in the tritonserver repo.
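
Concretely, assuming the data.json sketched above, the invocation could look like the following; the -b and --concurrency-range values here are arbitrary examples, with -b controlling the per-request batch size (so a batch of 10 would be -b 10):

perf_analyzer -m fastertransformer -b 1 --input-data data.json --concurrency-range 1:4

As far as I know, perf_analyzer is a command-line tool rather than a Python API; for fully scripted requests, the tritonclient Python package can be used to send the same inputs programmatically.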