TaoZQY opened this issue 1 month ago
I am still a little confused about the lm_head OPs computation. The lm_head input is (1, hidden_size) and the output is (1, vocab_size), so the OPs should be `args['batchsize'] * hiddensize * vocab_size * 2` (a matrix multiplication counts one multiply and one add per output element). But the code is:
```python
def post_process(model_params, args):
    hiddensize = get_hidden_size(model_params)
    vocab_size = get_vocab_size(model_params)
    layers = []
    for stage in ["prefill", "decode"]:
        layers.append({
            'name': 'lm_head',
            'stage': stage,
            'OPs': args['batchsize'] * hiddensize * vocab_size * 1,
            'load_weight': hiddensize * vocab_size * args['w_byte'],
            'load_act': hiddensize * args['a_byte'],
            'store_act': vocab_size * args['a_byte'],
        })
    return layers
```
Overall, shouldn't the amount of computation of lm_head be:
```python
def post_process(model_params, args):
    hiddensize = get_hidden_size(model_params)
    vocab_size = get_vocab_size(model_params)
    layers = []
    for stage in ["prefill", "decode"]:
        layers.append({
            'name': 'lm_head',
            'stage': stage,
            'OPs': args['batchsize'] * hiddensize * vocab_size * 2,  # matrix multiplication
            'load_weight': hiddensize * vocab_size * args['w_byte'],
            'load_act': hiddensize * args['a_byte'],
            'store_act': vocab_size * args['a_byte'],
        })
    return layers
```
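For reference, a minimal sketch of the counting argument behind the factor of 2, using hypothetical sizes (the names below are illustrative, not the repository's):

```python
# Illustrative OP count for the lm_head GEMV (hypothetical sizes).
batchsize = 1
hidden_size = 4096    # e.g. a LLaMA-7B-like hidden size
vocab_size = 32000    # e.g. a LLaMA-like vocabulary

# Each of the vocab_size outputs needs hidden_size multiplies and
# about hidden_size adds, i.e. ~2 * hidden_size OPs per output element.
ops = batchsize * vocab_size * 2 * hidden_size
print(f"lm_head OPs ≈ {ops:.2e}")  # ≈ 2.62e+08 for these sizes
```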
In addition, the current memory-consumption code is:

```python
weight_kv_footprint = total_results["prefill"]["load_weight"] + total_results["prefill"]["store_kv_cache"]

decode_tmp_act = 0
for layer_name, result in self.results["decode"].items():
    decode_tmp_act += result["store_act"]  # activation is discarded after one layer
total_results["decode"]["memory_consumption"] = decode_tmp_act + weight_kv_footprint
total_results["decode"]["memory_consumption_tmp_act"] = decode_tmp_act
total_results["decode"]["memory_consumption_weight"] = total_results["prefill"]["load_weight"]
total_results["decode"]["memory_consumption_kv_cache"] = total_results["prefill"]["store_kv_cache"]

prefill_tmp_act = 0
for layer_name, result in self.results["prefill"].items():
    prefill_tmp_act += result["store_act"]
total_results["prefill"]["memory_consumption"] = prefill_tmp_act + weight_kv_footprint
total_results["prefill"]["memory_consumption_tmp_act"] = prefill_tmp_act
total_results["prefill"]["memory_consumption_weight"] = total_results["prefill"]["load_weight"]
total_results["prefill"]["memory_consumption_kv_cache"] = total_results["prefill"]["store_kv_cache"]

# lm_head
name = "lm_head"
args = {"batchsize": batchsize, "a_byte": a_byte, "w_byte": w_byte}
for layer_info in self.config.post_process(self.model_params, args):
    self._analyze_to_results(**layer_info)
    for data_name in ALL_DATA_NAMES:
        total_results[layer_info["stage"]][data_name] += self.results[layer_info["stage"]][layer_info["name"]][data_name]
```
The lm_head is included in both the prefill and decode stages, but the order of this code is not correct: the lm_head results should be added to total_results first, and only then should the memory consumption for prefill and decode be derived from total_results.
Also, the decode-stage KV-cache memory consumption should include both the KV cache stored during prefill and the KV cache stored by decode itself (even though the latter is small).
Overall, the correct code should be:
name = "lm_head"
args = {"batchsize": batchsize, "a_byte": a_byte, "w_byte": w_byte}
for layer_info in self.config.post_process(self.model_params, args):
self._analyze_to_results(**layer_info)
for data_name in ALL_DATA_NAMES:
total_results[layer_info["stage"]][data_name] += self.results[layer_info["stage"]][layer_info["name"]][
data_name
]
weight_kv_footprint = total_results["prefill"]["load_weight"] + total_results["prefill"]["store_kv_cache"]+ total_results["decode"]["store_kv_cache"]
decode_tmp_act = 0
for layer_name, result in self.results["decode"].items():
decode_tmp_act += result["store_act"] # activation is discarded after one layer
total_results["decode"]["memory_consumption"] = decode_tmp_act + weight_kv_footprint
total_results["decode"]["memory_consumption_tmp_act"] = decode_tmp_act
total_results["decode"]["memory_consumption_weight"] = total_results["prefill"]["load_weight"]
total_results["decode"]["memory_consumption_kv_cache"] = total_results["prefill"]["store_kv_cache"]+ total_results["decode"]["store_kv_cache"]
prefill_tmp_act = 0
for layer_name, result in self.results["prefill"].items():
prefill_tmp_act += result["store_act"]
total_results["prefill"]["memory_consumption"] = prefill_tmp_act + weight_kv_footprint
total_results["prefill"]["memory_consumption_tmp_act"] = prefill_tmp_act
total_results["prefill"]["memory_consumption_weight"] = total_results["prefill"]["load_weight"]
total_results["prefill"]["memory_consumption_kv_cache"] = total_results["prefill"]["store_kv_cache"]
There is also a question about the FlashAttention block-size computation. The code is:

```python
block_size_r = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)
```

In the FlashAttention-2 paper, the row block size is B_r = min(⌈M / (4d)⌉, d). Is there a problem here?
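To make the comparison concrete, a minimal side-by-side sketch, assuming onchip_buffer is measured in bytes and M in the paper is counted in elements (values and names are illustrative):

```python
import math

# Hypothetical values, for illustration only.
onchip_buffer = 64 * 1024   # bytes of on-chip buffer
head_size = 128             # d
kv_byte = 2                 # fp16

# Current code in the repository:
block_size_r_code = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)   # -> 128

# FlashAttention-2 paper: B_r = min(ceil(M / (4 * d)), d), with M in elements.
M = onchip_buffer / kv_byte
block_size_r_paper = min(math.ceil(M / (4 * head_size)), head_size)                    # -> 64
```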
For the decode stage, the attention output per head has shape (1, d), but the code is:

```python
o_numel = 1 * seqlen * batchsize * num_attention_heads * a_byte
```

Is there a problem here? The correct code should perhaps be:

```python
o_numel = 1 * head_size * batchsize * num_attention_heads * a_byte
```
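A minimal sketch of the dimension argument, with illustrative values: in decode, the attention output per head is a single (1, head_size) row, so its size scales with head_size rather than seqlen.

```python
# Illustrative decode-stage attention-output size (hypothetical values).
batchsize = 1
num_attention_heads = 32
head_size = 128
seqlen = 2048   # context length; relevant for K/V reads, not for the output row
a_byte = 2      # fp16 activations

# Current code: scales with seqlen even though decode emits one token.
o_numel_current = 1 * seqlen * batchsize * num_attention_heads * a_byte

# Suggested: one (1, head_size) row per head.
o_numel_proposed = 1 * head_size * batchsize * num_attention_heads * a_byte
```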
Thank you for your thorough code review and detailed analysis. You've identified several important calculation adjustments that need to be made:

- the MLP activation layer computation,
- the LM head computation,
- the memory consumption calculation,
- the FlashAttention-2 block size parameters,
- the output tensor dimensions for the decode stage.

We'll update these calculation methods in the code and make sure the calculations reflect the correct logic and dimensions as you've outlined. I hadn't realized there were so many issues in this code; it's clear that this project needs a thorough revision and update. If you're interested, I'd love to discuss potential next steps.
Dear author, I'm truly grateful that you looked at my question so promptly. Your professionalism and sense of responsibility have deeply impressed me, and I sincerely hope to learn more from you in the future. Next, I plan to modify the code and submit a pull request for you to review. I look forward to the opportunity to cooperate with you and your team and make progress together. Thank you again!
The amount of computation for mlp_act in the latest code is:
I have the following questions:
The input to the LLama activation layer has dimension intermediate_size // tp_size (the projections are "gate_proj": [hidden_size, intermediate_size // tp_size] and "up_proj": [hidden_size, intermediate_size // tp_size]), so shouldn't the activation OPs be computed over intermediate_size // tp_size elements?
SiLU is computed as SiLU(x) = x · sigmoid(x).
Overall, shouldn't the amount of computation of mlp_act be adjusted accordingly?
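Purely as an illustrative sketch of one possible way to count it (not the original comment's proposal): assuming SiLU(x) = x · sigmoid(x) is applied element-wise over intermediate_size // tp_size elements, and assuming a small constant OP cost per sigmoid, the count might look like:

```python
# Illustrative sketch only, not the original proposal: element-wise SiLU OP count.
def mlp_act_ops(batchsize, seqlen, intermediate_size, tp_size, ops_per_sigmoid=4):
    """SiLU(x) = x * sigmoid(x): one multiply plus one sigmoid per element."""
    elements = batchsize * seqlen * (intermediate_size // tp_size)
    return elements * (1 + ops_per_sigmoid)  # ops_per_sigmoid is an assumed constant
```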