The sliding window size is 2047 for the microsoft/phi-3-mini-4k-instruct variant of Phi-3, so the following operations are executed:
    if self.sliding_window is not None:
        diagonal = past_key_values_length - self.sliding_window - 1
        context_mask = torch.tril(torch.ones_like(mask, dtype=torch.bool), diagonal=diagonal)
        mask.masked_fill_(context_mask, torch.finfo(dtype).min)
In our case, past_key_values_length = 0, so diagonal = 0 - 2047 - 1 = -2048. Because the diagonal value was not passed as a parameter to the tril function, it used the default diagonal (k=0), which resulted in a PCC drop.
The tril operation in TVM did not handle cases where the diagonal value != 0.
So we updated the tril function's parameters to include the diagonal value; if the diagonal is an expression, our changes extract its value and pass it to np.tril().
The input to np.tril here is of bool type (torch.ones_like(mask, dtype=torch.bool)), so we modified _convert_tvm_to_np_dtype to handle the bool data type.
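
To illustrate why the missing diagonal matters, here is a minimal NumPy sketch. It is not the TVM code from this change: the helper name sliding_window_context_mask and the small sizes are assumptions chosen for illustration. It compares np.tril with the computed diagonal against the default k=0 on a bool input.

```python
import numpy as np


def sliding_window_context_mask(seq_len, sliding_window, past_key_values_length=0):
    """Hypothetical helper mirroring the mask logic quoted above, in NumPy."""
    # Same formula as the model code: a negative diagonal shifts the kept
    # triangle below the main diagonal.
    diagonal = past_key_values_length - sliding_window - 1
    ones = np.ones((seq_len, seq_len), dtype=bool)
    # np.tril keeps element (i, j) only where j <= i + k; the rest becomes False.
    return np.tril(ones, k=diagonal), diagonal


# Small example: window of 2 over 6 positions gives diagonal = -3.
correct, diagonal = sliding_window_context_mask(seq_len=6, sliding_window=2)

# What happens when the diagonal is dropped and the default k=0 is used.
wrong = np.tril(np.ones((6, 6), dtype=bool))

# With the Phi-3 value (sliding_window=2047, diagonal=-2048) and a sequence
# shorter than the window, no position should be masked at all.
phi3_mask, phi3_diagonal = sliding_window_context_mask(seq_len=16, sliding_window=2047)

print(diagonal, int(correct.sum()), int(wrong.sum()))
print(phi3_diagonal, bool(phi3_mask.any()))
```

With the default k=0 the whole lower triangle, including the main diagonal, is marked for masking, whereas the intended mask for a short sequence is empty. That mismatch is consistent with the PCC drop described above. The result of np.tril also keeps the bool dtype of its input, which is why the dtype conversion has to accept bool.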