I noticed that loading a sliced model and then calling `model.generate()` returns wrong output compared to the dense model. From `run_benchmark.py` I could only get limited info about how to run the sliced model. So is it possible to provide a toy inference demo for the sliced model? Then we could run the dense and sliced models on the same prompt and compare the outputs. Thanks.
Part of the inference code from `gpu_utils.py`:
```python
for i in tqdm(range(input_seq_len), desc="Benchmarking"):
    input_ids_i = input_ids[:, i].reshape((batch_size, 1)).to(config.device)
    attention_mask_i = attention_mask[:, : (i + 1)].to(config.device)
    sync_gpus()
    start_time = time.time()
    output = model_adapter.model(input_ids_i, past_key_values=cache["past"], attention_mask=attention_mask_i)
    sync_gpus()
    time_measurements.append(time.time() - start_time)
    cache["past"] = list(output.past_key_values)
    del output
    input_ids_i, attention_mask_i = input_ids_i.to("cpu"), attention_mask_i.to("cpu")
```
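In the meantime, here is a minimal sketch of the comparison harness I have in mind. It does not use any SliceGPT API: the two `generate_*` callables are placeholders (assumptions) that you would replace with real calls like `tokenizer.decode(model.generate(...))` for the dense and sliced models respectively; the harness itself only assumes each callable maps a prompt string to a generated string.

```python
# Hypothetical comparison harness: run dense and sliced models on the same
# prompts and report where their generations diverge. The two callables are
# stand-ins, NOT actual SliceGPT loaders.

def compare_generations(generate_dense, generate_sliced, prompts):
    """Generate with both models on identical prompts; record matches/mismatches."""
    results = []
    for prompt in prompts:
        dense_out = generate_dense(prompt)
        sliced_out = generate_sliced(prompt)
        results.append({
            "prompt": prompt,
            "dense": dense_out,
            "sliced": sliced_out,
            "match": dense_out == sliced_out,
        })
    return results

# Toy stand-ins so the harness runs without any model weights (assumption):
dense = lambda p: p + " -> dense continuation"
sliced = lambda p: p + " -> sliced continuation"

report = compare_generations(dense, sliced, ["Once upon a time"])
for r in report:
    print("match:", r["match"])
    print("dense :", r["dense"])
    print("sliced:", r["sliced"])
```

With real models, use greedy decoding (`do_sample=False`) on both sides so any divergence reflects the slicing rather than sampling noise.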