microsoft / TransformerCompression

For releasing code related to compression methods for transformers, accompanying our publications
MIT License

Is there any inference demo for sliced model? #116

Closed: zhaoyang-star closed this issue 7 months ago

zhaoyang-star commented 7 months ago

I noticed that loading the sliced model and then calling model.generate() returns incorrect output compared to the dense model. run_benchmark.py gives only limited information about how to run the sliced model, so would it be possible to provide a toy inference demo for the sliced model? Then we could run the dense and sliced models on the same prompt and compare the outputs. Thanks.

Part of the inference code from gpu_utils.py:

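        # Feed the fixed input sequence one token at a time, reusing the KV cache,
        # and time each forward pass with GPU synchronisation before and after.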
        for i in tqdm(range(input_seq_len), desc="Benchmarking"):
            input_ids_i = input_ids[:, i].reshape((batch_size, 1)).to(config.device)
            attention_mask_i = attention_mask[:, : (i + 1)].to(config.device)

            sync_gpus()
            start_time = time.time()
            output = model_adapter.model(input_ids_i, past_key_values=cache["past"], attention_mask=attention_mask_i)
            sync_gpus()
            time_measurements.append(time.time() - start_time)

            cache["past"] = list(output.past_key_values)
            del output

            input_ids_i, attention_mask_i = input_ids_i.to("cpu"), attention_mask_i.to("cpu")
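
In the meantime, a minimal side-by-side demo could look like the sketch below: load both models, run greedy generation on the same prompt, and print both outputs. The slicegpt.hf_utils.load_sliced_model call, its arguments, and the model name / checkpoint path are assumptions based on run_benchmark.py rather than a confirmed API, so they may need adjusting to the installed version.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # slicegpt is the package in this repo; load_sliced_model and its exact
    # arguments are assumptions based on run_benchmark.py and may need adjusting.
    from slicegpt import hf_utils

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
    sliced_path = "path/to/sliced_model"     # placeholder: directory with the sliced checkpoint
    sparsity = 0.25                          # should match the sparsity used when slicing
    device = "cuda" if torch.cuda.is_available() else "cpu"

    prompt = "The capital of France is"

    # Dense baseline: plain Hugging Face load + greedy decoding.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dense_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
    dense_model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        dense_out = dense_model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print("dense :", tokenizer.decode(dense_out[0], skip_special_tokens=True))

    # Sliced model: assumed to be loaded the same way run_benchmark.py loads it.
    sliced_adapter, _ = hf_utils.load_sliced_model(model_name, sliced_path, sparsity=sparsity)
    sliced_model = sliced_adapter.model.to(device)
    sliced_model.eval()

    with torch.no_grad():
        sliced_out = sliced_model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print("sliced:", tokenizer.decode(sliced_out[0], skip_special_tokens=True))

Greedy decoding (do_sample=False) keeps the comparison deterministic, so any divergence between the two outputs comes from slicing rather than sampling noise.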