neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Use Int KV Cache as default for deepsparse.benchmark #1512

Closed horheynm closed 6 months ago

horheynm commented 6 months ago

Description

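Tested the two entrypoints for `deepsparse.benchmark`: the Python function and the CLI. One of them used the internal KV cache and the other used the external KV cache. The goal is for both entrypoints to always use the internal KV cache by default.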

  1. Python
    from deepsparse.benchmark.benchmark_model import benchmark_model

    stub = "zoo:mistral-7b-gsm8k_mistral_pretrain-pruned80"
    results = benchmark_model(stub)
    print(results)

  2. CLI
    ```bash
    deepsparse.benchmark "zoo:mistral-7b-gsm8k_mistral_pretrain-pruned80"
    ```

Results

  1. Python
    {
    "engine":"deepsparse.engine.Engine:\n\tonnx_file_path: /home/george/.cache/sparsezoo/neuralmagic/mistral-7b-gsm8k_mistral_pretrain-pruned80/deployment/model.onnx\n\tbatch_size: 1\n\tnum_cores: 32\n\tnum_streams: 1\n\tscheduler: Scheduler.default\n\tfraction_of_supported_ops: 1.0\n\tcpu_avx_type: avx2\n\tcpu_vnni: False",
    "version":"1.7.0.20240104",
    "orig_model_path":"zoo:mistral-7b-gsm8k_mistral_pretrain-pruned80",
    "model_path":"/home/george/.cache/sparsezoo/neuralmagic/mistral-7b-gsm8k_mistral_pretrain-pruned80/deployment/model.onnx",
    "batch_size":1,
    "input_shapes":"None",
    "num_cores":32,
    "scenario":"singlestream",
    "scheduler":"Scheduler.default",
    "seconds_to_run":10,
    "num_streams":1,
    "benchmark_result":{
      "scenario":"singlestream",
      "items_per_sec":0.7033625366578768,
      "seconds_ran":15.6391610680148,
      "iterations":11,
      "median":576.397096272558,
      "mean":1421.7223456044767,
      "std":2497.9850669397615,
      "25.0%":493.93453216180205,
      "50.0%":576.397096272558,
      "75.0%":676.7404270358384,
      "90.0%":1781.627886928618,
      "95.0%":5504.294607555494,
      "99.0%":8482.427984056996,
      "99.9%":9152.507993769847
    },
    "fraction_of_supported_ops":1.0,
    "sequence_length":2048,
    "input_ids_length":1
    }
  2. CLI
    {
    "engine":"deepsparse.engine.Engine:\n\tonnx_file_path: /home/george/.cache/sparsezoo/neuralmagic/mistral-7b-gsm8k_mistral_pretrain-pruned80/deployment/model.onnx\n\tbatch_size: 1\n\tnum_cores: 32\n\tnum_streams: 1\n\tscheduler: Scheduler.default\n\tfraction_of_supported_ops: 1.0\n\tcpu_avx_type: avx2\n\tcpu_vnni: False",
    "version":"1.7.0.20240104",
    "orig_model_path":"zoo:mistral-7b-gsm8k_mistral_pretrain-pruned80",
    "model_path":"/home/george/.cache/sparsezoo/neuralmagic/mistral-7b-gsm8k_mistral_pretrain-pruned80/deployment/model.onnx",
    "batch_size":1,
    "input_shapes":null,
    "num_cores":32,
    "scenario":"singlestream",
    "scheduler":"Scheduler.default",
    "seconds_to_run":10,
    "num_streams":1,
    "benchmark_result":{
      "scenario":"singlestream",
      "items_per_sec":1.1406279406751316,
      "seconds_ran":19.287621506955475,
      "iterations":22,
      "median":353.46097755245864,
      "mean":876.6876120670614,
      "std":2260.886819025657,
      "25.0%":286.9633190566674,
      "50.0%":353.46097755245864,
      "75.0%":506.17970793973655,
      "90.0%":652.1910438779744,
      "95.0%":800.4174952395259,
      "99.0%":9027.320535853496,
      "99.9%":10993.794929189637
    },
    "fraction_of_supported_ops":1.0,
    "sequence_length":2048,
    "input_ids_length":1
    }
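
For a quick side-by-side reading of the two runs, the headline numbers can be compared directly. This is only a rough sketch: the values are copied by hand from the two `benchmark_result` blocks above, the helper `relative_change` is a hypothetical name introduced here and is not part of deepsparse, and the latency fields are assumed to be in milliseconds.

```python
# Values copied from the two "benchmark_result" blocks in this description.
python_result = {
    "items_per_sec": 0.7033625366578768,
    "median": 576.397096272558,  # assumed to be milliseconds
    "iterations": 11,
}
cli_result = {
    "items_per_sec": 1.1406279406751316,
    "median": 353.46097755245864,  # assumed to be milliseconds
    "iterations": 22,
}


def relative_change(baseline, candidate):
    """Hypothetical helper: candidate / baseline for every shared metric."""
    return {
        key: candidate[key] / baseline[key]
        for key in baseline
        if key in candidate
    }


ratios = relative_change(python_result, cli_result)
for metric, ratio in ratios.items():
    print(f"{metric}: CLI / Python = {ratio:.2f}")
```

Both runs completed few iterations (11 and 22) and report a standard deviation larger than the mean, so these ratios are best read as indicative rather than precise.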