Summary

Update benchmark_pipeline so that it works with both v1 and v2 pipelines. v1 uses a different timer than v2, and the changes in this PR support both. v2 also requires middleware to be passed to the pipeline constructor.
Clean up the middleware code. Many keys were being passed around as string literals; to give them one source of truth, the middleware/timer files now define them as constants.
Expose input_schema and output_schema at the pipeline level.
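
To illustrate the "one source of truth" idea behind the middleware clean-up, here is a minimal sketch. The constant names (`TIMER_KEY`, `MEASUREMENTS_KEY`) and the helper functions are hypothetical and chosen for illustration only; they are not the identifiers used in the deepsparse middleware/timer files.

```python
from typing import Any, Dict, List

# Hypothetical constants (NOT the actual deepsparse names): every reader and
# writer imports these instead of repeating string literals.
TIMER_KEY = "timer"
MEASUREMENTS_KEY = "measurements"


def record_measurement(state: Dict[str, Any], name: str, seconds: float) -> None:
    """Append one timing to the shared state, addressed only via the constants."""
    timer_state = state.setdefault(TIMER_KEY, {})
    timer_state.setdefault(MEASUREMENTS_KEY, {}).setdefault(name, []).append(seconds)


def read_measurements(state: Dict[str, Any]) -> Dict[str, List[float]]:
    """Read timings back through the same constants the writer used."""
    return state.get(TIMER_KEY, {}).get(MEASUREMENTS_KEY, {})


state: Dict[str, Any] = {}
record_measurement(state, "engine_forward", 0.005)
record_measurement(state, "engine_forward", 0.006)
print(read_measurements(state))
```

Because both functions go through the same constants, renaming a key is a one-line change and a typo in a key name fails at import time rather than silently creating a new dictionary entry.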

Testing

The following benchmark pipeline commands run successfully, covering both v1 and v2:

For v2 (text_generation):

deepsparse.benchmark_pipeline text_generation "hf:mgoin/TinyStories-1M-ds"

Output:


2024-01-08 22:01:22 deepsparse.benchmark.helpers WARNING  No input configuration file provided, using default.
2024-01-08 22:01:22 deepsparse.benchmark.benchmark_pipeline INFO     Original Model Path: hf:mgoin/TinyStories-1M-ds
2024-01-08 22:01:22 deepsparse.benchmark.benchmark_pipeline INFO     Task: text_generation
2024-01-08 22:01:22 deepsparse.benchmark.benchmark_pipeline INFO     Batch Size: 1
2024-01-08 22:01:22 deepsparse.benchmark.benchmark_pipeline INFO     Scenario: sync
2024-01-08 22:01:22 deepsparse.benchmark.benchmark_pipeline INFO     Requested Run Time(sec): 10
2024-01-08 22:01:22 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
Fetching 11 files: 100%|████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 50149.29it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240108 COMMUNITY | (9cc30393) (release) (optimized) (system=avx2, binary=avx2)
Original Model Path: hf:mgoin/TinyStories-1M-ds
Batch Size: 1
Scenario: sync
Iterations: 31
Total Runtime: 10.2388
Throughput (items/sec): 3.0277
Processing Time Breakdown: 
     ParseTextGenerationInputs: 0.00%
     ProcessInputsTextGeneration: 0.14%
     PrepareforPrefill: 0.02%
     MultiEnginePrefill: 0.54%
     NLEngineOperator: 85.58%
     CompilePromptLogits: 0.03%
     PrepareGeneration: 3.96%
     AutoRegressiveOperatorPreprocess: 0.87%
     GenerateNewTokenOperator: 1.11%
     CompileGeneratedTokens: 0.15%
     CompileGenerations: 0.42%
     JoinOutput: 1.64%
     ProcessOutputs: 0.06%
     total_inference: 99.99%
Mean Latency Breakdown (ms/batch): 
     ParseTextGenerationInputs: 0.0085
     ProcessInputsTextGeneration: 0.4521
     PrepareforPrefill: 0.0555
     MultiEnginePrefill: 0.0851
     NLEngineOperator: 2.3954
     CompilePromptLogits: 0.0048
     PrepareGeneration: 13.0736
     AutoRegressiveOperatorPreprocess: 0.0298
     GenerateNewTokenOperator: 0.0378
     CompileGeneratedTokens: 0.0051
     CompileGenerations: 1.3707
     JoinOutput: 5.4254
     ProcessOutputs: 0.1875
     total_inference: 330.2510

For v1 (text_classification):


deepsparse.benchmark_pipeline text_classification zoo:nlp/sentiment_analysis/distilbert-none/pytorch/huggingface/sst2/pruned90-none

Output:


2024-01-08 22:11:56 deepsparse.benchmark.helpers WARNING  No input configuration file provided, using default.
2024-01-08 22:11:56 deepsparse.benchmark.benchmark_pipeline INFO     Original Model Path: zoo:nlp/sentiment_analysis/distilbert-none/pytorch/huggingface/sst2/pruned90-none
2024-01-08 22:11:56 deepsparse.benchmark.benchmark_pipeline INFO     Task: text_classification
2024-01-08 22:11:56 deepsparse.benchmark.benchmark_pipeline INFO     Batch Size: 1
2024-01-08 22:11:56 deepsparse.benchmark.benchmark_pipeline INFO     Scenario: sync
2024-01-08 22:11:56 deepsparse.benchmark.benchmark_pipeline INFO     Requested Run Time(sec): 10
2024-01-08 22:11:56 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240108 COMMUNITY | (9cc30393) (release) (optimized) (system=avx2, binary=avx2)

Original Model Path: zoo:nlp/sentiment_analysis/distilbert-none/pytorch/huggingface/sst2/pruned90-none
Batch Size: 1
Scenario: sync
Iterations: 1741
Total Runtime: 10.0039
Throughput (items/sec): 174.0316
Processing Time Breakdown: 
     engine_forward: 88.19%
     total_inference: 99.57%
     post_process: 0.88%
     pre_process: 10.33%
Mean Latency Breakdown (ms/batch): 
     engine_forward: 5.0676
     total_inference: 5.7216
     post_process: 0.0506
     pre_process: 0.5937
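
As a quick sanity check on the two outputs above, the reported throughput can be compared against iterations divided by total runtime. This sketch assumes that relationship holds at batch size 1 (it is inferred from the printed numbers, not taken from the benchmark code), and the function name is hypothetical.

```python
def throughput(iterations: int, total_runtime_sec: float) -> float:
    """Items per second, assuming one item per iteration (batch size 1)."""
    return iterations / total_runtime_sec


# (iterations, total runtime in seconds, reported throughput) copied from the logs above.
runs = {
    "text_generation": (31, 10.2388, 3.0277),
    "text_classification": (1741, 10.0039, 174.0316),
}

for task, (iterations, runtime, reported) in runs.items():
    computed = throughput(iterations, runtime)
    # Small differences are expected because the logged runtime is rounded.
    print(f"{task}: computed={computed:.4f} reported={reported:.4f}")
```

Both computed values agree with the reported ones to within rounding of the logged runtime.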

neuralmagic / deepsparse

[timer][benchmark_pipeline] Update the `benchmark_pipeline` to work with v1 and v2; clean-up middleware/timer_middleware #1519
