neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

benchmark server pipeline #1600

Closed · horheynm closed this 4 months ago

horheynm commented 4 months ago

This PR adds a server route that runs the pipeline benchmark.

Note: continuous batching cannot be used together with the timer middleware.

Configs

Server side

Start the server with:

deepsparse.server --config_file config.yaml

config.yaml:
num_cores: 2
num_workers: 2
endpoints:
  - task: text_generation
    model: "hf:mgoin/TinyStories-1M-ds"
    # continuous batching is incompatible with TimerMiddleware (see note above)
    # kwargs: {"continuous_batch_sizes": [2]}
    middlewares:
      - TimerMiddleware
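
For reference, if continuous batching were needed instead of timing, the middleware block would be dropped and the commented kwargs line enabled; a sketch based on the config and note above:

endpoints:
  - task: text_generation
    model: "hf:mgoin/TinyStories-1M-ds"
    kwargs: {"continuous_batch_sizes": [2]}  # enabled; TimerMiddleware removed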

Client side

import requests

# Benchmark route registered by the server for the text_generation-0 endpoint
url = "http://localhost:5543/v2/models/text_generation-0/benchmark"

# Benchmark request: generate 100 tokens from dummy input data
obj = {
    "data_type": "dummy",
    "gen_sequence_length": 100,
    "pipeline_kwargs": {},
    "input_schema_kwargs": {},
}

response = requests.post(url, json=obj)
print(response.json())
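
For a rough end-to-end check, the round trip can also be timed client-side. A minimal sketch; the timing logic here is illustrative only, not part of this PR:

import time

import requests

url = "http://localhost:5543/v2/models/text_generation-0/benchmark"
obj = {
    "data_type": "dummy",
    "gen_sequence_length": 100,
    "pipeline_kwargs": {},
    "input_schema_kwargs": {},
}

# Wall-clock time for the full benchmark round trip
start = time.perf_counter()
response = requests.post(url, json=obj)
elapsed = time.perf_counter() - start
response.raise_for_status()
print(f"benchmark round trip: {elapsed:.3f}s")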

Outputs

Server

(.venv) george@gpuserver6:~/deepsparse$ deepsparse.server --config_file config.yaml
/home/george/deepsparse/.venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
2024-02-12 21:45:33 deepsparse.server.server INFO     Using config: ServerConfig(num_cores=2, num_workers=2, integration=None, engine_thread_pinning='core', pytorch_num_threads=1, endpoints=[EndpointConfig(name='text_generation-0', route=None, task='text_generation', model='hf:mgoin/TinyStories-1M-ds', batch_size=1, logging_config=PipelineSystemLoggingConfig(enable=True, inference_details=SystemLoggingGroup(enable=False, target_loggers=[]), prediction_latency=SystemLoggingGroup(enable=True, target_loggers=[])), data_logging=None, bucketing=None, middlewares=['TimerMiddleware'], kwargs={})], loggers={}, system_logging=ServerSystemLoggingConfig(enable=True, 
...
'/docs/oauth2-redirect', '/redoc', '/', '/config', '/v2/health/live', '/v2/health/ready', '/v2', '/endpoints', '/endpoints', '/v2/models/text_generation-0/infer', '/v2/models/text_generation-0/benchmark', '/v2/models/text_generation-0', '/v2/models/text_generation-0/ready']
INFO:     Started server process [3930990]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5543 (Press CTRL+C to quit)
2024-02-12 21:45:36 deepsparse.benchmark.helpers INFO     Thread pinning to cores enabled
INFO:     127.0.0.1:48914 - "POST /v2/models/text_generation-0/benchmark HTTP/1.1" 200 OK

Client

(.venv) george@gpuserver6:~/deepsparse$ python3 -m scratch.server
...
'PrepareGeneration': [0.0017719268798828125], 'GenerateNewTokenOperator': [7.486343383789062e-05, 7
'CompileGeneratedTokens': [1.5974044799804688e-05, 1.4781951904296875e-05, 1.358 ...
...

(.venv) george@gpuserver6:~/deepsparse$ 
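The output shows operator names mapped to lists of per-call durations, apparently in seconds. Continuing the client script above, a minimal sketch for summarizing them (assuming a flat dict of operator name to duration list; the actual response schema may nest these differently):

import statistics

# Summarize per-operator timings from the benchmark response above.
# Assumes a flat dict of operator name -> list of durations in seconds;
# the actual response may nest these differently.
timings = response.json()
for op, durations in sorted(timings.items()):
    if isinstance(durations, list) and durations:
        mean_ms = statistics.mean(durations) * 1e3
        print(f"{op}: mean {mean_ms:.3f} ms over {len(durations)} call(s)")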
horheynm commented 4 months ago

Great job!

  • Let's test with an OpenAI example to make sure the integration works
  • Add tests, specifically around benchmark_pipeline and the server integration
  • Refactor the benchmark_pipeline function to avoid repeated code

I addressed the tests and the refactor, but not the OpenAI testing; I talked to Ben, and we don't need it for now.