triton-inference-server / fil_backend

FIL backend for the Triton Inference Server

fil_backend's performance is much worse on CPU than onnx backend for sklearn #168

Closed DayDayupupupup closed 2 years ago

DayDayupupupup commented 2 years ago

Brief description

I have a scikit-learn model and deployed it on CPU using the FIL and ONNX backends respectively. When comparing inference performance with the perf_analyzer tool, FIL's performance was significantly worse than ONNX's at batch size 256.

Environment

1. Train a scikit-learn model

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('Breast_cancer_data.csv')
X = df[['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# define the model
model = RandomForestClassifier()

# fit the model on the training split
model.fit(X_train, y_train)
ypred = model.predict(X_test)
print("Predicted Class:", ypred)
y_proba = model.predict_proba(X_test)
print("Predicted proba:", y_proba)

# pickle the trained model
with open("model.pkl", 'wb') as model_file:
    pickle.dump(model, model_file)
```

2. Convert to a treelite checkpoint

```bash
python -m treelite.serialize --input-model model.pkl --input-model-type sklearn_pkl --output-checkpoint checkpoint.tl
```
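As an alternative to the CLI invocation above, the same conversion can be done through treelite's Python API; a minimal sketch, assuming a treelite release (2.x/3.x) where `treelite.sklearn.import_model` and `Model.serialize` are available:

```python
# Sketch: build a Treelite model from the pickled RandomForestClassifier and
# write the checkpoint file consumed by the FIL backend.
import pickle

import treelite

with open("model.pkl", "rb") as f:
    skl_model = pickle.load(f)

tl_model = treelite.sklearn.import_model(skl_model)
tl_model.serialize("checkpoint.tl")
```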
3. Convert to ONNX

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import pickle
import numpy as np
import onnx

f = open("model.pkl", 'rb')
model = pickle.load(f)
model.ir_version = 7
f.close()
onnx_model_path = 'model.onnx'

initial_type = [('input', FloatTensorType([None, 5]))]
onx = convert_sklearn(model, initial_types=initial_type, verbose=2, target_opset=12, options={'zipmap': False})
output_map = {output.name: output for output in onx.graph.output}

# delete the label output
onx.graph.output.remove(output_map['label'])

with open(onnx_model_path, "wb") as f:
    f.write(onx.SerializeToString())
```
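As an optional sanity check (not part of the reproducer above), the exported ONNX model can be run locally and compared against scikit-learn; a rough sketch, assuming `onnxruntime` is installed and that `model` and `X_test` from the earlier snippets are still in scope:

```python
# Sketch: verify the exported ONNX model against sklearn's predict_proba.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
sample = X_test.to_numpy()[:4].astype(np.float32)  # a few rows of the 5 features

# 'input' and 'probabilities' match the names used in the conversion and config
(onnx_proba,) = sess.run(["probabilities"], {"input": sample})
print("ONNX proba:   ", onnx_proba)
print("sklearn proba:", model.predict_proba(sample))
```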

4. Model configuration

Model repository layout:

```
├── sklearn_fil
│   ├── 1
│   │   └── checkpoint.tl
│   └── config.pbtxt
└── sklearn_onnx
    ├── 20220112161732
    │   └── model.onnx
    └── config.pbtxt
```

config.pbtxt for sklearn_fil:

```
name: "sklearn_fil"
backend: "fil"
max_batch_size: 2048
input {
  name: "input__0"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "output__0"
  data_type: TYPE_FP32
  dims: 2
}
instance_group {
  count: 1
  kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 256
  preferred_batch_size: 512
  preferred_batch_size: 1024
  max_queue_delay_microseconds: 100
}
model_warmup {
  name: "warmup_data"
  batch_size: 1
  inputs {
    key: "input__0"
    value {
      data_type: TYPE_FP32
      dims: 5
      zero_data: true
    }
  }
}
parameters { key: "model_type" value: { string_value: "treelite_checkpoint" } }
parameters { key: "predict_proba" value: { string_value: "true" } }
parameters { key: "output_class" value: { string_value: "true" } }
parameters { key: "threshold" value: { string_value: "0.5" } }
parameters { key: "algo" value: { string_value: "ALGO_AUTO" } }
parameters { key: "storage_type" value: { string_value: "AUTO" } }
parameters { key: "blocks_per_sm" value: { string_value: "0" } }
parameters { key: "threads_per_tree" value: { string_value: "1" } }
parameters { key: "transfer_threshold" value: { string_value: "0" } }
name: "sklearn_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 2048
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "probabilities"
  data_type: TYPE_FP32
  dims: 2
}
instance_group {
  count: 1
  kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 256
  preferred_batch_size: 512
  preferred_batch_size: 1024
  max_queue_delay_microseconds: 100
}
model_warmup {
  name: "warmup_data"
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: 5
      zero_data: true
    }
  }
}
parameters { key: "intra_op_thread_count" value: { string_value: "8" } }
parameters { key: "cudnn_conv_algo_search" value: { string_value: "1" } }
  1. Deploy the Triton service
    CUDA_VISIBLE_DEVICES=-1 tritonserver --model-repository=/models/perf_test/ --strict-model-config=false --log-verbose=0 --metrics-port=6000
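For reference, a minimal sketch of sending one batched request to the deployed `sklearn_fil` model with the Triton Python gRPC client (assuming `tritonclient[grpc]` is installed; tensor names match the config above):

```python
# Sketch: query the running server with a random batch of 256 rows x 5 features.
import numpy as np
import tritonclient.grpc as triton_grpc

client = triton_grpc.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(256, 5).astype(np.float32)
infer_input = triton_grpc.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
requested = triton_grpc.InferRequestedOutput("output__0")

result = client.infer("sklearn_fil", inputs=[infer_input], outputs=[requested])
print(result.as_numpy("output__0").shape)  # expected (256, 2): per-class probabilities
```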
6. Performance analysis

batch=1

```
./perf_analyzer -a -b 1 -u localhost:8001 -i gRPC -m sklearn_fil --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 19235
    Throughput: 3847 infer/sec
    Avg latency: 253 usec (standard deviation 489 usec)
    p50 latency: 249 usec
    p90 latency: 256 usec
    p95 latency: 264 usec
    p99 latency: 283 usec
    Avg gRPC time: 250 usec ((un)marshal request/response 0 usec + response wait 250 usec)
  Server:
    Inference count: 23145
    Execution count: 23145
    Successful request count: 23145
    Avg request latency: 204 usec (overhead 1 usec + queue 161 usec + compute input 1 usec + compute infer 20 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3847 infer/sec, latency 253 usec
```

```
./perf_analyzer -a -b 1 -u localhost:8001 -i gRPC -m sklearn_onnx --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 15304
    Throughput: 3060.8 infer/sec
    Avg latency: 316 usec (standard deviation 456 usec)
    p50 latency: 307 usec
    p90 latency: 317 usec
    p95 latency: 321 usec
    p99 latency: 369 usec
    Avg gRPC time: 312 usec ((un)marshal request/response 1 usec + response wait 311 usec)
  Server:
    Inference count: 18445
    Execution count: 18445
    Successful request count: 18445
    Avg request latency: 332 usec (overhead 139 usec + queue 168 usec + compute input 5 usec + compute infer 16 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3060.8 infer/sec, latency 316 usec
```

batch=256

```
./perf_analyzer -a -b 256 -u localhost:8001 -i gRPC -m sklearn_fil --concurrency-range 1
*** Measurement Settings ***
  Batch size: 256
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 1108
    Throughput: 56729.6 infer/sec
    Avg latency: 4504 usec (standard deviation 1101 usec)
    p50 latency: 4417 usec
    p90 latency: 4490 usec
    p95 latency: 4688 usec
    p99 latency: 5711 usec
    Avg gRPC time: 4492 usec ((un)marshal request/response 6 usec + response wait 4486 usec)
  Server:
    Inference count: 340992
    Execution count: 1332
    Successful request count: 1332
    Avg request latency: 4390 usec (overhead 2 usec + queue 20 usec + compute input 1 usec + compute infer 4339 usec + compute output 28 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 56729.6 infer/sec, latency 4504 usec
```

```
./perf_analyzer -a -b 256 -u localhost:8001 -i gRPC -m sklearn_onnx --concurrency-range 1
*** Measurement Settings ***
  Batch size: 256
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 17011
    Throughput: 870963 infer/sec
    Avg latency: 286 usec (standard deviation 120 usec)
    p50 latency: 277 usec
    p90 latency: 311 usec
    p95 latency: 322 usec
    p99 latency: 382 usec
    Avg gRPC time: 283 usec ((un)marshal request/response 5 usec + response wait 278 usec)
  Server:
    Inference count: 5169408
    Execution count: 20193
    Successful request count: 20193
    Avg request latency: 266 usec (overhead 96 usec + queue 19 usec + compute input 5 usec + compute infer 142 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 870963 infer/sec, latency 286 usec
```

Current Result

At batch=1, FIL's performance is slightly better than ONNX's.

But at batch=256, FIL's performance is much worse than ONNX's.

Expected Result

  1. Why does this happen? Did I make a mistake?
  2. Is FIL better suited to GPU-based machine learning, such as cuML models?
wphicks commented 2 years ago

Thanks for submitting this, @DayDayupupupup! Until recently, we have focused on maximizing the FIL backend's performance on GPUs, with CPU-only mode provided primarily for prototyping and quick development tests. We are going to be pushing CPU performance more in the near future by optimizing the upstream library used for CPU inference. Work has already begun there, but I do not have an immediate estimate on when that update is likely to come through.

I just reran your exact scenario but switched KIND_CPU to KIND_GPU for both ONNX and FIL. It is worth noting that FIL offers the greatest benefit for large batch sizes and complex models, but even with the relatively simple model we're using here, we see some improvement in both throughput and latency over ONNX (Quadro RTX 8000):

FIL Results

```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 10011
    Throughput: 512563 infer/sec
    Avg latency: 478 usec (standard deviation 115 usec)
    p50 latency: 456 usec
    p90 latency: 601 usec
    p95 latency: 637 usec
    p99 latency: 739 usec
    Avg gRPC time: 478 usec ((un)marshal request/response 10 usec + response wait 468 usec)
  Server:
    Inference count: 3036672
    Execution count: 11862
    Successful request count: 11862
    Avg request latency: 245 usec (overhead 2 usec + queue 58 usec + compute input 34 usec + compute infer 51 usec + compute output 100 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 512563 infer/sec, latency 478 usec
```

ONNX Results

```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 7993
    Throughput: 409242 infer/sec
    Avg latency: 614 usec (standard deviation 1404 usec)
    p50 latency: 424 usec
    p90 latency: 773 usec
    p95 latency: 780 usec
    p99 latency: 858 usec
    Avg gRPC time: 620 usec ((un)marshal request/response 7 usec + response wait 613 usec)
  Server:
    Inference count: 2423040
    Execution count: 9465
    Successful request count: 9465
    Avg request latency: 531 usec (overhead 130 usec + queue 37 usec + compute input 9 usec + compute infer 351 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 409242 infer/sec, latency 614 usec
```
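For reference, the only change to the reproducer configs for these GPU runs is the `instance_group` kind (a sketch of that stanza; everything else is unchanged):

```
instance_group {
  count: 1
  kind: KIND_GPU
}
```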

For a more complete picture, here is a log-log scatterplot of latency-throughput results for both FIL and ONNX on the model trained in your reproducer code. We can see that the FIL backend does not strictly dominate the ONNX backend (ONNX tends to outperform at very low batch size and concurrency), but as concurrency or batch size increases, FIL outperforms in both throughput and latency.

[Figure: latency_throughput — log-log scatterplot of latency vs. throughput for the FIL and ONNX backends on GPU]

So in general, when running on GPU for anything except very small loads on simple models, FIL should provide better performance than ONNX. On CPU, we expect to make improvements in the near future.

DayDayupupupup commented 2 years ago

Thanks for your detailed answer, @wphicks. Looking forward to FIL's CPU performance improvements in the future!

wphicks commented 2 years ago

@DayDayupupupup A brief update on this: The upcoming release 22.03 will have some CPU performance improvements, and #203 (likely to be included in 22.04) has significantly more. With #203, the same test that we performed before (but this time comparing CPU to CPU) gives us the following results (now presented in a somewhat cleaner form):

[Figure: latency-throughput comparison of the FIL (with #203) and ONNX backends, CPU to CPU]

For the extremely low-latency domain (< 1 ms), the ONNX backend still prevails on CPU, but we suspect that this is due not to the execution speed of the underlying libraries but to this issue: https://github.com/rapidsai/rapids-triton/issues/22. We'll look into that as well and keep pushing low-latency performance while continuing to optimize across all deployment scenarios.