triton-inference-server / fil_backend

FIL backend for the Triton Inference Server

fil_backend's performance is much worse on CPU than onnx backend for sklearn #168

Closed DayDayupupupup closed 2 years ago

DayDayupupupup commented 2 years ago

Brief description

I have a scikit-learn model and deployed it on CPU using the FIL and ONNX backends respectively. When comparing inference performance with the perf_analyzer tool, FIL's performance was significantly worse than ONNX's at batch size 256.

Environment

1. Train a scikit-learn model

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('Breast_cancer_data.csv')
X = df[['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# define the model
model = RandomForestClassifier()

# fit the model on the training split
model.fit(X_train, y_train)
ypred = model.predict(X_test)
print("Predicted Class:", ypred)
y_proba = model.predict_proba(X_test)
print("Predicted proba:", y_proba)

# pickle the trained model
with open("model.pkl", 'wb') as model_file:
    pickle.dump(model, model_file)
```

2. Convert to a treelite checkpoint

```bash
python -m treelite.serialize --input-model model.pkl --input-model-type sklearn_pkl --output-checkpoint checkpoint.tl
```
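As an alternative to the CLI invocation above, the same conversion can be done through treelite's Python API; a minimal sketch, assuming a treelite release (2.x/3.x) where `treelite.sklearn.import_model` and `Model.serialize` are available:

```python
# Sketch: build a Treelite model from the pickled RandomForestClassifier and
# write the checkpoint file consumed by the FIL backend.
import pickle

import treelite

with open("model.pkl", "rb") as f:
    skl_model = pickle.load(f)

tl_model = treelite.sklearn.import_model(skl_model)
tl_model.serialize("checkpoint.tl")
```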
3. Convert to ONNX

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import pickle
import numpy as np
import onnx

f = open("model.pkl", 'rb')
model = pickle.load(f)
model.ir_version = 7
f.close()
onnx_model_path = 'model.onnx'

initial_type = [('input', FloatTensorType([None, 5]))]
onx = convert_sklearn(model, initial_types=initial_type, verbose=2, target_opset=12, options={'zipmap': False})
output_map = {output.name: output for output in onx.graph.output}

# delete the label output
onx.graph.output.remove(output_map['label'])

with open(onnx_model_path, "wb") as f:
    f.write(onx.SerializeToString())
```
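As an optional sanity check (not part of the reproducer above), the exported ONNX model can be run locally and compared against scikit-learn; a rough sketch, assuming `onnxruntime` is installed and that `model` and `X_test` from the earlier snippets are still in scope:

```python
# Sketch: verify the exported ONNX model against sklearn's predict_proba.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
sample = X_test.to_numpy()[:4].astype(np.float32)  # a few rows of the 5 features

# 'input' and 'probabilities' match the names used in the conversion and config
(onnx_proba,) = sess.run(["probabilities"], {"input": sample})
print("ONNX proba:   ", onnx_proba)
print("sklearn proba:", model.predict_proba(sample))
```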

4. Model configuration

Model repository layout:

```
├── sklearn_fil
│   ├── 1
│   │   └── checkpoint.tl
│   └── config.pbtxt
└── sklearn_onnx
    ├── 20220112161732
    │   └── model.onnx
    └── config.pbtxt
```

config.pbtxt for sklearn_fil:

```
name: "sklearn_fil"
backend: "fil"
max_batch_size: 2048
input {
  name: "input__0"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "output__0"
  data_type: TYPE_FP32
  dims: 2
}
instance_group {
  count: 1
  kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 256
  preferred_batch_size: 512
  preferred_batch_size: 1024
  max_queue_delay_microseconds: 100
}
model_warmup {
  name: "warmup_data"
  batch_size: 1
  inputs {
    key: "input__0"
    value {
      data_type: TYPE_FP32
      dims: 5
      zero_data: true
    }
  }
}
parameters { key: "model_type" value: { string_value: "treelite_checkpoint" } }
parameters { key: "predict_proba" value: { string_value: "true" } }
parameters { key: "output_class" value: { string_value: "true" } }
parameters { key: "threshold" value: { string_value: "0.5" } }
parameters { key: "algo" value: { string_value: "ALGO_AUTO" } }
parameters { key: "storage_type" value: { string_value: "AUTO" } }
parameters { key: "blocks_per_sm" value: { string_value: "0" } }
parameters { key: "threads_per_tree" value: { string_value: "1" } }
parameters { key: "transfer_threshold" value: { string_value: "0" } }
name: "sklearn_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 2048
input {
  name: "input"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "probabilities"
  data_type: TYPE_FP32
  dims: 2
}
instance_group {
  count: 1
  kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 256
  preferred_batch_size: 512
  preferred_batch_size: 1024
  max_queue_delay_microseconds: 100
}
model_warmup {
  name: "warmup_data"
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: 5
      zero_data: true
    }
  }
}
parameters { key: "intra_op_thread_count" value: { string_value: "8" } }
parameters { key: "cudnn_conv_algo_search" value: { string_value: "1" } }
  1. Deploy the Triton service
    CUDA_VISIBLE_DEVICES=-1 tritonserver --model-repository=/models/perf_test/ --strict-model-config=false --log-verbose=0 --metrics-port=6000
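For reference, a minimal sketch of sending one batched request to the deployed `sklearn_fil` model with the Triton Python gRPC client (assuming `tritonclient[grpc]` is installed; tensor names match the config above):

```python
# Sketch: query the running server with a random batch of 256 rows x 5 features.
import numpy as np
import tritonclient.grpc as triton_grpc

client = triton_grpc.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(256, 5).astype(np.float32)
infer_input = triton_grpc.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
requested = triton_grpc.InferRequestedOutput("output__0")

result = client.infer("sklearn_fil", inputs=[infer_input], outputs=[requested])
print(result.as_numpy("output__0").shape)  # expected (256, 2): per-class probabilities
```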
6. Performance analysis

batch=1

```
./perf_analyzer -a -b 1 -u localhost:8001 -i gRPC -m sklearn_fil --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 19235
    Throughput: 3847 infer/sec
    Avg latency: 253 usec (standard deviation 489 usec)
    p50 latency: 249 usec
    p90 latency: 256 usec
    p95 latency: 264 usec
    p99 latency: 283 usec
    Avg gRPC time: 250 usec ((un)marshal request/response 0 usec + response wait 250 usec)
  Server:
    Inference count: 23145
    Execution count: 23145
    Successful request count: 23145
    Avg request latency: 204 usec (overhead 1 usec + queue 161 usec + compute input 1 usec + compute infer 20 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3847 infer/sec, latency 253 usec
```

```
./perf_analyzer -a -b 1 -u localhost:8001 -i gRPC -m sklearn_onnx --concurrency-range 1
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 15304
    Throughput: 3060.8 infer/sec
    Avg latency: 316 usec (standard deviation 456 usec)
    p50 latency: 307 usec
    p90 latency: 317 usec
    p95 latency: 321 usec
    p99 latency: 369 usec
    Avg gRPC time: 312 usec ((un)marshal request/response 1 usec + response wait 311 usec)
  Server:
    Inference count: 18445
    Execution count: 18445
    Successful request count: 18445
    Avg request latency: 332 usec (overhead 139 usec + queue 168 usec + compute input 5 usec + compute infer 16 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3060.8 infer/sec, latency 316 usec
```

batch=256

```
./perf_analyzer -a -b 256 -u localhost:8001 -i gRPC -m sklearn_fil --concurrency-range 1
*** Measurement Settings ***
  Batch size: 256
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 1108
    Throughput: 56729.6 infer/sec
    Avg latency: 4504 usec (standard deviation 1101 usec)
    p50 latency: 4417 usec
    p90 latency: 4490 usec
    p95 latency: 4688 usec
    p99 latency: 5711 usec
    Avg gRPC time: 4492 usec ((un)marshal request/response 6 usec + response wait 4486 usec)
  Server:
    Inference count: 340992
    Execution count: 1332
    Successful request count: 1332
    Avg request latency: 4390 usec (overhead 2 usec + queue 20 usec + compute input 1 usec + compute infer 4339 usec + compute output 28 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 56729.6 infer/sec, latency 4504 usec
```

```
./perf_analyzer -a -b 256 -u localhost:8001 -i gRPC -m sklearn_onnx --concurrency-range 1
*** Measurement Settings ***
  Batch size: 256
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 17011
    Throughput: 870963 infer/sec
    Avg latency: 286 usec (standard deviation 120 usec)
    p50 latency: 277 usec
    p90 latency: 311 usec
    p95 latency: 322 usec
    p99 latency: 382 usec
    Avg gRPC time: 283 usec ((un)marshal request/response 5 usec + response wait 278 usec)
  Server:
    Inference count: 5169408
    Execution count: 20193
    Successful request count: 20193
    Avg request latency: 266 usec (overhead 96 usec + queue 19 usec + compute input 5 usec + compute infer 142 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 870963 infer/sec, latency 286 usec
```

Current Result

At batch=1, FIL's performance is slightly better than ONNX's.

But at batch=256, FIL's performance is much worse than ONNX's.

Expected Result

  1. Why does this happen? Did I make a mistake?
  2. Is FIL better suited to GPU-based machine learning, such as cuML models?
wphicks commented 2 years ago

Thanks for submitting this, @DayDayupupupup! Until recently, we have focused on maximizing the FIL backend's performance on GPUs, with CPU-only mode provided primarily for prototyping and quick development tests. We are going to be pushing CPU performance more in the near future by optimizing the upstream library used for CPU inference. Work has already begun there, but I do not have an immediate estimate on when that update is likely to come through.

I just reran your exact scenario but switched KIND_CPU to KIND_GPU for both ONNX and FIL. It is worth noting that FIL offers the greatest benefit for large batch sizes and complex models, but even with the relatively simple model we're using here, we see some improvement in both throughput and latency over ONNX (Quadro RTX 8000):

FIL Results

```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 10011
    Throughput: 512563 infer/sec
    Avg latency: 478 usec (standard deviation 115 usec)
    p50 latency: 456 usec
    p90 latency: 601 usec
    p95 latency: 637 usec
    p99 latency: 739 usec
    Avg gRPC time: 478 usec ((un)marshal request/response 10 usec + response wait 468 usec)
  Server:
    Inference count: 3036672
    Execution count: 11862
    Successful request count: 11862
    Avg request latency: 245 usec (overhead 2 usec + queue 58 usec + compute input 34 usec + compute infer 51 usec + compute output 100 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 512563 infer/sec, latency 478 usec
```

ONNX Results

```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 7993
    Throughput: 409242 infer/sec
    Avg latency: 614 usec (standard deviation 1404 usec)
    p50 latency: 424 usec
    p90 latency: 773 usec
    p95 latency: 780 usec
    p99 latency: 858 usec
    Avg gRPC time: 620 usec ((un)marshal request/response 7 usec + response wait 613 usec)
  Server:
    Inference count: 2423040
    Execution count: 9465
    Successful request count: 9465
    Avg request latency: 531 usec (overhead 130 usec + queue 37 usec + compute input 9 usec + compute infer 351 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 409242 infer/sec, latency 614 usec
```
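For reference, the only change to the reproducer configs for these GPU runs is the `instance_group` kind (a sketch of that stanza; everything else is unchanged):

```
instance_group {
  count: 1
  kind: KIND_GPU
}
```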

For a more complete picture, here is a log-log scatterplot of latency-throughput results for both FIL and ONNX on the model trained in your reproducer code. We can see that the FIL backend does not strictly dominate the ONNX backend (ONNX tends to outperform at very low batch size and concurrency), but as concurrency or batch size increases, FIL outperforms in both throughput and latency.

[Figure: latency_throughput — log-log scatterplot of latency vs. throughput for the FIL and ONNX backends on GPU]

So in general, when running on GPU for anything except very small loads on simple models, FIL should provide better performance than ONNX. On CPU, we expect to make improvements in the near future.

DayDayupupupup commented 2 years ago

Thanks for your detailed answer, @wphicks. Looking forward to FIL's CPU performance improvements in the future!

wphicks commented 2 years ago

@DayDayupupupup A brief update on this: The upcoming release 22.03 will have some CPU performance improvements, and #203 (likely to be included in 22.04) has significantly more. With #203, the same test that we performed before (but this time comparing CPU to CPU) gives us the following results (now presented in a somewhat cleaner form):

[Figure: latency-throughput comparison of the FIL (with #203) and ONNX backends, CPU to CPU]

For the extremely low-latency domain (< 1 ms), the ONNX backend still prevails on CPU, but we suspect that this is due not to the execution speed of the underlying libraries but to this issue: https://github.com/rapidsai/rapids-triton/issues/22. We'll look into that as well and keep pushing low-latency performance while continuing to optimize across all deployment scenarios.