Thanks for submitting this, @DayDayupupupup! Until recently, we have focused on maximizing the FIL backend's performance on GPUs, with CPU-only mode provided primarily for prototyping and quick development tests. We are going to be pushing CPU performance more in the near future by optimizing the upstream library used for CPU inference. Work has already begun there, but I do not have an immediate estimate on when that update is likely to come through.
I just reran your exact scenario but switched KIND_CPU to KIND_GPU for both ONNX and FIL. It is worth noting that FIL offers the greatest benefit for large batch sizes and complex models, but even with the relatively simple model we're using here, we see some improvement in both throughput and latency over ONNX on a Quadro RTX 8000 (roughly 25% higher throughput and 22% lower average latency):
FIL Results
```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 10011
    Throughput: 512563 infer/sec
    Avg latency: 478 usec (standard deviation 115 usec)
    p50 latency: 456 usec
    p90 latency: 601 usec
    p95 latency: 637 usec
    p99 latency: 739 usec
    Avg gRPC time: 478 usec ((un)marshal request/response 10 usec + response wait 468 usec)
  Server:
    Inference count: 3036672
    Execution count: 11862
    Successful request count: 11862
    Avg request latency: 245 usec (overhead 2 usec + queue 58 usec + compute input 34 usec + compute infer 51 usec + compute output 100 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 512563 infer/sec, latency 478 usec
```
ONNX Results
```
*** Measurement Settings ***
  Batch size: 256
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using asynchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 7993
    Throughput: 409242 infer/sec
    Avg latency: 614 usec (standard deviation 1404 usec)
    p50 latency: 424 usec
    p90 latency: 773 usec
    p95 latency: 780 usec
    p99 latency: 858 usec
    Avg gRPC time: 620 usec ((un)marshal request/response 7 usec + response wait 613 usec)
  Server:
    Inference count: 2423040
    Execution count: 9465
    Successful request count: 9465
    Avg request latency: 531 usec (overhead 130 usec + queue 37 usec + compute input 9 usec + compute infer 351 usec + compute output 4 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 409242 infer/sec, latency 614 usec
```
For a more complete picture, here is a log-log scatterplot of latency-throughput results for both FIL and ONNX on the model trained in your reproducer code. We can see that the FIL backend does not strictly dominate the ONNX backend (ONNX tends to outperform for very low batch size and concurrency), but as concurrency or batch size increase, FIL outperforms in both throughput and latency.
So in general, when running on GPU for anything except very small loads on simple models, FIL should provide better performance than ONNX. On CPU, we expect to make improvements in the near future.
Thanks for your detailed answer, @wphicks. Looking forward to FIL's CPU performance improvements in the future!
@DayDayupupupup A brief update on this: The upcoming release 22.03 will have some CPU performance improvements, and #203 (likely to be included in 22.04) has significantly more. With #203, the same test that we performed before (but this time comparing CPU to CPU) gives us the following results (now presented in a somewhat cleaner form):
For the extremely low-latency domain (< 1 ms), the ONNX backend still prevails on CPU, but we suspect that this is due not to the execution speed of the underlying libraries but to this issue: https://github.com/rapidsai/rapids-triton/issues/22. We'll look into that as well and keep pushing low-latency performance while continuing to optimize across all deployment scenarios.
Brief description
I have a scikit-learn model and deployed it on CPU using the FIL and ONNX backends, respectively. When comparing inference performance with the perf_analyzer tool, FIL's performance was significantly worse than ONNX's at batch size 256.
Environment
Steps to Reproduce the Issue (everything is done inside the container)
```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

df = pd.read_csv('Breast_cancer_data.csv')
X = df[['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 'mean_smoothness']]
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# define the model
model = RandomForestClassifier()

# fit the model on the whole dataset
model.fit(X_train, y_train)
ypred = model.predict(X_test)
print("Predicted Class:", ypred)
y_proba = model.predict_proba(X_test)
print("Predicted proba:", y_proba)

# save the trained scikit-learn model
with open("model.pkl", 'wb') as model_file:
    pickle.dump(model, model_file)

# convert the same model to ONNX
initial_type = [('input', FloatTensorType([None, 5]))]
onx = convert_sklearn(model, initial_types=initial_type, verbose=2,
                      target_opset=12, options={'zipmap': False})
output_map = {output.name: output for output in onx.graph.output}

# delete output label
onx.graph.output.remove(output_map['label'])

# onnx_model_path: destination for the converted model (defined elsewhere)
with open(onnx_model_path, "wb") as f:
    f.write(onx.SerializeToString())
```
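For context on what the benchmark exercises, here is a minimal sketch of how the deployed model might be queried at batch size 256 with the Triton Python gRPC client; perf_analyzer drives the server with the same kind of requests. The endpoint, model name, and tensor names below are assumptions (the FIL backend conventionally uses input__0/output__0) and would need to match the actual model configuration:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed endpoint and model name; adjust to the actual deployment.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# One request carrying a batch of 256 rows with the 5 features used above.
batch = np.random.rand(256, 5).astype(np.float32)

# Tensor names are assumptions; the ONNX model keeps the names from conversion.
infer_input = grpcclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)
requested_output = grpcclient.InferRequestedOutput("output__0")

result = client.infer("fil_model", inputs=[infer_input], outputs=[requested_output])
print(result.as_numpy("output__0").shape)
```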
```
Request concurrency: 1
  Client:
    Request count: 19235
    Throughput: 3847 infer/sec
    Avg latency: 253 usec (standard deviation 489 usec)
    p50 latency: 249 usec
    p90 latency: 256 usec
    p95 latency: 264 usec
    p99 latency: 283 usec
    Avg gRPC time: 250 usec ((un)marshal request/response 0 usec + response wait 250 usec)
  Server:
    Inference count: 23145
    Execution count: 23145
    Successful request count: 23145
    Avg request latency: 204 usec (overhead 1 usec + queue 161 usec + compute input 1 usec + compute infer 20 usec + compute output 21 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3847 infer/sec, latency 253 usec
```
batch=256
Current Result
When batch=1, FIL's performance is slightly better than ONNX's.
But when batch=256, FIL's performance is much worse than ONNX's.
Expected Result