microsoft / onnxruntime-extensions

onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime
MIT License

Performance issue on basic string operations #742

Open · github-gauthier-perrod opened this issue 3 weeks ago

github-gauthier-perrod commented 3 weeks ago

Hello, first of all, thank you for this amazing project. We have a few questions regarding string manipulation in ONNX Runtime Extensions. Specifically, we are trying to incorporate simple string manipulations directly into our ONNX models, such as "string_upper" and "string_join".
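
For concreteness, here is roughly how we wire such an operator into a model. This is a minimal sketch, assuming the `ai.onnx.contrib` custom-op domain and the `StringUpper` op name exposed by onnxruntime-extensions:

```python
import numpy as np
from onnx import TensorProto, helper
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# One StringUpper node over a 1-D string tensor.
node = helper.make_node("StringUpper", ["text"], ["upper"],
                        domain="ai.onnx.contrib")
graph = helper.make_graph(
    [node], "string_upper_demo",
    [helper.make_tensor_value_info("text", TensorProto.STRING, [None])],
    [helper.make_tensor_value_info("upper", TensorProto.STRING, [None])],
)
model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 17),
    helper.make_opsetid("ai.onnx.contrib", 1),
])

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())  # load the extensions ops
sess = ort.InferenceSession(model.SerializeToString(), so,
                            providers=["CPUExecutionProvider"])
print(sess.run(None, {"text": np.array(["hello", "world"], dtype=object)}))
# expected: [array(['HELLO', 'WORLD'], dtype=object)]
```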

However, we have observed a significant performance impact. These operations appear to be unexpectedly expensive: they seem to cost more than matrix multiplication, even though our strings are always rather short (less than 30-40 characters). For instance, adding a "string_upper" operation on five features increases the inference time by a factor of 3 for a batch size of 1, and doubles it for a batch size of 10 in our benchmarks.

Even worse, though expected: when the operation is not fully vectorized (for example, we want to upper-case some features and lower-case others), those times add up practically linearly (processing one vector with 10 features is almost 10 times faster than splitting that vector and applying the operation separately to each of the 10 features; see the timing sketch below). This can be a real performance problem if we want to apply different simple transforms per feature.

We generated a benchmark to show this, which you can regenerate using the linked Python notebook: Benchmark_example.ipynb.zip
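
For readers without the notebook, a rough sketch of the split-versus-vectorized comparison (it reuses `sess` from the snippet above; timings are illustrative, not the notebook's results):

```python
import timeit

features = np.array(["feature_%d" % i for i in range(10)], dtype=object)

# One vectorized call over all 10 features.
vectorized = timeit.timeit(
    lambda: sess.run(None, {"text": features}), number=10_000)
# Ten separate single-feature calls, mimicking the split pipeline.
split = timeit.timeit(
    lambda: [sess.run(None, {"text": features[i:i + 1]}) for i in range(10)],
    number=10_000)
print(f"vectorized: {vectorized:.3f}s  split: {split:.3f}s  "
      f"ratio: {split / vectorized:.1f}x")
```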

Here is the profiling of the split/unsplit string upper operation. We can see that a simple operation like toUpper takes more time by itself than the whole MLP:

[Two profiling screenshots (2024-06-07): split and unsplit string_upper traces]

We suspect that the high cost may be due to copying overhead. Is there a reason we do all those copies? Are they actually necessary?

PS: We have verified that we are correctly using en_US.utf8.

Have you encountered this issue before? Is this performance impact expected? Could you provide any insights or recommendations on how to optimize these operations? Thanks a lot

wenbingl commented 3 weeks ago

I think most of the time is spent on the Python->C++->Python conversion and on new/delete of objects in C++. The copying you mention actually creates an output object, which is then re-used later in the loop. Heavy use of string manipulation was not thoroughly considered during development, and the implementation is a straightforward C++ one. It would take fine-grained CPU profiling to see where the time goes and how we can improve efficiency, which may lead to a more complicated C++ implementation.

Is there any real case where a model needs more string operations than MatMul and the other operations, or is this just a test?

FYI, the input strings should be UTF-8 encoded.

github-louis-fruleux commented 2 weeks ago

Hello @wenbingl sorry for the delay in the answer,

Actually, this is something we are experiencing in production: string preprocessing takes 70% of the total CPU time (with unstack, upper, and stack operations). What bothers us is that the string_upper ONNX operator seems slower than Python's equivalent str.upper().

Our pipeline is in Scala, but the issue seems reproducible in Python, so I will give you a minimal Python example.

We ran two benchmarks, one doing str.upper() in pure Python, the other letting the string_upper ONNX operator do it, and we see a substantial difference in performance:

Execution time (Python str.upper): 8.37116679904284 +/- 1.2217764614101005 us
Execution time (ONNX string_upper): 35.20608749677194 +/- 1.7959354508589989 us

(here is the benchmark reproducer)
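
In case the reproducer link does not load, a minimal sketch of an equivalent comparison (it assumes the `StringUpper` session `sess` built in the first snippet of this thread; absolute numbers will vary by machine):

```python
import timeit

data = ["some short string"] * 5          # strings under 30-40 chars
arr = np.array(data, dtype=object)
n = 100_000

py_total = timeit.timeit(lambda: [s.upper() for s in data], number=n)
ort_total = timeit.timeit(lambda: sess.run(None, {"text": arr}), number=n)
# seconds total / n runs * 1e6 = microseconds per iteration
print(f"python str.upper : {py_total / n * 1e6:.2f} us/iter")
print(f"onnx StringUpper : {ort_total / n * 1e6:.2f} us/iter")
```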

Regarding performance, I could take a look with some profiling of the C++ code if you think this could be a valuable contribution! Do you know of anyone having similar issues? Are we missing something obvious (such as string encoding)?

Thanks for your time and help

wenbingl commented 2 weeks ago


Yes, C++ profiling would be very helpful to see how much time is spent in this upper function versus the rest of the ORT session. Then we can decide on the next steps.
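
As a first step before full native profiling, ONNX Runtime's built-in session profiler can attribute time per operator. A sketch, where `model_with_string_ops.onnx` is a placeholder for your model:

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
so.enable_profiling = True  # write a Chrome-trace JSON with per-op timings
sess = ort.InferenceSession("model_with_string_ops.onnx", so,
                            providers=["CPUExecutionProvider"])
# ... run the benchmark loop here ...
trace_file = sess.end_profiling()  # flush and return the trace file name
print("per-op timings written to", trace_file)  # open in chrome://tracing
```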