synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0

ExpressionProcessor performance, features, and alternative solutions #610

Open rsivek opened 4 years ago

rsivek commented 4 years ago

I spent some time over the past day or so looking into performance and alternative expression processing solutions for the ExpressionProcessor class, which is part of the physiology branch. That branch will hopefully be merged into master soon via #589. I created local branches with modified ExpressionProcessor implementations that use the mXparser and EvalEx libraries, which are popular math expression processing solutions for Java, and compared their run times to the existing ExpressionProcessor implementation, which uses the CQL (Clinical Quality Language) Engine originally introduced to Synthea in #386.

Here are the results:

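[Image: run time comparison of the CQL Engine, mXparser, and EvalEx implementations of ExpressionProcessor]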

I was surprised to see that the mXparser solution was actually significantly (~15%) slower than CQL. While I couldn't tell from my profiling exactly why it was slower, I did see some relatively expensive BigDecimal methods and toString() calls being used by mXparser internally. mXparser also has a very extensive list of functions, constants, and other features which may come at a performance cost.

EvalEx, on the other hand, performed about 9% faster than the CQL engine.

An important caveat to note here is that the physiology branch currently uses only the expressions defined in the cardiac hemodynamics generator, so this comparison is nowhere near an exhaustive evaluation of the performance of the functions and operations available in each solution. It's entirely possible, for example, that a particular EvalEx function we're not currently utilizing takes much longer to execute than its equivalent in mXparser or CQL, though I think it's unlikely.

One important difference with CQL is that it supports many different data types and operations on them, including Boolean, String, Date, Decimal, and lists of those types. mXparser and EvalEx only accept numeric Decimal arguments, and neither currently supports lists (see https://github.com/mariuszgromada/MathParser.org-mXparser/issues/22 and https://github.com/uklimaschewski/EvalEx/issues/140). While using the list operations in CQL requires constructing Lists of BigDecimal objects, which incurs some cost in performance, having those operations supported by the engine is a rather nice feature. Since we require max and min operations in the expressions for the current circulation hemodynamics generators, my workaround to get these working with mXparser and EvalEx was to preprocess the list operations and provide the results to each of those engines. One benefit of that approach is that it allows us to do those list operations directly on the simulation results and avoids the cost of allocating new Lists of values. A significant downside, however, is that we have to implement all of the list operations ourselves. Expression syntax also potentially becomes more complex and difficult to use when we add preprocessing functions to the mix.
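To make the preprocessing workaround more concrete, here is a minimal sketch of the idea using only the Java standard library. It is not the code from my local branches: the class name `ListOpPreprocessor`, the `min_<name>`/`max_<name>` naming scheme, and the assumption that simulation results are available as `double[]` arrays are all hypothetical. It rewrites `min(x)` and `max(x)` calls into plain scalar variable names and computes the values directly on the arrays, so the numeric-only engine never sees a list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: resolves min()/max() over result series before evaluation. */
class ListOpPreprocessor {
    // Matches min(name) or max(name), where name is a simple identifier.
    private static final Pattern LIST_OP =
        Pattern.compile("\\b(min|max)\\(\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*\\)");

    /**
     * Rewrites each min(name)/max(name) call to a scalar variable such as max_name
     * and stores the computed value in {@code scalars} for the numeric engine to bind.
     */
    static String preprocess(String expression,
                             Map<String, double[]> series,
                             Map<String, Double> scalars) {
        Matcher m = LIST_OP.matcher(expression);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String op = m.group(1);
            String name = m.group(2);
            double[] values = series.get(name);
            if (values == null || values.length == 0) {
                throw new IllegalArgumentException("No values for series: " + name);
            }
            // Scan the primitive array directly; no List<BigDecimal> is allocated.
            double result = values[0];
            for (double v : values) {
                result = op.equals("min") ? Math.min(result, v) : Math.max(result, v);
            }
            String scalarName = op + "_" + name;
            scalars.put(scalarName, result);
            m.appendReplacement(out, scalarName);
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The rewritten expression string and the `scalars` map would then be handed to whichever numeric engine is in use. Every additional list operation (mean, sum, and so on) would need its own branch here, which is the maintenance cost mentioned above.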

Note that there are other math expression processing libraries for Java which I have not included in the evaluation. One in particular which I considered implementing is the exp4j library, but its scope is much more limited and it would require us to create some custom functions in order to use it the same way we use the other libraries. If someone thinks this option should be considered, however, I can make the effort to add it to the analysis.

Based on the above results, I think we have a decision to make regarding whether we should stick with CQL or switch to using EvalEx. While EvalEx appears to give us greater speed, it will also require more effort to implement and maintain over time because we would have to supply the list operations ourselves, which may also make the syntax a little more complicated. We should also consider the potential value of other data types such as Dates and Strings. In CQL, for example, we could potentially pass the GENDER attribute as a String which we could compare using == "M" or == "F" in expressions. This would be impossible using EvalEx since it only works with Decimal numbers.

In all honesty I was expecting the mXparser and EvalEx solutions to perform much faster than the CQL Engine, but was pleasantly surprised that they did not. If nothing else, this exercise has shown that even with its wider focus and scope, CQL Engine appears to perform quite well in comparison to popular expression processing libraries.

Any comments, suggestions, and/or additional data points to consider concerning this issue would be greatly appreciated.

mariuszgromada commented 2 years ago

This is a great summary. Thanks! mXparser, being based on double, implements various 'smart' rounding options (to minimize problems similar to 0.1+0.1+0.1, which is where BigDecimal is sometimes used). Additionally, mXparser is a highly flexible parser, which comes with a cost. Have you tried expression pre-compilation for efficient loop computation? You can also turn off the default rounding options, which should speed it up :-) Please see the Tutorial section on efficient calculations in loops and compare example 4 with example 5: https://mathparser.org/mxparser-tutorial/efficient-calculations-in-loops/ Best regards
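As a library-independent illustration of the pre-compilation idea, here is a minimal sketch using only the Java standard library. It does not use the mXparser API (the linked tutorial is the reference for that); the `CompiledExpression` class and its tiny grammar (numbers, variables, `+ - * /`, parentheses, no unary minus) are hypothetical. The point is only that the string is parsed once in the constructor, and each call in the loop walks the prebuilt tree with new variable values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mini-compiler: parses once, then evaluates many times. */
class CompiledExpression {
    /** One node of the prebuilt evaluation tree. */
    interface Node {
        double eval(Map<String, Double> vars);
    }

    private final String src;
    private int pos;
    private final Node root;

    CompiledExpression(String source) {
        this.src = source.replace(" ", "");
        this.root = parseSum(); // all parsing cost is paid here, once
        if (pos != src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + pos);
        }
    }

    /** Walks the prebuilt tree; no string parsing happens in this call. */
    double evaluate(Map<String, Double> vars) {
        return root.eval(vars);
    }

    private Node parseSum() {
        Node left = parseProduct();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            Node l = left;
            Node r = parseProduct();
            left = op == '+' ? v -> l.eval(v) + r.eval(v) : v -> l.eval(v) - r.eval(v);
        }
        return left;
    }

    private Node parseProduct() {
        Node left = parseAtom();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            Node l = left;
            Node r = parseAtom();
            left = op == '*' ? v -> l.eval(v) * r.eval(v) : v -> l.eval(v) / r.eval(v);
        }
        return left;
    }

    private Node parseAtom() {
        if (pos >= src.length()) {
            throw new IllegalArgumentException("Unexpected end of expression");
        }
        char c = src.charAt(pos);
        int start = pos;
        if (c == '(') {
            pos++;
            Node inner = parseSum();
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("Missing ')' at position " + pos);
            }
            pos++;
            return inner;
        }
        if (Character.isDigit(c)) {
            while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
                pos++;
            }
            double constant = Double.parseDouble(src.substring(start, pos));
            return v -> constant;
        }
        if (Character.isLetter(c)) {
            while (pos < src.length()
                && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) {
                pos++;
            }
            String name = src.substring(start, pos);
            // Looked up at evaluation time; an unbound name fails with a NullPointerException.
            return v -> v.get(name);
        }
        throw new IllegalArgumentException("Unexpected character '" + c + "' at position " + pos);
    }
}
```

In a simulation loop, the expression object would be created once outside the loop, and only the values in the variable map would change before each `evaluate` call.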