**IGF-PSD** opened this issue 4 months ago
Hi @IGF-PSD and thank you for this idea!
This is a very exciting potential contribution, and we would of course be very happy for performance gains! We are also very thankful that you took the time to open an issue and start a discussion to best plan such an implementation.
In order to define the best course of action (RFC, direct implementation, level of collaboration, rollout plan…), could you share with us:
I understand this is a lot of additional information. If you prefer, we could also set up a synchronous video call to discuss these points and best understand your context.
I would also love to hear from community technical experts such as @benjello @benoit-cty @bonjourmauko @Morendil @nikhilwoodruff @eraviart if they have experience, opinions or insights on using numba or alternatives on the OpenFisca codebase.
Hello, this would be great! At LexImpact, @clallemand managed to compute income taxes on exhaustive data (the POTE dataset: 39M rows and more than 1K columns) in less than an hour, using the caching capabilities of OpenFisca to avoid recomputing all variables each time.
Hello,
Thank you very much for your feedback!
We noticed these delays while working on different scenarios for reforming employer social security contributions. Simulating a scenario for each month of the year could easily take an hour. Even using the caching capabilities of `openfisca`, these times seemed too slow to implement, for example, an under-constrained optimization program to find the optimal thresholds associated with each scenario within the framework of our micro-simulation model, or to use a sufficiently fine parameter grid.

Since we could not "connect" `openfisca` and `numba` directly, as the `Period`, `Entity`, `Parameters`, `Variable` etc. types are not natively compatible with `numba`, our first tests consisted in comparing a formula vectorized using numpy (which de facto ignores any optimizations included in openfisca to speed up this type of operation, but uses the same approach to perform the calculations) and its equivalent using numba.
To calculate, for example, the general relief on employer contributions (implemented in openfisca-france) on a simulated population of 20 million individuals, the vectorized formula takes 10.8 seconds, whereas its `numba` equivalent (sometimes using a functional form, or using `@jitclass`) runs on the same population in 200 ms. The performance gain is therefore of the order of 50× on this example.
3-4. Integrating numba with openfisca would have a certain number of consequences for the package and its syntax: `Period`, `Entity`, `Parameters`, `Variable`, `set_input` and `calculate_output` would need to be made compatible with `numba`. We don't know openfisca in detail, but to date we haven't seen any prohibitive obstacles in the operation of these classes and methods that would prevent such compatibility. This is undoubtedly a laborious stage, but we can work on it on our side. However, we are very open to further discuss these potential changes!
Thank you very much for this detailed answer! A 50× compute performance gain sounds exciting. The main point left for me to fully understand before considering an upgrade path is whether this can be a drop-in replacement or entails rewrites.
> To calculate, for example, general tax relief
Just to clarify: I understand that you extracted the code for that formula from OpenFisca (which makes total sense to get rid of the underlying non-numba classes), meaning that there was no caching. I thus assume that the formula was not recursive, as this would significantly change the assessment (i.e. maybe compute costs are insignificant compared to cache miss costs). Please let me know if that is not the case.
In order to better understand the cost of upgrading and the differences in syntax, would you be able to share the two formula implementations you used for this benchmark? This would be very useful for the community to assess the differences.
> we'll need to rewrite the variables for all socio-fiscal systems derived from openfisca
If every dependent needs to rewrite every variable, this is a major drawback as the upgrade cost will be immense. To be sure: do you mean that rewriting the variables is an opt-in to enable numba optimisation on a per-variable basis, i.e. that variables as they are will still work but will not be optimised, or that once the change is brought every variable will have to be rewritten to be compatible with the numba-optimised base classes?
> the current version of jitclass is not compatible with cuda and GPU parallelization […] we don't know to what extent implementing these jitclass may prove inconsistent with your other optimization projects in the package.
To my knowledge, we don't currently have any GPU optimisation in place in OpenFisca Core and have no resources secured for such an addition, so there would not be a competing implementation.
Hi @IGF-PSD and @MattiSG! Indeed, I've been testing out improving OpenFisca with `Cython` for a while now. I did some tests with `numba` a while ago, and I know that @eraviart has also been trying to integrate `cuda`.

Just a concern, however: `numba` is a compiler, and we do not have the resources to ship compiled binaries for every platform (unless that is what `conda` does already, @benoit-cty @sandcha?), so, at best, what we can do is improve the actual implementation of openfisca "primitives" so they can profit from just-in-time optimisation.
As I've been pushing modularisation and immutability for some time, some work has already been shipped, and some more is in the WIP queue. Making code optimisable is not that hard: it is a matter of extracting, from the performance bottlenecks, the objects that can't be compiled natively to C/C++.

So it is, above all, a design effort: pure compiler-aware refactoring.
BTW, I know traction for typing is not huge within the community yet; still, type-hinting is just about soft-enforcing the good design that allows for compiler optimisations.
See, for example, a header file for `Cython`, which is, well, very similar to a Python type:
```cython
from retrievals import Page, PageParser

ctypedef object Page_t
ctypedef object PageParser_t

cdef class DocSplitter:
    cpdef list[Page_t] split(self, PageParser_t parser, TextSplitter text_splitter)

cdef class TextSplitter:
    cdef public int chunk_size
    cdef public int chunk_overlap
    cdef char* separator
    cdef object separator_pattern
    cpdef list[char*] split(self, char* text)
    cpdef list[char*] merge(self, list[char*] splits)
    cdef list[char*] get_current_chunk(self, list current_chunk)
    cdef int get_current_len(self, list[char*] current_chunk)
```
If we just enforce some of these constraints and export pure ADTs, users will be able to easily perform `jit`, `aot`, use `cuda`, or whatever atomic optimisation they want.
Another option, probably harder up-front but cheaper long-tail, that I think we discussed with @cbenz, is to generalise dependency injection of the classes that can be implemented with primitive C types.
See, for example:
```python
def calculate_add(self, variable_name: str, period):
    variable: Optional[Variable]
    variable = self.tax_benefit_system.get_variable(
        variable_name, check_existence=True
    )

    if variable is None:
        raise errors.VariableNotFoundError(variable_name, self.tax_benefit_system)

    if period is not None and not isinstance(period, periods.Period):
        period = periods.period(period)

    # Check that the requested period matches definition_period
    if periods.unit_weight(variable.definition_period) > periods.unit_weight(
        period.unit
    ):
        raise ValueError(
            f"Unable to compute variable '{variable.name}' for period "
            f"{period}: '{variable.name}' can only be computed for "
            f"{variable.definition_period}-long periods. You can use the "
            f"DIVIDE option to get an estimate of {variable.name}."
        )

    if variable.definition_period not in (
        periods.DateUnit.isoformat + periods.DateUnit.isocalendar
    ):
        raise ValueError(
            f"Unable to ADD constant variable '{variable.name}' over "
            f"the period {period}: eternal variables can't be summed "
            "over time."
        )

    return sum(
        self.calculate(variable_name, sub_period)
        for sub_period in period.get_subperiods(variable.definition_period)
    )
```
There is no way to optimise that without extracting dependencies from the function:
```python
def guard(variable: Just[Variable], period: Just[Period]) -> TypeGuard[Match[Variable, Period]]:
    # Check that the requested period matches definition_period
    if Period.unit_weight(variable.definition_period) <= Period.unit_weight(period.unit):
        return True

    raise ValueError(
        f"Unable to compute variable '{variable.name}' for period "
        f"{period}: '{variable.name}' can only be computed for "
        f"{variable.definition_period}-long periods. You can use the "
        f"DIVIDE option to get an estimate of {variable.name}."
    )

# etc...

@numba.jit(forceobj=True)
def sub_periods(variable: Just[Variable], period: Just[Period]) -> Just[Sequence[Tuple[int]]]:
    return set(sub_period.value for sub_period in period.get_subperiods(variable.definition_period))

@numba.jit(nopython=True)
def calculate_add(calc: Callable[[str, tuple], float], match: Just[Match[float, set[tuple[int]]]]):
    v_values, p_values = match
    return sum(map(calc, v_values, p_values))
```
Something like that.
Here is a working experiment. As you can see, 75% of the exercise is in typing:
```python
@guvectorize(
    [(float32[:], float32[:], float32[:], float32[:])], "(n),(o),(p)->(n)"
)  # pyright: ignore[reportUnknownMemberType,reportUntypedFunctionDecorator]
def apply_thresholds(
    input: t.FloatArray,
    thresholds: t.FloatArray,
    choices: t.FloatArray,
    result: t.FloatArray,
) -> None:
    """Make a choice based on an input and thresholds.

    From a list of ``choices``, this function selects one of these values
    based on a list of inputs, depending on the value of each ``input`` within
    a list of ``thresholds``.

    Args:
        input: A list of inputs to make a choice from.
        thresholds: A list of thresholds to choose.
        choices: A list of the possible values to choose from.
        result: A list of the values chosen.

    Raises:
        ValueError: When the number of ``thresholds`` is not equal to the
            number of ``choices``, or one less.

    Examples:
        >>> input = numpy.array([4, 5, 6, 7, 8], dtype=numpy.float32)
        >>> thresholds = numpy.array([5, 7], dtype=numpy.float32)
        >>> choices = numpy.array([10, 15, 20], dtype=numpy.float32)
        >>> apply_thresholds(input, thresholds, choices)
        array([10., 10., 15., 15., 20.], dtype=float32)

    """
    condlist = [numpy.array([x], dtype=t.BoolDType) for x in range(0)]

    for threshold in thresholds:
        condlist.append(input <= threshold)  # noqa: PERF401

    choicelist = [numpy.array([x], dtype=t.FloatDType) for x in range(0)]

    for choice in choices:
        choicelist.append(numpy.array([choice], dtype=t.FloatDType))  # noqa: PERF401

    if len(condlist) == len(choicelist) - 1:
        # If a choice is provided for input > highest threshold, last condition
        # must be true to return it.
        condlist.append(numpy.array([True], dtype=t.BoolDType))

    if len(condlist) != len(choicelist):
        msg = (
            "'apply_thresholds' must be called with the same number of "
            "thresholds than choices, or one more choice."
        )
        raise ValueError(msg)

    array = numpy.select(condlist, choicelist)

    for i in range(array.size):
        result[i] = array[i]
```
It has one big feature that I wasn't expecting, which is being able to define native `numpy.ufunc` and `numpy.gufunc` (the example above is a `gufunc`).

It is not that bad: it looks like `C` in the logic, but it doesn't impose a DSL like `Cython` does.
And the compilation is `jit`, so we don't have to worry about distributing compiled binaries.

I'd actually pick one bottleneck method, fully type it, make it pure, and try how it goes with `numba`.
I remember @guillett @sandcha and @Morendil studied the bottlenecks.
For general optimisations (those excluding extending numpy) I'd rather rewrite them in `rust` with `python` bindings.
We found that iterating certain simulations on exhaustive data using openfisca can be very slow. As openfisca relies heavily on `numpy` arrays, we thought that wrapping some of the package's classes with the numba package's `jitclass` decorator could greatly speed up calculations.

What do you think of this idea? Would you be interested in a merge request including these changes for a future release of the package?