openfisca / openfisca-core

OpenFisca core engine. See other repositories for country-specific code & data.
https://openfisca.org
GNU Affero General Public License v3.0

Accelerate simulations using numba #1214

Open IGF-PSD opened 4 months ago

IGF-PSD commented 4 months ago

We found that running certain simulations repeatedly on exhaustive data using openfisca can be very slow. Since openfisca relies heavily on numpy arrays, we thought that wrapping some of the package's classes with the numba package's jitclass decorator could greatly speed up calculations.

What do you think of this idea? Would you be interested in a merge request including these changes for a future release of the package?
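To make the idea concrete, here is a minimal, hypothetical sketch of what wrapping a small value-holding class with numba's jitclass looks like. The `Holder` class and its field are invented for illustration (they are not OpenFisca classes), and a no-op fallback keeps the snippet runnable when numba is not installed:

```python
import numpy as np

try:
    # jitclass compiles the whole class, attributes and methods, to machine code
    from numba import float64
    from numba.experimental import jitclass
    spec = [("values", float64[:])]  # attribute types must be declared up front
except ImportError:
    # hedge: fall back to a plain Python class when numba is unavailable
    spec = []
    def jitclass(spec):
        return lambda cls: cls

@jitclass(spec)
class Holder:
    def __init__(self, values):
        self.values = values

    def total(self):
        # runs in nopython mode when compiled, so it loops at C speed
        return self.values.sum()
```

The key constraint is visible in `spec`: every attribute must have a static numba type, which is exactly why dynamic classes such as OpenFisca's Period, Entity or Variable cannot be wrapped as-is.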

MattiSG commented 4 months ago

Hi @IGF-PSD and thank you for this idea!

This is a very exciting potential contribution, and we would of course be very happy for performance gains 😃 We are also very thankful that you took the time to open an issue and start a discussion to best plan such an implementation 👍

In order to define the best course of action (RFC, direct implementation, level of collaboration, rollout plan…), could you share with us:

  1. Concrete use cases where current performance is a bottleneck, in particular the size and type of "exhaustive data" you mention.
  2. Existing benchmarks that you might have made using the current codebase vs partial implementations with numba you might already have.
  3. Expected impact on the public Python API (and, if any, on the Web API, but as I understand it there should be none).
  4. Expected impact on the private API.
  5. Expected costs of maintenance over time (do the numba decorators need to be updated? how common is the knowledge required to do so?).
  6. Any additional requirements for deployment that would be brought by this change, such as operating system constraints or availability of a GPU on the host machine.

I understand this is a lot of additional information. If you prefer, we could also set up a synchronous video call to discuss these points and best understand your context 🙂

I would also love to hear from community technical experts such as @benjello @benoit-cty @bonjourmauko @Morendil @nikhilwoodruff @eraviart if they have experience, opinions or insights on using numba or alternatives on the OpenFisca codebase 🙂

benoit-cty commented 4 months ago

Hello, this would be great! At LexImpact, @clallemand managed to compute income tax on exhaustive data (POTE dataset: 39M lines and more than 1K columns) in less than an hour, using the caching capabilities of OpenFisca to avoid recomputing all variables each time.

IGF-PSD commented 3 months ago

Hello,

Thank you very much for your feedback!

  1. We noticed these delays while working on different scenarios for reforming employer social security contributions. Simulating a scenario for each month of the year could easily take an hour. Even using the caching capabilities of openfisca, these times were too slow to, for example, implement an under-constrained optimization program to find the optimal thresholds for each scenario within our micro-simulation model, or to use a sufficiently fine parameter grid.

  2. Without being able to "connect" openfisca and numba, as the Period, Entity, Parameters, Variable, etc. types are not natively compatible with numba, our first tests compared a formula vectorized using numpy (which de facto ignores any optimizations included in openfisca to speed up this type of operation, but uses the same approach to perform the calculations) against its numba equivalent. To calculate, for example, general tax relief (which is a reduction in employer contributions implemented in openfisca-france) on a simulated population of 20 million individuals, the vectorized formula takes 10.8 seconds, whereas its numba equivalent (sometimes using a functional form, sometimes @jitclass) runs on the same population in 200 ms. Performance gains are therefore of the order of 50× on this example.

3-4. Integrating numba with openfisca would have a certain number of consequences for the package and its syntax:

  5. We've been using numba for several years and the syntax has never really changed over time. Apart from the class attributes we mentioned earlier, we don't see any potential future maintenance costs.
  6. There are no operating system constraints associated with this modification.

However, we are very open to further discussing these potential changes!
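For reference, the kind of micro-benchmark described in point 2 can be sketched as follows. The formula below is an invented linear taper, not the actual openfisca-france general relief, and the SMIC value is illustrative; a no-op fallback keeps the snippet runnable without numba:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # hedge: degrade to pure Python when numba is unavailable
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

SMIC = 1766.92  # illustrative monthly gross minimum wage, not an official parameter

def relief_numpy(salary):
    # vectorised numpy version: relief tapers linearly to zero at 1.6 * SMIC
    coef = np.maximum(0.0, 0.3234 / 0.6 * (1.6 * SMIC / np.maximum(salary, SMIC) - 1.0))
    return coef * salary

@njit
def relief_numba(salary):
    # same arithmetic as above, as an explicit loop that numba compiles to machine code
    out = np.empty_like(salary)
    for i in range(salary.size):
        s = max(salary[i], SMIC)
        out[i] = max(0.0, 0.3234 / 0.6 * (1.6 * SMIC / s - 1.0)) * salary[i]
    return out
```

Timing both versions with `timeit` on a 20-million-row array is how figures like 10.8 s vs 200 ms would be obtained; note that the first call to the numba version also pays the compilation cost.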

MattiSG commented 3 months ago

Thank you very much for this detailed answer! A 50× compute performance gain sounds exciting 😃 The main point left for me to fully understand before considering an upgrade path is whether this can be a drop-in replacement or entails rewrites.

> To calculate, for example, general tax relief

Just to clarify: I understand that you extracted the code for that formula from OpenFisca (which makes total sense to get rid of the underlying non-numba classes), meaning that there was no caching. I thus assume that the formula was not recursive, as this would significantly change the assessment (i.e. maybe compute costs are insignificant compared to cache miss costs). Please let me know if that is not the case 🙂

In order to better understand the cost of upgrading and the differences in syntax, would you be able to share the two formula implementations you used for this benchmark? This would be very useful for the community to assess the differences.

> we'll need to rewrite the variables for all socio-fiscal systems derived from openfisca

If every dependent needs to rewrite every variable, this is a major drawback as the upgrade cost will be immense. To be sure: do you mean that rewriting the variables is an opt-in to enable numba optimisation on a per-variable basis, i.e. that variables as they are will still work but will not be optimised, or that once the change is brought every variable will have to be rewritten to be compatible with the numba-optimised base classes?

> the current version of jitclass is not compatible with cuda and GPU parallelization […] we don't know to what extent implementing these jitclass may prove inconsistent with your other optimization projects in the package.

To my knowledge, we don't currently have any GPU optimisation in place in OpenFisca Core and have no resources secured for such an addition, so there would not be a competing implementation 🙂

bonjourmauko commented 3 months ago

Hi @IGF-PSD and @MattiSG! Indeed, I've been experimenting with improving OpenFisca with Cython for a while now. I did some tests with numba a while ago, and I know that @eraviart has also been trying to integrate cuda.

Just one concern, however: numba is a compiler, and we do not have the resources to ship compiled binaries for every platform (unless that is what conda already does, @benoit-cty @sandcha?). So, at best, what we can do is improve the actual implementation of openfisca "primitives" so they can profit from just-in-time optimisation.

As I've been pushing modularisation and immutability for some time, some work has already been shipped, and some more is in the WIP queue. Making code optimisable is not that hard: it is a matter of extracting, from the performance bottlenecks, the objects that can't be compiled natively to C/C++.

So it is, above all, a design effort: pure compiler-aware refactoring.
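A minimal sketch of what such compiler-aware refactoring could look like (the names and the dict-based "holder" are hypothetical, not OpenFisca's actual API): the outer shell keeps dealing with rich objects, while only a pure numeric kernel is handed to numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # hedge: no-op decorator when numba is unavailable
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit
def kernel(values, weights):
    # pure numeric kernel: only primitive arrays, so numba can compile it
    total = 0.0
    for i in range(values.size):
        total += values[i] * weights[i]
    return total

def weighted_total(holder):
    # shell: extracts primitives from a rich object before entering compiled code
    return kernel(
        np.asarray(holder["values"], dtype=np.float64),
        np.asarray(holder["weights"], dtype=np.float64),
    )
```

The shell/kernel split is the "extraction" mentioned above: nothing that numba cannot type ever crosses into the compiled function.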

bonjourmauko commented 3 months ago

BTW, I know traction for typing is not huge within the community yet, but type-hinting is just about soft-enforcing good design that allows for compiler optimisations.

See, for example, a header file for Cython, which is, well, very similar to Python type hints:

```cython
from retrievals import Page, PageParser

ctypedef object Page_t
ctypedef object PageParser_t

cdef class DocSplitter:
    cpdef list[Page_t] split(self, PageParser_t parser, TextSplitter text_splitter)

cdef class TextSplitter:
    cdef public int chunk_size
    cdef public int chunk_overlap
    cdef char* separator
    cdef object separator_pattern

    cpdef list[char*] split(self, char* text)
    cpdef list[char*] merge(self, list[char*] splits)
    cdef list[char*] get_current_chunk(self, list current_chunk)
    cdef int get_current_len(self, list[char*] current_chunk)
```

If we just enforce some of these constraints and export pure ADTs, users will be able to easily apply JIT, AOT, CUDA, or whatever atomic optimisation they want.
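In Python itself, the same constraint can be soft-enforced today with a frozen, fully typed dataclass. The `Chunking` class below is a made-up example (mirroring the Cython header's fields) of the kind of pure ADT meant here:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunking:
    # every field is a primitive with a static type, so a compiler
    # (Cython, numba, mypyc...) can map it straight onto C types
    chunk_size: int
    chunk_overlap: int
    separator: str
```

Immutability (`frozen=True`) matters too: a value that never changes after construction is far easier for a compiler to reason about.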

bonjourmauko commented 3 months ago

Another option, probably harder up-front but cheaper in the long run, which I think we discussed with @cbenz, is to generalise dependency injection of the classes that can be implemented with primitive C types.

See, for example:

```python
def calculate_add(self, variable_name: str, period):
    variable: Optional[Variable]

    variable = self.tax_benefit_system.get_variable(
        variable_name, check_existence=True
    )

    if variable is None:
        raise errors.VariableNotFoundError(variable_name, self.tax_benefit_system)

    if period is not None and not isinstance(period, periods.Period):
        period = periods.period(period)

    # Check that the requested period matches definition_period
    if periods.unit_weight(variable.definition_period) > periods.unit_weight(
        period.unit
    ):
        raise ValueError(
            f"Unable to compute variable '{variable.name}' for period "
            f"{period}: '{variable.name}' can only be computed for "
            f"{variable.definition_period}-long periods. You can use the "
            f"DIVIDE option to get an estimate of {variable.name}."
        )

    if variable.definition_period not in (
        periods.DateUnit.isoformat + periods.DateUnit.isocalendar
    ):
        raise ValueError(
            f"Unable to ADD constant variable '{variable.name}' over "
            f"the period {period}: eternal variables can't be summed "
            "over time."
        )

    return sum(
        self.calculate(variable_name, sub_period)
        for sub_period in period.get_subperiods(variable.definition_period)
    )
```

There is no way to optimise that without extracting dependencies from the function:

```python
def guard(variable: Just[Variable], period: Just[Period]) -> TypeGuard[Match[Variable, Period]]:
    # Check that the requested period matches definition_period
    if Period.unit_weight(variable.definition_period) <= Period.unit_weight(period.unit):
        return True

    raise ValueError(
        f"Unable to compute variable '{variable.name}' for period "
        f"{period}: '{variable.name}' can only be computed for "
        f"{variable.definition_period}-long periods. You can use the "
        f"DIVIDE option to get an estimate of {variable.name}."
    )

# etc...

@numba.jit(forceobj=True)
def sub_periods(variable: Just[Variable], period: Just[Period]) -> Just[Sequence[Tuple[int]]]:
    return set(sub_period.value for sub_period in period.get_subperiods(variable.definition_period))

@numba.jit(nopython=True)
def calculate_add(calc: Callable[[str, tuple], float], match: Just[Match[float, set[tuple[int]]]]):
    return sum(map(calc, *match))
```

Something like that.

bonjourmauko commented 1 month ago

Here is a working experiment. As you can see, 75% of the exercise is in typing:

```python
@guvectorize(
    [(float32[:], float32[:], float32[:], float32[:])], "(n),(o),(p)->(n)"
)  # pyright: ignore[reportUnknownMemberType,reportUntypedFunctionDecorator]
def apply_thresholds(
    input: t.FloatArray,
    thresholds: t.FloatArray,
    choices: t.FloatArray,
    result: t.FloatArray,
) -> None:
    """Make a choice based on an input and thresholds.

    From a list of ``choices``, this function selects one of these values
    based on a list of inputs, depending on the value of each ``input`` within
    a list of ``thresholds``.

    Args:
        input: A list of inputs to make a choice from.
        thresholds: A list of thresholds to choose.
        choices: A list of the possible values to choose from.
        result: A list of the values chosen.

    Raises:
        ValueError: When the number of ``thresholds`` is not equal to the
            number of ``choices``, or one less.

    Examples:
        >>> input = numpy.array([4, 5, 6, 7, 8], dtype=numpy.float32)
        >>> thresholds = numpy.array([5, 7], dtype=numpy.float32)
        >>> choices = numpy.array([10, 15, 20], dtype=numpy.float32)
        >>> apply_thresholds(input, thresholds, choices)
        array([10., 10., 15., 15., 20.], dtype=float32)

    """
    condlist = [numpy.array([x], dtype=t.BoolDType) for x in range(0)]

    for threshold in thresholds:
        condlist.append(input <= threshold)  # noqa: PERF401

    choicelist = [numpy.array([x], dtype=t.FloatDType) for x in range(0)]

    for choice in choices:
        choicelist.append(numpy.array([choice], dtype=t.FloatDType))  # noqa: PERF401

    if len(condlist) == len(choicelist) - 1:
        # If a choice is provided for input > highest threshold, last condition
        # must be true to return it.
        condlist.append(numpy.array([True], dtype=t.BoolDType))

    if len(condlist) != len(choicelist):
        msg = (
            "'apply_thresholds' must be called with the same number of "
            "thresholds than choices, or one more choice."
        )
        raise ValueError(msg)

    array = numpy.select(condlist, choicelist)

    for i in range(array.size):
        result[i] = array[i]
```
It has one big feature that I wasn't expecting: being able to define native numpy ufuncs and generalized ufuncs (the example above is a gufunc).
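For comparison, a plain-numpy equivalent of what the gufunc computes, without the typing and fixed-size plumbing that numba requires, might look like this sketch (the function name is invented):

```python
import numpy as np

def apply_thresholds_np(input, thresholds, choices):
    # pick choices[i] for values below thresholds[i]; if one extra choice is
    # given, it is used for inputs above the highest threshold
    condlist = [input <= thr for thr in thresholds]
    if len(condlist) == len(choices) - 1:
        condlist.append(np.full_like(input, True, dtype=bool))
    if len(condlist) != len(choices):
        raise ValueError(
            "apply_thresholds_np needs as many thresholds as choices, or one fewer"
        )
    return np.select(condlist, choices)
```

Everything the gufunc version adds on top of this, the explicit dtypes, the preallocated `result` array, the signature string, is what lets numba compile it ahead of calling.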

bonjourmauko commented 1 month ago

It is not that bad: the logic looks like C, but it doesn't impose a DSL the way Cython does.

And the compilation is jit, so we don't have to worry about distributing compiled binaries.

I'd actually pick one bottleneck method, fully type it, make it pure, and try how it goes with numba.

I remember @guillett @sandcha and @Morendil studied the bottlenecks.

For general optimisations (those beyond extending numpy), I'd rather rewrite them in Rust with Python bindings.