libprima / prima

PRIMA is a package for solving general nonlinear optimization problems without using derivatives. It provides the reference implementation for Powell's derivative-free optimization methods, i.e., COBYLA, UOBYQA, NEWUOA, BOBYQA, and LINCOA. PRIMA means Reference Implementation for Powell's methods with Modernization and Amelioration, P for Powell.
http://libprima.net
BSD 3-Clause "New" or "Revised" License

Possible performance regression / potential optimisation #130

Open desal opened 6 months ago

desal commented 6 months ago

I've been looking at dropping in prima as a replacement for dlib's BOBYQA implementation. I've noticed that prima is significantly slower for a comparable number of evaluations. I can appreciate that the time spent in the optimiser itself is usually small compared to the cost of evaluating the objective function, and that correctness/maintainability of the code is worth paying some performance penalty. However, in my use case the overhead (i.e. time spent in the optimiser outside of the objective function) went from ~5% to ~50%. It's also not a strictly fair comparison, as the two aren't bug-for-bug compatible and will have differences in stopping conditions, path dependency, etc.

I'm not sure if this is a priority right now (my use case may be somewhat extreme in how fast the objective function runs), but at least for me, closing this gap would mean I could start to consider prima as a default choice for any problem.

Here's a (rather contrived) quick and dirty example illustrating the difference:

#include <cstdio>
#include <cmath>
#include <array>
#include <random>
#include <chrono>

#include "prima/prima.h"

double fun_time = 0.0;

static void fun(const double x[], double *f, const void *data) {
    auto start = std::chrono::steady_clock::now();
    const double* const y = (const double*)data;
    *f = 0.0;
    for (int i = 0; i < 10; ++i) {
        *f += std::abs(x[i] - y[i]);
    }
    auto tdiff = std::chrono::steady_clock::now() - start;
    fun_time += tdiff.count();
}

int main(int argc, char * argv[])
{
    (void)argc;
    (void)argv;

    std::mt19937 gen(1977);
    std::uniform_real_distribution<> dis(0.0, 1.0);

    const int n = 10;
    std::array<double, 10> x;
    std::array<double, 10> y;
    std::array<double, 10> xl;
    std::array<double, 10> xu;
    int total_nf = 0;
    double total_f = 0.0;
    double t = 0.0;
    for (int i = 0; i < 1000; ++i) {
        for (int j = 0; j < n; ++j) {
            x[j] = dis(gen);
            y[j] = dis(gen);
            xl[j] = x[j] - 0.1;
            xu[j] = x[j] + 0.1;
        }
        double f = 0.0;
        const double rhobeg = 0.01;
        const double rhoend = 0.0001;
        const double ftarget = 0.0;
        const int iprint = PRIMA_MSG_NONE;
        const int maxfun = 1000;
        const int npt = 2*n+1;
        int nf = 0;
        void *data = (void*) y.data();
        auto start = std::chrono::steady_clock::now();
        const int rc = prima_bobyqa(&fun, data, n, x.data(), &f, xl.data(), xu.data(), &nf, rhobeg, rhoend, ftarget, maxfun, npt, iprint);
        (void)rc;
        auto tdiff = std::chrono::steady_clock::now() - start;
        total_f += f;
        total_nf += nf;
        t += tdiff.count();
    }
    printf("total = %g over %d in %gs (%gms in fun, %gns per eval)\n",
        total_f, total_nf, t*1e-9, fun_time*1e-6, fun_time/double(total_nf));
    return 0;
}

and with dlib:

#include <dlib/optimization.h>
#include <random>
#include <cstdio>
#include <chrono>

using namespace dlib;

typedef matrix<double,0,1> column_vector;

int main() {
    mt19937 gen(1977);
    std::uniform_real_distribution<> dis(0.0, 1.0);
    column_vector x(10);
    column_vector y(10);
    column_vector xl(10);
    column_vector xu(10);
    int N = 10;
    int total_nf = 0;
    double total_f = 0.0;
    double t = 0.0;
    double fun_time = 0.0;

    for (int i = 0; i < 1000; ++i) {
        for (int j = 0; j < 10; ++j) {
            x(j) = dis(gen);
            y(j) = dis(gen);
            xl(j) = x(j) - 0.1;
            xu(j) = x(j) + 0.1;
        }

        auto fun = [&](const column_vector& z) {
            auto start = std::chrono::steady_clock::now();
            double f = 0.0;
            for (int j = 0; j < 10; ++j)
                f += std::abs(z(j) - y(j));
            total_nf += 1;
            auto tdiff = std::chrono::steady_clock::now() - start;
            fun_time += tdiff.count();
            return f;
        };

        auto start = std::chrono::steady_clock::now();
        total_f += find_min_bobyqa(fun,
                        x,
                        2*N+1,  // number of interpolation points
                        xl,     // lower bound constraint
                        xu,     // upper bound constraint
                        0.01,   // initial trust region radius
                        0.0001, // stopping trust region radius
                        1000    // max number of objective function evaluations
        );
        auto tdiff = std::chrono::steady_clock::now() - start;
        t += tdiff.count();
    }

    printf("total = %g over %d in %gs (%gms in fun, %gns per eval)\n",
        total_f, total_nf, t*1e-9, fun_time*1e-6, fun_time/double(total_nf));
}

which gives me the following results:

prima:
  total = 2432.71 over 100093 in 2.02466s (2.4082ms in fun, 24.0596ns per eval)
dlib:
  total = 2432.2 over 97556 in 0.251892s (2.29799ms in fun, 23.5557ns per eval)  

I've done some basic tweaking of optimisation flags, but it didn't make much of a difference either way.

zaikunzhang commented 5 months ago

Hello @desal , thank you so much for your interest in adopting PRIMA in dlib, and for raising your concern.

Apologies for my long-delayed response.

A short answer to your question: PRIMA pays more attention to reducing the number of function evaluations and improving code clarity & simplicity, even if this comes at the cost of more floating-point operations inside the solvers.

Why is this reasonable? PRIMA is designed as a package for derivative-free optimization problems, where the cost is dominated by function evaluations, not by the floating-point operations inside the solvers. In these problems, each function evaluation may take several minutes, hours, or days, and hence it makes little sense to worry about shaving a few milliseconds of internal computation per iteration.

Of course, this does not mean that PRIMA does not care about the internal cost of solvers. This just means that there is a priority:

Reducing number of function evaluations >> improving code simplicity & clarity >> saving internal cost of solvers.

In your test, as you measured, each function evaluation takes an extremely short time. Therefore, the computing time is dominated by the internal cost of the solvers. However, this is normally not the case in practice --- or I should say, PRIMA is not designed for such a situation.

If your interest is in solving problems where the function evaluation is cheap, then Powell's solvers may not be the correct choice. If you decide to adopt Powell's solvers while remaining concerned about their internal cost, then there are two possibilities.

  1. Stay with the original F77 implementation (or its C translation) of the solvers, tolerating the fact that they are impossible to maintain and contain bugs.
  2. Consider modifying PRIMA to utilize BLAS for the underlying matrix-vector procedures (matprod, inprod, etc.; you only need to consider the procedures in fortran/common/linalg.f90). However, I am not sure whether such a version would be much more efficient (in terms of the internal costs of the solvers) than the current one.
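For illustration, option 2 could look something like the following C++ sketch (the `matprod` signature, the column-major layout, and the `PRIMA_USE_CBLAS` switch are my own assumptions, not PRIMA's actual code; the real procedures live in fortran/common/linalg.f90):

```cpp
#include <vector>
#ifdef PRIMA_USE_CBLAS
#include <cblas.h>
#endif

// Hypothetical C++ analogue of a linalg.f90-style matprod:
// returns y = A*x, with A stored column-major (m rows, n cols) as in Fortran.
std::vector<double> matprod(const std::vector<double>& A,
                            const std::vector<double>& x,
                            int m, int n) {
    std::vector<double> y(m, 0.0);
#ifdef PRIMA_USE_CBLAS
    // Delegate to an optimized BLAS when one is linked in.
    cblas_dgemv(CblasColMajor, CblasNoTrans, m, n,
                1.0, A.data(), m, x.data(), 1, 0.0, y.data(), 1);
#else
    // Plain reference loop computing the same column-major product.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += A[j * m + i] * x[j];
#endif
    return y;
}
```

Whether this helps depends on the matrix sizes involved; for the small dense matrices inside these solvers, BLAS call overhead can offset the gains.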

To elaborate further, I have posted a discussion. I hope it is not too long to read (the length partially explains the delay). I will be very glad to see your comments there.

zaikunzhang commented 5 months ago

Given the above comments, the results of your test are still a bit strange to me.

  1. What kind of flags did you provide to the compiler when compiling PRIMA?
  2. Did you by any chance set PRIMA_DEBUGGING to 1 in fortran/common/ppf.h? Such a setting will be expensive and should not be used in production.
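For readers checking the same thing: the debugging switch is a preprocessor flag in fortran/common/ppf.h, and a production build should have it off. An excerpt of what the setting looks like (the exact surrounding file contents may differ between PRIMA versions):

```c
/* fortran/common/ppf.h (excerpt; the value shown is the production
   setting). Setting PRIMA_DEBUGGING to 1 enables expensive internal
   consistency checks and should only be used for development. */
#define PRIMA_DEBUGGING 0
```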