Closed OxDuke closed 2 years ago
Dear @OxDuke,
Thanks a lot for your feedback. It is pretty helpful for us to improve hpp-fcl.
At first glance, your benchmark is really simple, maybe a bit too simple to really mean something. If you look at the benchmarks done in the RSS paper, they are run over thousands of random configurations. In addition, we compare not only the timings but also the number of iterations between GJK and its accelerated versions. What is sure from our side is that there are a lot of memory allocations within hpp-fcl, which we in fact inherit from the original project and are aiming to remove.
From a benchmarking perspective, one should decouple the algorithms from their implementations. It would be nice if you could provide the benchmarks against libccd and OpenGJK. Certainly, their implementations are much more tailored than ours.
To move forward, could you share your entire code?
Thanks in advance, Justin
Hello @OxDuke,
Thank you for taking the time to make these comparisons! As @jcarpent wrote, it would be great if you could share your entire code so that we can make fair comparisons with the results presented in the paper. A few things could explain why Nesterov-accelerated GJK is worse than GJK in the case of the code snippet above:
1. The line cvx1_transform.translation() << 0, 0, 1.5; may mean that the distance separating the shapes is large. In the paper, we explain that the Nesterov acceleration really benefits GJK when the shapes are in close proximity. When the shapes are far from one another, GJK would likely not be called at all, as it is too expensive: the goal of the broadphase is to make sure we don't check intersection / don't compute the distance of distant objects. As explained in the above table's caption, the results in the table are obtained when the objects are close to one another.
2. In the paper's benchmarks, gjk.evaluate(...) is called directly: the memory allocation of the Minkowski difference as well as the initial support/initial GJK guess were removed. This memory allocation probably increases the running time compared to the timings found in the table.
3. The default initial guess is (1, 0, 0), which is a bad initial guess for both GJK and Nesterov-accelerated GJK. In the paper, we explain that we use init_guess = c1 - c2, where c1 and c2 are the centers of the objects' AABBs.
Regarding the performance of HPPFCL compared to libccd, could you share with us which libccd function you used to compare to HPPFCL? For example, libccd only does the boolean collision check when ccdGJKIntersect is called. If I remember correctly, the user needs to specify to ccd if they want to compute the distance between the shapes.
In the snippet above, GJK is used to compute the distance between the two shapes. It thus goes beyond the boolean collision check, which increases the execution time. To use HPPFCL's GJK in boolean checking mode, you need to call gjk.setDistanceEarlyBreak(0.).
Best, Louis
Another remark regarding the convergence criterion of GJK.
Currently the default of HPPFCL is GJKConvergenceCriterion::VDB. We kept this default so as not to break HPPFCL users' code.
On the other hand, to compute distances between shapes, OpenGJK uses the Frank-Wolfe duality gap as a convergence criterion, which is different from the VDB criterion; this duality-gap criterion is mentioned in the paper and used for the experiments.
This corresponds to setting gjk.convergence_criterion = GJKConvergenceCriterion::DualityGap and gjk.convergence_criterion_type = GJKConvergenceCriterionType::Absolute.
The convergence criteria impact the running time as they give different guarantees of optimality. The advantage of the absolute duality-gap criterion is that if a tolerance of eps is required from GJK, then the 2-norm distance to the optimal solution is guaranteed to be less than sqrt(eps) when GJK has converged.
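For intuition on where the sqrt(eps) bound comes from, here is a short derivation sketch using standard Frank-Wolfe arguments (the notation is mine, not necessarily the paper's):

```latex
% GJK minimizes f(x) = \|x\|^2 over the Minkowski difference D,
% with support point s(x) = \arg\min_{y \in D} \nabla f(x)^\top y.
% The Frank-Wolfe duality gap upper-bounds the suboptimality:
g(x) \;=\; \nabla f(x)^\top \big(x - s(x)\big) \;\ge\; f(x) - f(x^\star).
% Since f is 2-strongly convex and \nabla f(x^\star)^\top (x - x^\star) \ge 0
% at the optimum x^\star, we get
\|x - x^\star\|^2 \;\le\; f(x) - f(x^\star) \;\le\; g(x).
% So stopping when g(x) \le \epsilon guarantees \|x - x^\star\| \le \sqrt{\epsilon}.
```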
Hi @OxDuke,
I took the script you gave us above and created a quick benchmark to compare Nesterov-accelerated GJK and GJK on two shapes, similar to what is done in the RSS paper. The task in this benchmark is boolean collision checking, similar to what you would find in libccd. You can find the code below.
I used meshes from the example-robot-data repository, more specifically meshes from the Panda robot. There are two kinds of meshes in this repo: collision and visual meshes. Visual meshes contain more vertices, so you can change the code below to increase the number of vertices and see how the algos behave.
To benchmark both algos, the key is to:
- give both solvers the same maximum number of iterations, tolerance, and convergence criterion;
- set the distance early break to 0, so that both solvers do boolean collision checking;
- control the separating distance between the shapes, and average the timings over many random relative poses.
I tested this code on a machine with an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz and 64GB of RAM. I am running Manjaro; the code is compiled in Release mode, using clang-13. I get the following results:
Num vertices shape 1: 102
Num vertices shape 2: 152
---- Distance: -0.1 meters ----
Mean running time (in us)
GJK: 0.832189
Nesterov: 0.985011
---- Distance: -0.01 meters ----
Mean running time (in us)
GJK: 0.911282
Nesterov: 0.909
---- Distance: -0.001 meters ----
Mean running time (in us)
GJK: 1.1407
Nesterov: 0.828814
---- Distance: -0.0001 meters ----
Mean running time (in us)
GJK: 1.17485
Nesterov: 0.833695
---- Distance: 0.0001 meters ----
Mean running time (in us)
GJK: 1.07502
Nesterov: 0.601499
---- Distance: 0.001 meters ----
Mean running time (in us)
GJK: 0.852024
Nesterov: 0.474011
---- Distance: 0.01 meters ----
Mean running time (in us)
GJK: 0.408068
Nesterov: 0.330122
---- Distance: 0.1 meters ----
Mean running time (in us)
GJK: 0.290783
Nesterov: 0.294872
---- Distance: 1 meters ----
Mean running time (in us)
GJK: 0.288546
Nesterov: 0.295541
In conclusion, I obtain the same results as in the paper: when the shapes are in close proximity or shallowly overlapping, the Nesterov acceleration of GJK significantly improves over GJK. When the shapes are distant (1 meter) or have a large overlap (10 cm), GJK is competitive with or better than the Nesterov acceleration. Our goal with HPPFCL is to exploit the Nesterov momentum in the cases where it works best. These cases are important in physics simulation, especially when dealing with contact scenarios such as those occurring in manipulation tasks.
Regarding the absolute execution time, I find that, for boolean collision checks, HPPFCL is well below 2.5 microsec when the shapes are 1 meter apart (I suppose this distance corresponds to the translation in the code snippet you provided above). In fact, I find that the mean running time is actually 0.3 microsec. This is of course impacted by the number of vertices in the meshes. However, even with the visual meshes, I find a running time of 0.5 microsec for a distance of 1 meter.
Here is the code for the quick benchmark:
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <hpp/fcl/narrowphase/gjk.h>
#include <hpp/fcl/shape/convex.h>
#include <hpp/fcl/shape/geometric_shapes.h>
#include "hpp/fcl/mesh_loader/loader.h"
#include "hpp/fcl/BV/AABB.h"
#include "hpp/fcl/BVH/BVH_model.h"
#include "utility.h"

using hpp::fcl::support_func_guess_t;
using hpp::fcl::Transform3f;
using hpp::fcl::Vec3f;
using hpp::fcl::FCL_REAL;
using hpp::fcl::NODE_TYPE;
using hpp::fcl::BV_AABB;
using hpp::fcl::MeshLoader;
using hpp::fcl::BVHModelPtr_t;
using hpp::fcl::ConvexBase;
using hpp::fcl::GJKVariant;
using hpp::fcl::GJKConvergenceCriterion;
using hpp::fcl::GJKConvergenceCriterionType;
using hpp::fcl::Timer;
using hpp::fcl::details::GJK;
using hpp::fcl::details::MinkowskiDiff;

std::shared_ptr<ConvexBase> loadConvexShape(const std::string& file_name) {
  NODE_TYPE bv_type = BV_AABB;
  MeshLoader loader(bv_type);
  BVHModelPtr_t bvh1 = loader.load(file_name);
  bool _ = bvh1->buildConvexHull(true, "Qt");
  return bvh1->convex;
}

std::vector<Transform3f> generateRandomTransforms(int n_transfos) {
  FCL_REAL extents[] = {-1., -1., -1., 1., 1., 1.};
  std::vector<Transform3f> transforms;
  generateRandomTransforms(extents, transforms, static_cast<size_t>(n_transfos));
  // The translations are set such that the shapes are not in contact.
  // This allows us to control the distance between the shapes afterwards.
  for (size_t i = 0; i < transforms.size(); i++) {
    Vec3f new_translation = transforms[i].getTranslation();
    new_translation.noalias() = 3 * new_translation / new_translation.norm();
    transforms[i].setTranslation(new_translation);
  }
  return transforms;
}

FCL_REAL evalSolver(GJK& solver, const ConvexBase* shape1, const ConvexBase* shape2,
                    std::vector<Transform3f>& transforms, FCL_REAL distance) {
  // -- Pose shape 1
  Transform3f tf1 = Transform3f::Identity();
  // -- Minkowski difference and init guess/support
  MinkowskiDiff mink_diff;
  Vec3f init_guess;
  support_func_guess_t init_support_guess;
  Timer timer;
  // Solver used to separate the shapes at `distance`
  GJK separator(128, 1e-8);
  separator.convergence_criterion = GJKConvergenceCriterion::DualityGap;
  separator.convergence_criterion_type = GJKConvergenceCriterionType::Absolute;
  // -- Loop through all the poses
  FCL_REAL mean_time = 0.;
  FCL_REAL n_problem_solved = 1.;
  for (size_t i = 0; i < transforms.size(); i++) {
    Transform3f T = transforms[i];
    mink_diff.set(shape1, shape2, tf1, T);
    init_guess.noalias() = shape1->aabb_local.center() -
        (mink_diff.oR1 * shape2->aabb_local.center() + mink_diff.ot1);
    init_support_guess.setZero();
    GJK::Status status = separator.evaluate(mink_diff, init_guess, init_support_guess);
    // This allows us to set the distance that separates the shapes
    if (status == GJK::Valid && separator.ray.norm() > 0) {
      Vec3f normal = separator.ray / separator.ray.norm();
      Vec3f dt = separator.ray - distance * normal;
      Vec3f new_translation = T.getTranslation() + dt;
      T.setTranslation(new_translation);
      // We record the execution time
      FCL_REAL mean_time_problem = 0.;
      for (size_t j = 0; j < 100; j++) {
        timer.stop();
        timer.start();
        mink_diff.set(shape1, shape2, tf1, T);
        init_guess.noalias() = shape1->aabb_local.center() -
            (mink_diff.oR1 * shape2->aabb_local.center() + mink_diff.ot1);
        init_support_guess.setZero();
        status = solver.evaluate(mink_diff, init_guess, init_support_guess);
        timer.stop();
        mean_time_problem += timer.elapsed().user;
      }
      mean_time_problem /= 100;
      mean_time += mean_time_problem;
      n_problem_solved += 1;
    }
  }
  mean_time /= n_problem_solved;
  return mean_time;
}

int main() {
  const std::string& file_name_1 = "/home/louis/dev/control/example-robot-data/robots/panda_description/meshes/collision/link0.stl";
  // const std::string& file_name_1 = "/home/louis/dev/control/example-robot-data/robots/panda_description/meshes/visual/link0.dae"; // these visual meshes have more vertices
  std::shared_ptr<ConvexBase> shape1 = loadConvexShape(file_name_1);
  std::cout << "Num vertices shape 1: " << shape1->num_points << std::endl;
  shape1->computeLocalAABB();

  const std::string& file_name_2 = "/home/louis/dev/control/example-robot-data/robots/panda_description/meshes/collision/link1.stl";
  // const std::string& file_name_2 = "/home/louis/dev/control/example-robot-data/robots/panda_description/meshes/visual/link1.dae"; // these visual meshes have more vertices
  std::shared_ptr<ConvexBase> shape2 = loadConvexShape(file_name_2);
  std::cout << "Num vertices shape 2: " << shape2->num_points << std::endl;
  shape2->computeLocalAABB();

  // -- Generate random poses
  int n_transfos = 500;
  std::vector<Transform3f> transforms = generateRandomTransforms(n_transfos);

  // -- Solvers
  const unsigned int max_iterations = 128;
  const hpp::fcl::FCL_REAL tolerance = 1e-8;
  GJK gjk(max_iterations, tolerance);
  gjk.setDistanceEarlyBreak(0.); // for boolean collision checking
  gjk.gjk_variant = GJKVariant::DefaultGJK;
  gjk.convergence_criterion = GJKConvergenceCriterion::DualityGap;
  gjk.convergence_criterion_type = GJKConvergenceCriterionType::Absolute;

  GJK nesterov(max_iterations, tolerance);
  nesterov.gjk_variant = GJKVariant::NesterovAcceleration;
  nesterov.setDistanceEarlyBreak(0.); // for boolean collision checking
  nesterov.convergence_criterion = GJKConvergenceCriterion::DualityGap;
  nesterov.convergence_criterion_type = GJKConvergenceCriterionType::Absolute;

  // -- Distances between shapes
  std::vector<FCL_REAL> distances = {-0.1, -0.01, -0.001, -0.0001,
                                     0.0001, 0.001, 0.01, 0.1, 1.};

  // -- Loop through all the distances
  for (size_t d = 0; d < distances.size(); d++) {
    FCL_REAL mean_time_gjk = evalSolver(gjk, shape1.get(), shape2.get(),
                                        transforms, distances[d]);
    FCL_REAL mean_time_nesterov = evalSolver(nesterov, shape1.get(), shape2.get(),
                                             transforms, distances[d]);
    std::cout << "---- Distance: " << distances[d] << " meters ----" << std::endl;
    std::cout << "Mean running time (in us)" << std::endl;
    std::cout << "    GJK: " << mean_time_gjk << std::endl;
    std::cout << "    Nesterov: " << mean_time_nesterov << "\n" << std::endl;
  }
}
Best, Louis
Here are the results I get for the visual meshes (they need to be uncommented in the code above):
Num vertices shape 1: 499
Num vertices shape 2: 1590
---- Distance: -0.1 meters ----
Mean running time (in us)
GJK: 2.09625
Nesterov: 2.53351
---- Distance: -0.01 meters ----
Mean running time (in us)
GJK: 2.192
Nesterov: 2.0687
---- Distance: -0.001 meters ----
Mean running time (in us)
GJK: 2.64804
Nesterov: 1.73103
---- Distance: -0.0001 meters ----
Mean running time (in us)
GJK: 3.07247
Nesterov: 1.71276
---- Distance: 0.0001 meters ----
Mean running time (in us)
GJK: 2.69307
Nesterov: 1.29568
---- Distance: 0.001 meters ----
Mean running time (in us)
GJK: 1.92687
Nesterov: 0.841321
---- Distance: 0.01 meters ----
Mean running time (in us)
GJK: 0.820534
Nesterov: 0.588629
---- Distance: 0.1 meters ----
Mean running time (in us)
GJK: 0.51323
Nesterov: 0.527296
---- Distance: 1 meters ----
Mean running time (in us)
GJK: 0.513308
Nesterov: 0.531278
Thank you very much for your timely response. I am still cleaning up my code so it might take a while to upload. Since Louis has showcased his benchmarks, I am now trying to reproduce his results.
Hi guys, I have redone my benchmarks. My latest results do match what was claimed in the RSS 2022 paper. Thank you for your support!
I compared four solvers: OpenGJK, libccd (MPR), hpp-fcl's default GJK, and hpp-fcl's Nesterov-accelerated GJK.
For simplicity, benchmarks were done on two convex shapes, which have 532 and 341 vertices, respectively. Since I also want to measure the complexity against the number of vertices, a series of mesh simplifications was applied to each of the two shapes. Shape1 was simplified to 35, 68, 134, 267, and 532 (unsimplified) vertices. Shape2 was simplified to 22, 44, 86, 171, and 341 (unsimplified) vertices. Each pair of the above-mentioned convex shapes was used for benchmarks.
Shape1 with five levels of simplification:
Shape2 with five levels of simplification:
There are 6 sets of benchmarks, each with a different separating distance between two shapes.
The hardware I used was a PC with an Intel® Core™ i7-8700K CPU @ 3.70GHz × 12 and 32GB of memory. The system is Ubuntu 20.04.3 LTS.
The results are shown in the figure below:
There are four big columns in the figure above; each shows the results from one of the four solvers, whose name is printed on the first row. In each big column, there are 6 tables showing timing results (in microseconds), one per separating distance. The header of each table has the format:
ID=xxx, Distance=xxx, Time in us
The ID does not matter (it is for my own use only); Distance stands for the separating distance between the two shapes.
I am happy to explain further if I have not described my method and results clearly enough.
@OxDuke Thanks for the nice report. We are very happy that you've been able to reproduce our results.
Could you share your code so that we can enhance it and push for better performance in the near future?
No problem, please give me a few days to clean up my code.
@OxDuke Your raw code should already be enough, I would say. There is no need to spend extra time on it. If you have a shared repository, we can have a look and then integrate it into our benchmark pipeline.
Is it fine for you?
I am sorry, but I am not able to upload my raw code. Part of my code runs on my company's codebase, so for commercial reasons I have to rewrite that part in my spare time. Thanks for your understanding.
OK. No problem. We totally understand. Best, Justin
@OxDuke Any update on the code sharing?
Background
I come from the RSS 2022 paper, and I think your work is interesting. However, my preliminary benchmark results showed that hpp-fcl is not as fast as OpenGJK and libccd. I was computing the separating distance between two convex polytopes, one with over 500 vertices, the other with around 350 vertices.
OpenGJK took around 1us using a linear-time support function (so it could potentially be much faster with a log-time support function); libccd (which uses MPR) took around 1.5us; hpp-fcl's default GJK took 2.5us, and GJK-Nesterov took 3us.
I used Google's benchmark library to record the times mentioned above.
My result contradicts what was claimed in #313 and the performance benchmark in the paper:
Possible Reasons
Hardware issues.
The RSS paper did not give any details on the hardware used; I wish they could be disclosed here.
What I am using is a PC with an Intel® Core™ i7-8700K CPU @ 3.70GHz × 12 and 32GB of memory. The system is Ubuntu 20.04.3 LTS.
My implementation is wrong.
I paste a simplified version of my code here for reference. My implementation largely follows the test files. If needed, I can provide all my source files.
I would be most grateful if someone could help me achieve the performance claimed in the paper.