opensim-org / opensim-core

SimTK OpenSim C++ libraries and command-line applications, and Java/Python wrapping.
https://opensim.stanford.edu
Apache License 2.0
801 stars, 324 forks

OpenSim IK slow on Linux #3854

Open davidpagnon opened 4 months ago

davidpagnon commented 4 months ago

Note: also posted on the forum.

Hi,

I installed the opensim conda package, and I noticed that inverse kinematics is much slower on my Ubuntu server than on my laptop. This is annoying, since I need to process a massive amount of data that I cannot download on my computer.

Do you have any idea why, and how to speed it up?

Here are my specs:

LAPTOP:

Windows 11
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz 2.59 GHz
RAM: 32 GB
python -V = 3.9.19
opensim.version = 4.5

SERVER:

Ubuntu 22.04.4 LTS
CPUs: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 of them)
GPUs: GeForce GTX 1080 Ti, GeForce GTX 1070, GeForce RTX 2080
RAM: 251Gi Mem, 251Gi Swap
python -V = 3.11.9
opensim.version = 4.5.1

My laptop is about 6-7 times faster than the server on a simple IK task. I first tried with Python 3.8 and OpenSim 4.4.1, with the same results.

Thank you in advance!

nickbianco commented 3 months ago

Hi @davidpagnon, based on a quick CPU benchmark comparison, your laptop CPU seems to have better performance on a single thread compared to your Linux server. Inverse kinematics in OpenSim is single-threaded, so you could take advantage of more threads on the Linux server.
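Since each IK run is single-threaded, one way to use the server's extra cores is to launch several independent trials at once. Below is a minimal sketch, not an official OpenSim workflow. It assumes the `opensim-cmd` executable is on the PATH and that each trial has its own IK setup XML file; `build_command`, `run_trials`, and the file names are hypothetical names used for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(setup_file, tool=("opensim-cmd", "run-tool")):
    """Return the command line that runs one IK setup file."""
    return [*tool, str(setup_file)]


def run_trials(setup_files, max_workers=4, tool=("opensim-cmd", "run-tool")):
    """Run one external process per setup file, at most max_workers at a time.

    Threads are enough here: each thread only waits on its own subprocess,
    and the actual IK work happens in those separate processes.
    Returns a dict mapping each setup file to its process return code.
    """
    def run_one(setup_file):
        result = subprocess.run(
            build_command(setup_file, tool),
            capture_output=True,
            text=True,
        )
        return str(setup_file), result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, setup_files))


# Example (hypothetical file names):
# codes = run_trials(["trial01_ik_setup.xml", "trial02_ik_setup.xml"], max_workers=8)
# failed = [name for name, code in codes.items() if code != 0]
```

Keeping `max_workers` at or below the number of physical cores is a reasonable starting point, since every trial keeps one core busy.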

davidpagnon commented 3 months ago

Thanks @nickbianco for your answer! Do you think it would make it 7 times (or more, really) slower than on my laptop?

Like I said on the forum, I have the feeling there could be another cause:

However, it does not seem like OpenBLAS accelerates linear algebra as much (computation speed is about 7 times slower). More information there: https://numpy.org/install/#numpy-packages--accelerated-linear-algebra-libraries

Out of curiosity, what prevents you from using MKL?

nickbianco commented 3 months ago

I'm not sure the CPU benchmarks fully explain the 7x speed difference, but regardless, it would be good to have a sense of how much faster/slower the machines are with a 1-to-1 comparison.
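A 1-to-1 comparison could be as simple as timing the exact same IK run (same model, same marker file, same setup) on both machines. Here is a small sketch of such a timing helper; `time_call` is a hypothetical name, and the commented-out OpenSim call assumes a setup file called `ik_setup.xml`.

```python
import statistics
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Example with OpenSim (assumes the opensim package and an existing setup file):
# import opensim as osim
# seconds = time_call(lambda: osim.InverseKinematicsTool("ik_setup.xml").run())
# print(f"median IK time: {seconds:.2f} s")
```

Using the median of a few repeats makes the number less sensitive to a one-off slow run, for example the first run after a cold disk cache.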

The speed differences shouldn't have anything to do with NumPy: all the inverse kinematics code in OpenSim runs in C++.

In OpenSim 4.5.1, we upgraded to a more recent version of Ipopt, which is more likely to explain the differences, but I still wouldn't expect a 7x slowdown.

aymanhab commented 3 months ago

Let me clarify that, as @nickbianco suggested, the IK layer/solver executes entirely in the C++ stack, using Simbody's own "separate copy" of Ipopt, and as such is very unlikely to be affected in any way by using OpenBLAS or MKL, or by upgrading the Ipopt stack.

Unrelated, but possibly relevant: we noticed that having constraints in the model triggers a projection workflow, which is significantly slower, so you may want to disable constraints that are unused in IK. This has little to do with the numerical libraries, though.
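If some constraints really are unused in IK, one way to try this without touching the original model is to write a modified copy of the .osim file. The sketch below is only an illustration: it assumes each constraint sits under `<ConstraintSet><objects>` and carries an `<isEnforced>` element, as OpenSim 4.x model files appear to do (older files may use a different tag). `disable_constraints` and the constraint names are hypothetical.

```python
import xml.etree.ElementTree as ET


def disable_constraints(osim_xml, names):
    """Return (modified XML text, list of constraint names that were switched off).

    Assumption: each constraint is a child of <ConstraintSet><objects> and
    stores its flag in an <isEnforced> element.
    """
    root = ET.fromstring(osim_xml)
    disabled = []
    for constraint_set in root.iter("ConstraintSet"):
        for constraint in constraint_set.findall("./objects/*"):
            if constraint.get("name") in names:
                flag = constraint.find("isEnforced")
                if flag is None:
                    # Add the flag if the file relies on the default value.
                    flag = ET.SubElement(constraint, "isEnforced")
                flag.text = "false"
                disabled.append(constraint.get("name"))
    return ET.tostring(root, encoding="unicode"), disabled


# Example (hypothetical file and constraint names):
# with open("model.osim") as f:
#     new_xml, changed = disable_constraints(f.read(), {"patellofemoral_knee_angle_r_con"})
# with open("model_ik.osim", "w") as f:
#     f.write(new_xml)
```

Comparing IK timings with the original and the modified model would show how much of the slowdown the constraints account for.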

davidpagnon commented 3 months ago

Okay, thanks for your clarification, so "nomkl" is unlikely to have anything to do with the speed difference. I do have constraints in my model, so that could explain it; but I cannot really disable them.

So I don't see any obvious solution, but I guess I do not need it to run lightning fast. Thank you for your help!

halleysfifthinc commented 3 months ago

unlikely to be affected in any way by either using openblas, mkl

I have actually observed performance differences between BLAS implementations. Not on the order of a 7x difference, but MKL can be faster than OpenBLAS. Potentially relevant: OpenBLAS internally parallelizes matrix ops (I believe MKL does too, but not the generic Ubuntu libblas). When running many separate OpenSim IK trials in parallel, I noticed that the OpenBLAS parallelization was slightly slower than the generic Ubuntu libblas, but more importantly, it oversubscribed my CPU and slowed down all the independent instances of OpenSim running IK. It might be worth checking whether OpenBLAS in Python exposes the API for setting the number of BLAS threads.
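One common way to cap BLAS threads from Python is through environment variables, set before the library is loaded. The sketch below uses `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, and `OMP_NUM_THREADS`, which OpenBLAS, MKL, and OpenMP runtimes read at load time. Whether the BLAS linked into a given OpenSim build honors them is an assumption to verify; `limit_blas_threads` is a hypothetical helper name.

```python
import os

# Variables read at load time by OpenBLAS, MKL, and OpenMP-based builds.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(n=1, env=None):
    """Set the BLAS thread-count variables to n in env (default: os.environ)."""
    env = os.environ if env is None else env
    for var in BLAS_THREAD_VARS:
        env[var] = str(n)
    return env


# Call this before importing opensim (or numpy), because the libraries
# typically read these variables only once, when they are first loaded.
limit_blas_threads(1)
# import opensim as osim
```

With one BLAS thread per process, running N independent IK trials in parallel should keep roughly N cores busy instead of oversubscribing the CPU.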