Yes, it is. The easiest is to have one Data and GeomData object per thread.
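For illustration, here is a minimal sketch of the one-`Data`-per-task pattern from the Python bindings (the URDF path is illustrative; note that the Python GIL limits the actual speedup of this exact snippet, so the real parallel gains are obtained in C++ or with multiprocessing):

```python
from concurrent.futures import ThreadPoolExecutor
import pinocchio as pin

model = pin.buildModelFromUrdf("robot.urdf")  # illustrative path

def evaluate(q):
    # One Data object per task: Data stores all intermediate results,
    # so sharing a single instance across threads would cause data races.
    data = model.createData()
    pin.forwardKinematics(model, data, q)
    return data.oMi[model.njoints - 1]  # placement of the last joint

# Assumes the model has finite joint limits, so sampling is well defined.
qs = [pin.randomConfiguration(model) for _ in range(1000)]
with ThreadPoolExecutor(max_workers=8) as pool:
    placements = list(pool.map(evaluate, qs))
```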
@jmirabel thanks for the note! I wonder if this would give the same speed for 1000 GeomData objects vs 1 GeomData object?
For my use case, I will be using the same GeomData at 1000 different configurations.
> 1000 GeomData
?!? How many cores do you have?
> I wonder if this would give the same speed for 1000 GeomData objects vs 1 GeomData object?
My experience of multi-threading tells me that you will not be 1000 times faster. You can expect more cache misses with several cores running at the same time. I advise you to carefully benchmark your code. Sometimes running with 2 threads is actually slower...
I think @hzyjerry wants to do batch computations in a similar way to what is done in deep learning with GPUs. The short answer to @hzyjerry is that Pinocchio is not currently designed for batch computations, but it is well tailored for multi-CPU / multi-process parallel computations.
There is a roadmap for adapting Pinocchio to batch computations, but we are unfortunately lacking manpower for it.
@jcarpent yup that's the ideal function that will suit my use case :)
Closing the issue for now. It would be very exciting to see progress in this direction (hopefully someday I can also contribute).
@hzyjerry We would really appreciate your contribution then.
Hi Pinocchio team, do you guys have any idea regarding the speedup that one could expect from using batch simulation instead of parallel computing? In my experience, batch simulation is usually much faster on CPU for small matrices (robots with fewer than 20 DoFs). I know it would be true for GPUs, so I'm more interested in feedback regarding batching on CPU.
Can you describe what you call batch simulation?
Yes, sure: it consists in adding an extra dimension (corresponding to the ID of the robot) to every tensor used in the physics computations. This way, one can perform computations on those tensors directly, without looping over every individual system explicitly, in order to get the most out of the low-level linear algebra libraries. For instance, the position vector `(robot.nq, 1)` becomes a position matrix `(num_robots, robot.nq)`, and the mass matrix a 3D tensor `(num_robots, robot.nq, robot.nq)`. It is a misuse of language coming from the machine learning field.
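To make the idea concrete, here is a minimal NumPy-only sketch of the looped versus batched formulations (no Pinocchio API involved; all names and sizes are illustrative):

```python
import numpy as np

num_robots, nq = 1000, 20

# Batched quantities: one extra leading dimension indexed by the robot ID.
M = np.random.randn(num_robots, nq, nq)  # e.g. batched mass matrices
v = np.random.randn(num_robots, nq)      # e.g. batched velocities

# Looped formulation: one small matrix-vector product per robot.
tau_loop = np.stack([M[i] @ v[i] for i in range(num_robots)])

# Batched formulation: a single einsum over the whole batch, letting the
# low-level linear algebra backend vectorize across robots.
tau_batch = np.einsum('bij,bj->bi', M, v)

assert np.allclose(tau_loop, tau_batch)
```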
It is relevant for GPU indeed, because of the presence of threads. For CPU, it is less obvious: memory alignment matters.
I'm currently porting a part of Pinocchio to GPU... but there is still work to do. This won't require supporting the notion of a batch, with the main benefit of having code compatible with both GPU and CPU.
@duburcqa For which functions do you see a real gain from batching? By the way, collision detection would be hard to batch...
On collision detection, I second @jcarpent: it is hard to achieve. Yet, there are some things that can be done.
> For CPU, it is less obvious: memory alignment matters.

Does adding an extra dimension to tensors conflict with memory alignment?
> This won't require supporting the notion of a batch

What do you mean? Will it be able to leverage the computational power of a high-end GPU (e.g. an RTX 3090 or a Tesla A100) without batching, by using GPU multi-threading?
> For which functions do you see a real gain from batching?

I was thinking about speeding up time-consuming rigid body algorithms such as `forwardKinematics`, in order to simulate multiple identical and independent robots at once, as fast as possible, for reinforcement learning applications. Regarding things like collision computations, it would be awesome, but I was not expecting that much since it relies on other libraries.
> For CPU, it is less obvious: memory alignment matters.

> Does adding an extra dimension to tensors conflict with memory alignment?

Pinocchio is highly efficient because all the computations are predictable by the CPU, with a kind of continuous flow of operations. When working with a large batch, you will penalize this aspect, because you will work on larger memory segments, which might impact the cache and the memory loading.
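This is exactly the kind of trade-off worth measuring before committing to either design. A minimal micro-benchmark sketch, in pure NumPy with illustrative sizes:

```python
import timeit
import numpy as np

num_robots, nq = 1000, 20
M = np.random.randn(num_robots, nq, nq)
v = np.random.randn(num_robots, nq)

# Looped: many small products, each working on a small, cache-friendly block.
t_loop = timeit.timeit(
    lambda: [M[i] @ v[i] for i in range(num_robots)], number=100)

# Batched: one large operation streaming over a much bigger memory segment.
t_batch = timeit.timeit(
    lambda: np.einsum('bij,bj->bi', M, v), number=100)

print(f"looped: {t_loop:.4f} s, batched: {t_batch:.4f} s")
```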
> What do you mean? Will it be able to leverage the computational power of a high-end GPU (e.g. an RTX 3090 or a Tesla A100) without batching, by using GPU multi-threading?

It might not be needed to explicitly support the notion of batch in the standard code. You can be much smarter by indeed exploiting the GPU threads, but there is no magic here: we need to implement this dispatcher, and it is hard to make it generic. Don't you think so, @duburcqa?
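For reference, the machine-learning-side analogue of such a dispatcher is something like `jax.vmap`, which maps a single-robot function over a batch and lets the backend schedule the GPU threads. A toy sketch with a hand-written 2-link planar forward kinematics (not Pinocchio):

```python
import jax
import jax.numpy as jnp

LINK1, LINK2 = 0.5, 0.3  # link lengths of a toy 2-link planar arm

def end_effector(q):
    # Forward kinematics of a single robot: q has shape (2,).
    x = LINK1 * jnp.cos(q[0]) + LINK2 * jnp.cos(q[0] + q[1])
    y = LINK1 * jnp.sin(q[0]) + LINK2 * jnp.sin(q[0] + q[1])
    return jnp.stack([x, y])

# vmap turns the single-robot function into a batched one; jit compiles it,
# so that on a GPU backend the batch is dispatched across GPU threads.
batched_fk = jax.jit(jax.vmap(end_effector))

qs = jnp.zeros((1000, 2))    # 1000 robots, 2 DoFs each
print(batched_fk(qs).shape)  # (1000, 2)
```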
> I was thinking about speeding up time-consuming rigid body algorithms such as `forwardKinematics`, in order to simulate multiple identical and independent robots at once, as fast as possible, for reinforcement learning applications. Regarding things like collision computations, it would be awesome, but I was not expecting that much since it relies on other libraries.

Come on, `forwardKinematics` is one of the cheapest operations in Pinocchio (1 or 2 µs for 40-DoF legged robots).
You won't gain anything by using GPUs, due to the data transfer!!!
If you use the code generation support of Pinocchio, you can really decrease these timings by a factor of 5 or even 10.
> Regarding things like collision computations, it would be awesome, but I was not expecting that much since it relies on other libraries.

Pinocchio and FCL are closely related.
> Come on, `forwardKinematics` is one of the cheapest operations in Pinocchio (1 or 2 µs for 40-DoF legged robots).

That was just an example; basically it affects any rigid body algorithm. And regarding 1 µs for a 40-DoF system, it is still very slow compared to Nvidia Isaac Gym, which has a simulation-to-real-time ratio above 5e4 (1e-7 s to perform a single integration step, requiring 5 evaluations of the dynamics since it uses RK45 internally) for a system with 24 DoFs on a single GPU.
> If you use the code generation support of Pinocchio, you can really decrease these timings by a factor of 5 or even 10.

Yes, this is an interesting feature, but it is hard to take advantage of it in Python, since no JIT compiler support is provided for Pinocchio, and in reinforcement learning it is unusual to pre-compile C++ yourself for a specific application.
Yes, you can do JIT with Pinocchio !!!
> Yes, you can do JIT with Pinocchio !!!

Hum, very interesting. I was not aware of such a feature. I will have a look at it then.
> That was just an example; basically it affects any rigid body algorithm. And regarding 1 µs for a 40-DoF system, it is still very slow compared to Nvidia Isaac Gym, which has a simulation-to-real-time ratio above 5e4 (1e-7 s to perform a single integration step, requiring 5 evaluations of the dynamics since it uses RK45 internally).

@duburcqa You are a bit unfair. How many (talented) engineers are working behind the Nvidia simulator? On Pinocchio, we don't have such a task force to take advantage of the latest GPU cards and so on. We are doing our best with the money and time we have. Concerning their simulator, I guess they are not working with generalized coordinates.
Any contribution from you, @duburcqa, to speed up and robustify Pinocchio is really appreciated.
By the way, if you want to speed up your computations, you can leverage single-precision floating-point arithmetic.
I'm not criticizing Pinocchio. I think it is an amazing library, and I'm clearly willing to invest time, and eventually money, in it in the future if what I have in mind goes as intended. You guys surely did a great job. My point is that maybe there is room for speed improvement in the particular setting of "massively" parallel simulations, by taking advantage of vectorization for batch simulation, or of GPU threads, even for already very efficient algorithms.
I just wanted your help to assess the speedup one could expect from doing so in Pinocchio, along with the manpower it may require. I'm not complaining about anything, nor asking you to implement anything here. Hence my initial question: "do you guys have any idea regarding the speedup that one could expect".
Thanks for the clarification. For the moment, the initiative is to do code generation for GPUs, with the benefit of exploiting the thread capacities of GPUs. In my opinion, this will be much simpler than trying to recast the RBD algorithms for batch computation. But this is my only answer on exploiting the capacities of GPUs.
Thank you, this is insightful. I will try to look in this direction and see if I can come up with something in the next few months. It is not a priority for me now, since I have access to enough computational resources to fit my needs, but it will probably become a limitation for me at some point.
Yes, indeed, it is not mentioned explicitly. I tried to infer it myself, based on the comparison for "Learning Dexterous In-Hand Manipulation": it says that the whole learning process took 10 hours using Nvidia Isaac Gym on a single Nvidia A100 GPU, while it took OpenAI 30 hours to do the same. So it is 3 times faster. Then, the above-mentioned article states that "This setup allows us to generate about 2 years of simulated experience per hour.", corresponding to a simulation-to-real-time ratio of 17500. Next, it is mentioned that the integration step is 8 ms, so computing a single integration step takes them 4.5e-7 s if I'm not mistaken. Considering that Isaac is 3 times faster, it takes about 1.5e-7 s to perform a single integration step. Finally, according to the documentation of MuJoCo, they are using RK4 (not RK45 as I mentioned).
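For what it's worth, a back-of-envelope version of this arithmetic (numbers taken from the quotes above, purely illustrative):

```python
# "2 years of simulated experience per hour", an 8 ms integration step,
# and a factor 3 from the 10 h vs 30 h comparison.
hours_per_year = 365 * 24
sim_to_real_ratio = 2 * hours_per_year       # ~17500
steps_per_sim_second = 1 / 0.008             # 125 steps per simulated second
wall_time_per_step = 1 / (sim_to_real_ratio * steps_per_sim_second)
print(wall_time_per_step)      # ~4.6e-7 s per step (OpenAI setup)
print(wall_time_per_step / 3)  # ~1.5e-7 s per step (Isaac Gym)
```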
Thanks for your clarification. But without more details from Nvidia on their setup, and without reproducible code, I find it difficult to make any meaningful comparison. It is not clear that they used the same integration step as the one from the mentioned paper. They may also have used a different formulation for the simulation.
The same goes for MuJoCo. It provides a very impressive DDP for rigid body mechanical systems using a specific convexification of the contact model. We used it for 6 months and we did not find the right weights to make it work on our humanoid robot [1]. This convexification is particularly important to make the solver converge quickly. If you are interested in the "functional level" provided by MuJoCo, then you can have a look at Crocoddyl (https://github.com/loco-3d/crocoddyl). It took us quite some time to make it work on our humanoid robot, with several other ways to simplify the problem [2]. And what works on a robot such as HRP-2 is not directly applicable to any other robot.
For now, the only description I found of an Isaac component functionally equivalent to MuJoCo, or to Crocoddyl regarding trajectory optimization, is here: https://docs.nvidia.com/isaac/isaac/packages/planner_cost/doc/planner_cost.html And it is only a classical LQR.
Thanks again for the pointer to Isaac; the videos are pretty impressive. I would love to read a scientific paper giving more details on this framework.
[1] J. Koenemann, A. Del Prete, Y. Tassa, E. Todorov, O. Stasse, M. Bennewitz, N. Mansard, "Whole-body model-predictive control applied to the HRP-2 humanoid", IROS 2015. https://hal.archives-ouvertes.fr/hal-01137021/file/MuJoCo%20on%20HRP2.pdf
[2] R. Budhiraja, J. Carpentier, C. Mastalli, N. Mansard, "Differential Dynamic Programming for Multi-Phase Rigid Contact Dynamics", ICHR 2018.
If I understand the documentation correctly, the physics engine of Isaac Gym is either PhysX or the physics engine of Unity/fastsim/Omniverse. Then Pinocchio is more comparable to PhysX.
Thank you for the interesting feedback. Could you give us a reference on this particular point? I had a look at the Isaac Gym SDK and what I found was not as precise as what you wrote. And it looks like Isaac Gym provides many, many features.
> If I understand the documentation correctly, the physics engine of Isaac Gym is either PhysX or the physics engine of Unity/fastsim/Omniverse. Then Pinocchio is more comparable to PhysX.

I don't know much about this.
> Thanks for your clarification. But without more details from Nvidia on their setup, and without reproducible code, I find it difficult to make any meaningful comparison. It is not clear that they used the same integration step as the one from the mentioned paper. They may also have used a different formulation for the simulation.

I agree with you that the information provided by Nvidia is oriented towards increasing the hype around their framework, not really towards being informative, unfortunately. For now it is mainly advertising. I guess actual benchmarks will be available in the next few months. Yet, even though the comparison is not meaningful because of the lack of information, the overall performance is quite impressive (1 GPU vs 6000 CPU cores).
> The same goes for MuJoCo. It provides a very impressive DDP for rigid body mechanical systems using a specific convexification of the contact model.

Yes, I am very aware of their famous convexification procedure and how difficult it can be to make it work in practice. I'm quite impressed by MuJoCo on this point. If you are interested in yet another simulator, have a look at tiny-differentiable-simulator by Erwin Coumans. It is using an impulse LCP with Baumgarte stabilization for the contact model.
> If you are interested in the "functional level" provided by MuJoCo, then you can have a look at Crocoddyl (https://github.com/loco-3d/crocoddyl).

Yes, I know Crocoddyl too, but I'm not using it. I'm more interested in physics engines / simulation libraries than in optimal control libraries.
> I agree with you that the information provided by Nvidia is oriented towards increasing the hype around their framework, not really towards being informative, unfortunately. For now it is mainly advertising. I guess actual benchmarks will be available in the next few months. Yet, even though the comparison is not meaningful because of the lack of information, the overall performance is quite impressive (1 GPU vs 6000 CPU cores).

Yes, this is quite impressive. I wonder if they used a new kind of chipset or if this is purely a matter of software.
> I wonder if they used a new kind of chipset

They are using their new professional flagship, the A100, worth 25000€ per unit. It is quite a huge technological step in several respects (2 TB/s bandwidth, GPU partitioning support...). The 80 GB variant is expected to be about 5 times more powerful than the previous flagship V100 on average (from 3x to 7x depending on the application).
> Considering that Isaac is 3 times faster, it takes about 1.5e-7 s to perform a single integration step. Finally, according to the documentation of MuJoCo, they are using RK4 (not RK45 as I mentioned).

The numbers seem impressive. Yet, we have to make the distinction between the average throughput and the cost of a single call to a simulation step. Simulation algorithms are very hard to parallelize: you can simulate several instances on different threads, but you cannot make a single step parallel, due to the physics equations. So, if you have the timings for this single call, I would be interested: it could be used as a golden target to reach.
@jcarpent thanks for the remarks on how to exploit GPUs for Pinocchio! I'm playing with GPU multi-threaded programs, and I have some questions: does `CppADCodeGen` support the code generation needed by Pinocchio on GPU? If not, what framework do you have in mind? Thanks!
@hzyjerry Feel free to contact me on my email address to discuss this point.
@jcarpent Why not open a discussion on that topic instead? GPU acceleration is a hot topic and I guess others may be interested in discussing this point.
> @jcarpent Why not open a discussion on that topic instead? GPU acceleration is a hot topic and I guess others may be interested in discussing this point.

Indeed, it could be beneficial for derived projects such as Crocoddyl.
I don't want to animate a chatroom, as I currently have only a small amount of time for that. I would really prefer to work with people who can dedicate time soon to make a big push. @hzyjerry appears to be a motivated person. If you think you may dedicate some of your time, @duburcqa, I'm also open to adding you to a dedicated channel.
@jcarpent No problem, I can understand that. Probably later then!
Thanks for your awesome work on Pinocchio.
I'm wondering if it's possible to perform batch collision checking. For instance, I would like to sample 1000 different poses for my robot model in a fixed workspace and check what percentage of them collide with an obstacle. This would be very helpful for sampling-based motion planning tasks.
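For reference, here is a minimal sketch of the sequential version I have in mind, using the Python bindings (the URDF path is illustrative, appending the obstacle geometries to the collision model is omitted, and sampling assumes finite joint limits):

```python
import pinocchio as pin

# Load the kinematic model together with its collision model.
model, collision_model, visual_model = pin.buildModelsFromUrdf("robot.urdf")
collision_model.addAllCollisionPairs()

data = model.createData()
collision_data = pin.GeometryData(collision_model)

num_samples, in_collision = 1000, 0
for _ in range(num_samples):
    q = pin.randomConfiguration(model)
    # Last argument True: stop at the first detected collision.
    if pin.computeCollisions(model, data, collision_model, collision_data, q, True):
        in_collision += 1

print(f"{100.0 * in_collision / num_samples:.1f}% of sampled poses are in collision")
```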
Thanks!