ros-industrial-consortium / descartes

ROS-Industrial Special Project: Cartesian Path Planner
Apache License 2.0

Parallelism in planning graph #55

Open Jmeyer1292 opened 9 years ago

Jmeyer1292 commented 9 years ago

The straightforward nature of the core Descartes algorithm contained in planning_graph.cpp lends itself very well to doing parallel computations. For example, the bulk of the computational cost in generating the graph is in calculating IKs where each point is considered on its own. This applies to edge calculations and even searching the graph from multiple starting points.

Descartes could benefit massively from splitting the graph generation step up into threads. This imposes the interesting requirement that all classes derived from TrajectoryPt and RobotModel must be thread-safe. That is currently not the case with the MoveIt state adapter, because RobotModel is not thread-safe.

I'm not sure what the right solution is. The easiest solution would be to copy the user's RobotModel into each thread doing computation, but that would require adding a clone() function to Descartes' RobotModel interface.

We could also create an alternate interface that inherits from the current one but adds the clone method. This would be fed to an alternate planning algorithm that takes, for example, a "ThreadSafeRobotModel". This might involve a lot of code duplication.
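
To make that second option concrete, here is a minimal sketch of what such an interface could look like (the names and include path are assumptions for illustration, not existing Descartes code):

```cpp
#include <boost/shared_ptr.hpp>
#include <descartes_core/robot_model.h>  // assumed include path for the existing interface

namespace descartes_core
{
// Hypothetical "alternate interface" from the comment above: it only adds the
// clone() needed so each worker thread can own an independent copy of the model.
class ThreadSafeRobotModel : public RobotModel
{
public:
  virtual ~ThreadSafeRobotModel() {}

  // Return a deep copy that is safe to use from another thread.
  virtual boost::shared_ptr<ThreadSafeRobotModel> clone() const = 0;
};

typedef boost::shared_ptr<ThreadSafeRobotModel> ThreadSafeRobotModelPtr;
}  // namespace descartes_core
```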

gavanderhoorn commented 9 years ago

MoveIt parallelised its planning in the recent release; it might be an idea to see how they solved these issues. Note: I don't know whether they have thread-safe data structures, so it might've been a non-issue there.

dpsolomon commented 9 years ago

MoveIt spent a lot of time making most of the const members thread-safe as part of that effort. I think that's the best way to approach this. The planning scene diff() function is an interesting approach to "cloning" a robot state, but it may be too complicated to warrant the effort.

Jmeyer1292 commented 9 years ago

So in MoveIt's JointModelGroup class there is a function

const kinematics::KinematicsBaseConstPtr & getSolverInstance () const

which, besides having a weird function signature, provides us with the functionality needed to do forward and inverse kinematics.

http://docs.ros.org/indigo/api/moveit_core/html/classkinematics_1_1KinematicsBase.html

Nowhere does it explicitly say it's thread-safe, but it does have a very "functional programming" set of function signatures.
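
For reference, here is a rough sketch of driving that solver instance directly; the getPositionIK signature is recalled from the indigo-era docs linked above, so treat it as an assumption and double-check against KinematicsBase:

```cpp
#include <moveit/robot_model/joint_model_group.h>
#include <moveit/kinematics_base/kinematics_base.h>
#include <geometry_msgs/Pose.h>
#include <moveit_msgs/MoveItErrorCodes.h>
#include <vector>

// 'group' is assumed to be a valid, fully initialized JointModelGroup.
bool solveIkOnce(const moveit::core::JointModelGroup* group,
                 const geometry_msgs::Pose& target,
                 const std::vector<double>& seed,
                 std::vector<double>& solution)
{
  const kinematics::KinematicsBaseConstPtr& solver = group->getSolverInstance();
  if (!solver)
    return false;

  moveit_msgs::MoveItErrorCodes error_code;
  // getPositionIK is a const member (hence callable through the ConstPtr above);
  // whether it is also thread-safe depends entirely on the loaded plugin.
  return solver->getPositionIK(target, seed, solution, error_code);
}
```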

Another solution that only addresses parallelism in the graph-building portion of the planner is to add a method to Descartes' RobotModel interface that takes as input a vector of Cartesian points to solve and returns all of the joint solutions for all of the points. The default implementation would just repeatedly call the single-point getIK for each point in the vector, but others could override it. We would create an implementation that used threads to split up the workload.

This would allow us to parallelize graph building in our own RobotModel however we like, and Descartes would only need minor modifications. It might impose some more memory overhead because we'd have to keep a copy of the points array around. This solution also would not do anything to improve graph searching, which can itself take a very long time.
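
Roughly, the proposed batch call could look like this (a sketch with made-up names; the single-point signature is only illustrative of the existing style):

```cpp
#include <cstddef>
#include <vector>
#include <Eigen/Geometry>
#include <Eigen/StdVector>

// Aligned vector of poses, as required when storing fixed-size Eigen types.
typedef std::vector<Eigen::Affine3d, Eigen::aligned_allocator<Eigen::Affine3d> > PoseVector;

// Sketch of the proposed batch method. The default body just loops over the
// existing single-point solver; a derived model could override it with a
// threaded version, and Descartes itself would only need minor changes.
class BatchIkRobotModel
{
public:
  virtual ~BatchIkRobotModel() {}

  // Existing-style single-point query (signature is only illustrative).
  virtual bool getAllIK(const Eigen::Affine3d& pose,
                        std::vector<std::vector<double> >& joint_poses) const = 0;

  // Proposed batch query: one entry in 'joint_poses' per input pose.
  virtual bool getAllIKBatch(const PoseVector& poses,
                             std::vector<std::vector<std::vector<double> > >& joint_poses) const
  {
    joint_poses.resize(poses.size());
    bool any_valid = false;
    for (std::size_t i = 0; i < poses.size(); ++i)
      any_valid |= getAllIK(poses[i], joint_poses[i]);
    return any_valid;
  }
};
```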

dpsolomon commented 9 years ago

Because the kinematics solvers are offered as plugins, with the idea that anyone can easily implement and integrate their own, it is understandable that they do not assume or require that the kinematics solver be thread-safe. Also, I think it is very hard to write a base class that enforces thread safety in derived classes.

I agree that at this time it is difficult to parallelize the graph solve, since it uses Boost Graph and most of that implementation already lives there.

Adding the get-multiple-IK call has a lot of promise, I think. Is there an advantage to putting it there vs. in the planner?

BrettHemes commented 8 years ago

+1

Despite the apparent difficulty, this would be awesome! It brings a tear to my eye watching the planner run on a nice machine, only to have it use a single core...

I would be willing to help out on this but doubt I could effectively tackle this without at least some direction from you guys.

Jmeyer1292 commented 8 years ago

Hey @BrettHemes, sorry for the slow response, I've been on travel.

I agree with the need to do some of this, and it was a real focus of my efforts in my planning graph refactor (#179). While not in there at the moment, thread parallelization can be achieved with ~3 lines of #pragma directives using OMP.
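
Something like this, assuming the per-point work is independent and thread-safe (all the names in this sketch are made up for illustration):

```cpp
#include <cmath>
#include <vector>

// Toy stand-in for the per-point IK/validity work; the real requirement
// discussed in this thread is that this call be thread-safe.
static std::vector<double> computeJointSolutions(double point)
{
  return std::vector<double>(6, std::sin(point));
}

void buildVertices(const std::vector<double>& points,
                   std::vector<std::vector<double> >& solutions)
{
  solutions.resize(points.size());
  // The "~3 lines" above: a parallel-for pragma (compiled with -fopenmp).
#pragma omp parallel for
  for (int i = 0; i < static_cast<int>(points.size()); ++i)
    solutions[i] = computeJointSolutions(points[i]);
}
```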

I'm not entirely sure where effort is best spent. I really want to transition the library towards the changes made in that PR, but it's a lot of work and may take a bit. If time is of the essence, then I'll look into what we can do right now.

BrettHemes commented 8 years ago

@Jmeyer1292 All right, I haven't had time to dig too deep into #179 (yet), but it looks really cool. I think getting that merged trumps optimizing the current approach, if that is indeed the direction Descartes is headed.

That said, I did some runtime analysis, and the biggest hit from enabling collision detection is in calculateEdgeWeights (makes sense). I got a working modification that parallelizes this function using OMP, with each thread using a clone of the planning graph's robot_model. This is similar to #58 but much less involved due to using OMP. With 12 cores I am seeing a realistic 10x speedup in the joint solution computations. This gets me back down to feasible execution times :smile: I also think this approach will play nice with #179 (need to investigate further) and thus might be worth merging. If you are interested, let me know and I'll get it pushed to my public fork in a week or so.
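
For anyone curious, the clone-per-thread pattern is roughly the following (just a sketch; the clone() method and model interface here are assumptions, not the current Descartes API):

```cpp
#include <cstddef>

// Hypothetical model interface: each OpenMP thread gets its own copy so the
// non-thread-safe MoveIt-backed state is never shared between threads.
class CloneableModel
{
public:
  virtual ~CloneableModel() {}
  virtual CloneableModel* clone() const = 0;
  virtual void solvePoint(std::size_t point_index) = 0;  // IK + collision check for one point
};

void solveAllPoints(const CloneableModel& shared_model, std::size_t num_points)
{
#pragma omp parallel
  {
    // One private clone per thread, created once and reused for every point
    // this thread is assigned by the work-sharing loop below.
    CloneableModel* model = shared_model.clone();

#pragma omp for
    for (int i = 0; i < static_cast<int>(num_points); ++i)
      model->solvePoint(static_cast<std::size_t>(i));

    delete model;
  }
}
```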

Jmeyer1292 commented 8 years ago

@BrettHemes I'd be interested for sure.

Could you elaborate on the runtime analysis you did? calculateEdgeWeights has been the most expensive part of the Descartes search without collision checking, but I assumed that the cost would be shifted into getJointSolutions() and getJointPoses where the actual checking is being done (as part of isValid).

I realize I can run these trials myself as well, but I'm just curious what you saw.

BrettHemes commented 8 years ago

@Jmeyer1292 Yeah on one of my test trajectories (~900 points, orientation tolerances on all three axes, IKFast) I get the following run time allocation:

No Collisions (single threaded)
  Joint Solutions: 30s
  Edge Weights: 35s
  Graph Vertices: 4s
  Graph Edges: 6s

With Collision Detection (single threaded)
  Joint Solutions: 828s
  Edge Weights: 28s
  Graph Vertices: 3s
  Graph Edges: 5s

With Collision Detection (12 threads)
  Joint Solutions: 138s
  Edge Weights: 29s
  Graph Vertices: 3s
  Graph Edges: 4s

So not quite 10x on this example, but still significant. The part mesh was on the order of 500 kB.

Tomorrow I might see if I can't get the edge weight computations parallelized as well. Thread safety is less trivial here but I have at least one idea on how to make it work. I will also start the release process so I can push these modifications to GitHub.

Jmeyer1292 commented 8 years ago

Oof, those times are brutal.

If I wanted to recreate something similar to your path, what kind of discretization are you using on the angles?

BrettHemes commented 8 years ago

The extra orientation tolerances add crazy overhead. Really good example of the curse of dimensionality...

BrettHemes commented 8 years ago

Sweet, the edge weight computation worked with little modification, and I ended up with the following run times on the same trajectory:

With Collision Detection (12 threads) [parallel joint solutions and edge weights]
  Joint Solutions: 140s
  Edge Weights: 5s
  Graph Vertices: 3s
  Graph Edges: 4s

So a 6x improvement on the edge weights with 12 cores... not too bad. I think if I get creative with the collision matrix and/or meshes I can get this a bit faster yet. And #179 should help as well!

Jmeyer1292 commented 8 years ago

#179 does a lot to improve the core algorithm in terms of machine friendliness: it will use less memory and make edge computations faster. I hope you see close to linear scaling in all of your OMP directives with these changes.

We'll need to look at what we can do to:

I'll set up some benchmarks tonight with collision objects so that I can contribute to this effort with actual measurements.

Very excited to see you're getting enough scaling to make things feasible.

BrettHemes commented 8 years ago

All right, changes in #182

yijiangh commented 6 years ago

@Jmeyer1292 Yeah on one of my test trajectories (~900 points, orientation tolerances on all three axes, IKFast) I get the following run time allocation: ...

@BrettHemes Would you mind giving me an estimate of how many vertices (ladder graph vertices) you have in this benchmark? With the LadderGraph data structure, I'm wondering whether you have encountered issues with running out of RAM.

For me, using the ladder graph in kinetic-devel, I ran out of my 16 GB of RAM with a ladder graph of 3,205,293 vertices and 1,007 rungs.

I'm trying to diagnose the source of the problem right now. It would be great if you could provide some insight on this.

BrettHemes commented 6 years ago

@yijiangh I am still on a fork of indigo-devel that has the first part of @Jmeyer1292's refactor but not the most recent stuff from the top of kinetic-devel. That said, I can't tell you how many vertices and rungs exist for my paths of ~5k waypoints.

BrettHemes commented 6 years ago

@Jmeyer1292 How are you feeling about your kinetic-devel branch progress? I need the OMP parallelization for my application to work, and I tried to quickly rebase on kinetic-devel, but the newest changes are such that I will have to massage the OMP stuff a bit. If you like where yours is and are open to eventually merging the OMP stuff, I will work on getting it up to date with the HEAD of kinetic-devel; otherwise I will hold off and stick with my private fork of indigo-devel.

yijiangh commented 6 years ago

@BrettHemes +1 for updating kinetic-devel with OMP. The parallelism speedup on graph construction would be essential for my project too. I'd love to help with and test the OMP kinetic-devel rebase!

And thanks for letting me know about your test with ~5k points. I feel like the RAM issue will eventually strike whenever a huge ladder graph is constructed (a lot of vertices in each rung, like the situation I encountered in my application, #200). I will look into that more closely and report back soon.

Jmeyer1292 commented 6 years ago

@BrettHemes Give me this week. I'll get the outstanding pull requests merged in, including the OMP stuff.

@yijiangh Are you sure it was purely the size of your graph? 3,205,293 vertices × 6 DOF × 8 bytes per vertex = 153,854,064 bytes. Divide that by 1e9 = 0.15 GB. Is it all edges then? Are you discretizing very finely?

I only ask because I've observed, but not hunted down, an issue in my own work where the code that samples Cartesian points gets stuck in an infinite loop until it exhausts my system RAM.

BrettHemes commented 6 years ago

Thanks @Jmeyer1292!

yijiangh commented 6 years ago

@Jmeyer1292 Thanks for your comments, and sorry for my delayed response. After I went back to check the result, I found that edge storage is indeed the major issue.

Let's take a model with 250 elements as an example, where the resulting ladder graph has 4,370,024 vertices, 1,974,038,700 edges (!), and 1,426 rungs. The vertices only occupy 4,370,024 × 6 (DOF) × 8 bytes ≈ 0.2 GB, while the edges take up 1,974,038,700 × (8 bytes (double, cost) + 4 bytes (unsigned int, idx)) ≈ 23.7 GB!

I think some kind of informed edge construction would be very helpful to alleviate this issue, especially for the edges between unit ladder graphs (the appendInTime function here).

@Jmeyer1292 @BrettHemes Thanks for your comments! More thoughts and comments are welcome!

Jmeyer1292 commented 6 years ago

@yijiangh A couple of thoughts:

  1. Instead of naively appending graphs, we could come up with a data structure that understands that free-space segments will have all edges connected, and then stores only the edges that have failed. Or something like that. That should give you equivalent performance without the huge edge-storage cost.

  2. We can store edge costs as floats instead of doubles. Not a huge improvement, but we don't really need the extra significant figures here, and it would probably make everything a little more scalable and better aligned (see the sketch below).
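
To put point 2 in numbers, here is a sketch of what a slimmer edge record could look like (a hypothetical layout, not the actual LadderGraph edge type):

```cpp
#include <cstdint>

// Per the arithmetic above, each edge currently costs 8 bytes (double cost)
// + 4 bytes (unsigned int index) = 12 bytes. Using a float instead gives a
// naturally aligned 8-byte record, so the ~1.97 billion edges quoted above
// would drop from roughly 23.7 GB to roughly 15.8 GB.
struct SlimEdge
{
  float cost;         // edge cost; float precision is plenty for path costs here
  std::uint32_t idx;  // index of the joint solution in the next rung
};
```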

yijiangh commented 6 years ago

@Jmeyer1292 Thanks for your feedback as always.

Actually, I just came across a new idea last week and tested it with a quick implementation. The central idea is that instead of enumerating all the poses in each rung and connecting all the edges between them, we first run an algorithm to determine the optimal poses for adjacent rungs with a shared EEF pose, and then construct a unified ladder graph based on these optimal poses to search for a joint trajectory.

As my application (#200) needs strict Cartesian paths for groups of adjacent rungs, I group them together and call each group a capsule (or a snapshot). Our goal is to compute the optimal EEF poses for these capsules, where the cost is the L1 norm between one capsule's end joint pose and the next capsule's start joint pose. I only record the start and end joint poses for each capsule for computing the cost between capsules. To efficiently search for the optimal capsule path (optimal poses for groups of rungs), I apply RRT* on this capsulated ladder tree. Although we lose determinism with this sampling-based method, we do have an asymptotic optimality guarantee. Moreover, we don't need to discretize the 2π domain for the z_axis_symmetry pose!

This allows us to quickly detect whether there's any infeasible capsule (all rungs inside it are empty) and to obtain near-optimal poses quite efficiently without worrying about memory overhead. For a problem with 70 capsules (approx. 800 rungs), the initial solution can be found in 3 seconds with cost 246, and it can be optimized down to cost 30 after 300 seconds of running RRT*. More importantly, my RAM is happy, with no fluctuation observed!

After getting the optimal poses from these capsules, we expand each capsule into a unit ladder graph and concatenate them together into a unified ladder graph. Then we can run our beloved DAG search on this leaner and more focused ladder graph for a full joint trajectory solution.
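
Roughly, in code, a capsule only needs the grouped rungs plus the boundary joint states used for costing (hypothetical types, not the actual CapsulatedLadderTree implementation linked below):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the "capsule" described above: a group of adjacent rungs sharing an
// EEF pose, keeping only the boundary joint states needed to cost transitions.
struct Capsule
{
  std::vector<std::size_t> rung_ids;     // indices of the grouped ladder-graph rungs
  std::vector<double> start_joint_pose;  // joint state entering the capsule
  std::vector<double> end_joint_pose;    // joint state leaving the capsule
};

// L1-norm cost between adjacent capsules, as described above: the distance from
// one capsule's end joint pose to the next capsule's start joint pose.
double capsuleTransitionCost(const Capsule& prev, const Capsule& next)
{
  double cost = 0.0;
  for (std::size_t i = 0; i < prev.end_joint_pose.size(); ++i)
    cost += std::fabs(prev.end_joint_pose[i] - next.start_joint_pose[i]);
  return cost;
}
```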

Sorry for the quick and abstract explanation; maybe the drawings below make it easier to understand. The highlighted pink blocks indicate capsules, which encapsulate adjacent rungs with shared EEF poses.

[Conceptual hand drawings, pages 1 and 2: clt-rrt-sketch-12012017180923]

A quick implementation can be found here:

Key data structure: CapsulatedLadderTree

RRT* search: CapsulatedLadderTreeRRTStar.

Any comments are welcome!

Jmeyer1292 commented 6 years ago

@yijiangh Very cool. Thank you for taking the time to explain this to a wider audience: it's work like this that makes me want to be part of our little community.

I'll take the time this weekend to go through the process, but I just want to make sure I understand:

You have a big motion planning problem that involves some number of Cartesian segments where the extruder is on, and free-space paths between them. The Cartesian segments are ordered in time.

A Cartesian segment can be satisfied by one of possibly many families of kinematic solutions (defined by a tool axis and a rotation about that axis). Your capsule is a sparse representation of one of those families. So cartesian_segment 1 might have 3 valid tool directions, and each direction might search about the z axis by 180 degrees = 3 x 2 = 6 capsules for that segment.

You store and search this sparse representation, then expand a capsule into a full graph with each individual pose when you've reduced the size of your problem. Is this correct?

yijiangh commented 6 years ago

@Jmeyer1292 You are correct. The central theme here is first searching on this sparser representation and then "zooming in" to apply the full search. Sampling-based search on the capsule tree structure can also help us bypass the problem caused by the discretization granularity of the tool z-axis rotation angle.

I just updated the conceptual hand drawings above to give more detail on this.

dbdxnuliba commented 3 years ago

Thanks a lot. Can you provide the thesis related to Descartes?