Erfi opened this issue 1 year ago
It's a very good question. Here are my thoughts:
I like Figure 5 in the current version of the paper on Overleaf. I would like to keep the same figure or replace it with a similar one showing only the end-effector positions, to save some space.
I would generate a figure with the training progress. I'm not sure if we should add it to the paper or to the appendix (which we submit as additional material).
I would definitely submit a video with demonstrations. First, sample a bunch of goal configurations far from the wall and ask the policy to go through them. Second, sample a bunch of points close to the wall and ask the robot to go through them.
I don't know what the best way to do it is, but I would show how the policy deals with the flexibility of the robot: whether it tries to avoid it and stay as rigid as possible, or takes advantage of it.
What do you think? What else could we add?
1) A table of (mean reward, mean path length) over 10-20 evaluation rollouts for all IL, RL, and IRL methods [without safety filter]
Select the winner (DAGGER, SAC, MPC, DAGGER+SF, SAC+SF, MPC_HOR1, MPC_HOR2)
2) A scatter plot, x-axis: computation time per action, y-axis: number of constraint violations (best is lower left); see the sketch after this list
3) Discussion
4) Video
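For the point-2 scatter plot, a minimal plotting sketch (the numbers in `results` are placeholders, not measured values):

```python
import matplotlib.pyplot as plt

# Placeholder values only: x = computation time per action [s],
# y = number of constraint violations (best is lower left).
results = {
    "DAGGER": (0.001, 12),
    "SAC": (0.002, 15),
    "MPC": (0.050, 3),
}

fig, ax = plt.subplots()
for name, (t_comp, n_viol) in results.items():
    ax.scatter(t_comp, n_viol, label=name)
ax.set_xlabel("Computation time per action [s]")
ax.set_ylabel("Number of constraint violations")
ax.legend()
plt.show()
```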
@shamilmamedov for point 1 (comparing all algorithms based on reward) I can plot a boxplot of the trajectory rewards (this one with only 5 rewards, and later with 100 trajectories for each algorithm; currently waiting for the trajectory collection to finish).
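A minimal sketch of such a boxplot (the reward values below are placeholders, not real results):

```python
import matplotlib.pyplot as plt

# Placeholder trajectory rewards per algorithm (5 rollouts each for now).
rewards = {
    "DAGGER": [-1.2, -1.0, -0.9, -1.1, -1.3],
    "SAC": [-1.5, -1.4, -1.6, -1.2, -1.7],
    "MPC": [-0.8, -0.9, -0.7, -1.0, -0.8],
}

fig, ax = plt.subplots()
ax.boxplot(list(rewards.values()), labels=list(rewards.keys()))
ax.set_ylabel("Trajectory reward")
plt.show()
```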
Right now the goals are sampled everywhere (100 trajectories for each algorithm, with 100 randomly sampled goals). You mentioned before that we might want to also have trajectories where the goals are sampled near the wall. In that case, can you give me the start and end for `qa_goal`? We will need to change `qa_goal` to `qa_goal_start` and `qa_goal_end` and sample in that range.
I thought that in point one we compare only the IL algorithms in the env without the goal, and choose the best one for further experiments.
Since our reward is `-dt * ||p_goal - p_ee||_2`, indeed we don't need to separately plot the path length.
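For reference, a minimal sketch of that reward term (assuming `p_goal` and `p_ee` are the goal and end-effector positions as numpy arrays and `dt` is the control timestep):

```python
import numpy as np

def step_reward(p_goal: np.ndarray, p_ee: np.ndarray, dt: float) -> float:
    # Negative end-effector-to-goal distance scaled by the timestep,
    # so the return is (minus) the time-weighted distance to the goal.
    return -dt * np.linalg.norm(p_goal - p_ee)
```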
For point two, it would be great if we could sample the goal position and the initial position from different sets. That is what you meant, right? I will provide the ranges tomorrow.
yes, exactly. For point one the goal can be anywhere. But for point two let's have specific ranges.
For the second point, the range of angles for the goal position is:
qa_range_start = np.array([-np.pi / 12, 0.0, -np.pi + 0.2])
qa_range_end = np.array([-np.pi / 12, np.pi/2, 0.0])
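A minimal sketch of how a goal could be sampled in that range (assuming uniform sampling per joint, which may not be exactly how the env samples goals):

```python
import numpy as np

qa_range_start = np.array([-np.pi / 12, 0.0, -np.pi + 0.2])
qa_range_end = np.array([-np.pi / 12, np.pi / 2, 0.0])

def sample_goal(rng: np.random.Generator) -> np.ndarray:
    # Uniform sample per joint between the two bounds; the first joint
    # has identical bounds, so it stays fixed at -pi/12.
    return rng.uniform(qa_range_start, qa_range_end)

goal_qa = sample_goal(np.random.default_rng(0))
```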
(with 10 trajectories per algo, goal sampled near the wall)
This shows that using the safety wrapper affects the performance (reward) quite a lot. We should definitely do what you said and set the `ref_u` to something more meaningful than zeros.
It's to be expected that the safety wrapper will affect the performance; it is going to make the policy more conservative. However, we should indeed tune the safety filter as much as we can. I'll try to fix `ref_u` now.
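One possible direction, purely as a sketch and not the actual safety-filter API (`u_policy` and `horizon` are hypothetical names): hold the policy's proposed action over the filter horizon instead of an all-zeros reference.

```python
import numpy as np

def build_ref_u(u_policy: np.ndarray, horizon: int) -> np.ndarray:
    """Tile the policy's current action over the horizon as the reference
    control sequence, instead of an all-zeros reference."""
    return np.tile(u_policy, (horizon, 1))
```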
Another plot, but with (inference time) vs (violations):
Surprisingly, the constraint violations don't go down much when using the safety filter. That's not good news :(
True, though for SAC it seems to be helping more.
We need to compute the gain in terms of constraint violations when using the SF; maybe because the X-axis range is large we don't see the improvement.
I am not clear on what you mean by that. Can you explain a bit more, and how do you think we could implement it?
Sorry, I was talking about the figure.
It seems like for DAGGER it actually increases the violations slightly
Too bad. Perhaps we should separate the constraints into critical ones, like the wall and the ground, and non-critical ones, like the joint constraints.
I think that might be a good idea. Let me replot with only the wall and ground constraints and see what happens.
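A minimal sketch of that split (the per-constraint violation dict and the constraint names are hypothetical, not the actual logging format):

```python
# Hypothetical per-rollout violation counts keyed by constraint name.
CRITICAL = {"wall", "ground"}

def count_critical_violations(violations: dict) -> int:
    """Sum only violations of the critical constraints (wall, ground),
    ignoring non-critical ones such as joint limits."""
    return sum(n for name, n in violations.items() if name in CRITICAL)

# Joint-limit violations are excluded from the count:
print(count_critical_violations({"wall": 2, "ground": 0, "joint_1": 5}))  # 2
```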
Yes! (only wall and ground constraints)
Awesome! These constraints are really critical, so we can argue why we chose them.
Ideally, we should also change the stiffness of the joints and see how robust the policies are. For that, some time ago I changed `SymbolicFlexibleArm3DOF` to accept the flexibility params. If we could somehow pass those params through the env, then we could test the policies and put the results in the paper as well.
So for the joint-flexibility experiment, we want to see the performance of the trained algorithms on a version of the environment that has different flexibility parameters. What about the MPC or the SafetyFilter? As I understand it, the MPC will then use the old flexibility parameters (there will be a deliberate mismatch). Should the SF be based on the new flexibility parameters or the old ones?
The SF will also use the old flexibility parameters. We will introduce model-plant mismatch at all levels. I speculate that the SF might be a bottleneck.
@shamilmamedov Ok, so for the robustness plot (modifying flexibility parameters): x-axis: constraint violations, y-axis: rewards, with purple dots for the matching model (what we currently have) and orange for the mismatched (more flexible) env?

In the models folder, there are two .yml files: `flexible_link_params.yml` and `flexible_link_params_test.yml`. If not provided, `SymbolicFlexibleArm3DOF` defaults to `flexible_link_params.yml`. If we pass the path to `flexible_link_params_test.yml` through `fparams_path` to `SymbolicFlexibleArm3DOF`, then we will have a model with different flexibility parameters. Right now the test flexibility parameters (only the stiffness) are 10% different from the training ones.
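A minimal sketch of constructing the mismatched model (the import path below is a guess, and the real constructor may require additional arguments besides `fparams_path`):

```python
# Hypothetical import path; adjust to wherever SymbolicFlexibleArm3DOF lives.
from flexible_arm_3dof import SymbolicFlexibleArm3DOF

# Training model: uses the default flexible_link_params.yml.
train_model = SymbolicFlexibleArm3DOF()

# Test model: same arm, but with the test flexibility (stiffness) parameters.
test_model = SymbolicFlexibleArm3DOF(
    fparams_path="models/flexible_link_params_test.yml"
)
```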
Scatter plots sound good to me. If we don't have space, then we can use a table.
Description: To help us decide how to run the evaluations and which data to save during training / evaluation, we should decide on what plots we would like to see in the paper.