Open shamilmamedov opened 10 months ago
Hi Shamil, I have the following comments:
R1.1: This is an easy and important point. I would describe the NN architecture better and experiment with algorithms such as GAIL.

R1.2: We could use the safety filter during RL training, but without backpropagation through time. That would be too time-consuming to set up.

R1.3: I think this comment is of minor importance, since we generalize over our randomized environment. We do not intend to apply this to another setup.
R2.1: I would not set up another robot experiment. In the long run, we could think of an environment that resembles your new lab robot.

R2.2: I suspect this reviewer has never implemented SQP-based NMPC. I have never seen a successful application of an LSTM model in a real-time setup. All I know of are Gaussian processes within Zeilinger's ETH group, but they are far from real-time capable. Beyond that, I know of Fabio Bonassi from Milano, who used LSTMs with CasADi and IPOPT for simple systems, and of groups in the US that use sampling-based MPC (MPPI) with learned models. Summing up, I would not yet consider "learned models in SQP NMPC" state of the art.
Hi both,
R1.1 (Hyper-parameters): Yes, we can describe our training pipeline, the network architectures, and the hyper-parameters (it is common to do so in the RL literature, since they influence the results quite a lot). We can run ablation studies on different network sizes and other hyper-parameters if need be; some are more straightforward than others, but they are all doable. The default parameters are usually based on the network architecture and hyper-parameters used in the original papers (GAIL, AIRL, ...), so I am not expecting huge gains from moving away from the default values, but we should at least report them.
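For the ablations, something as simple as a grid over network sizes and learning rates, repeated over a few seeds, would already give us reportable numbers. A minimal sketch (here `train_and_evaluate` and the grid values are hypothetical stand-ins for our actual pipeline):

```python
import itertools

# Stand-in for the real training/evaluation pipeline; in practice this
# would train e.g. GAIL/AIRL with the given config and return the mean return.
def train_and_evaluate(hidden_sizes, lr, seed):
    return {"hidden_sizes": hidden_sizes, "lr": lr, "seed": seed, "return": 0.0}

grid = {
    "hidden_sizes": [(64, 64), (256, 256)],  # default MLP vs. a larger one
    "lr": [3e-4, 1e-3],                      # defaults from the original papers
}
seeds = [0, 1, 2]

results = [
    train_and_evaluate(h, lr, s)
    for h, lr in itertools.product(grid["hidden_sizes"], grid["lr"])
    for s in seeds
]
print(len(results))  # prints 12: one entry per (config, seed) pair
```

Aggregating `results` per config (mean and std over seeds) is then exactly the kind of table reviewers ask for.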
R1.3 (Generalization): In RL the environment is usually fixed, so there is not much emphasis on generalization in that sense, but the algorithm is typically trained and evaluated with multiple seeds to show that its performance does not depend on the randomization. For generalization/robustness, I have seen people change the start distribution (e.g., the starting point in the LunarLander env). More generally, we would like the policy to perform well under perturbations in 1) the start distribution, 2) the transition dynamics (change the flexibility of the arms?), and 3) the reward distribution (add noise to the reward?), but I haven't seen anyone do all of these in one paper.
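To make the multi-seed plus perturbed-start-distribution evaluation concrete, here is a toy sketch on a 1-D "reach the origin" task; the environment, the proportional-controller "policy", and the start-scale values are all hypothetical stand-ins for our setup:

```python
import random
import statistics

def rollout(seed, start_scale):
    rng = random.Random(seed)
    x = rng.uniform(-start_scale, start_scale)  # perturbed start distribution
    total_reward = 0.0
    for _ in range(50):
        action = -0.5 * x        # stand-in "policy": proportional controller
        x += action
        total_reward += -abs(x)  # reward: negative distance to the origin
    return total_reward

for start_scale in (1.0, 2.0):   # nominal vs. widened start distribution
    returns = [rollout(seed, start_scale) for seed in range(5)]
    print(start_scale, statistics.mean(returns))
```

Reporting mean and spread of the per-seed returns, for the nominal and perturbed distributions, is exactly the evidence the reviewer is asking for.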
R2.1 (Evaluation environments): Here I think we are limited to environments whose dynamics model is known, since the MPC needs it, no? For which environments do we know the dynamics model? From the RL/IRL side this is not a problem, since those methods are model-free (BC, DAgger, GAIL, AIRL, PPO, SAC, ...).
I would like to discuss several comments and recommendations provided by the reviewers of L4DC. I believe it's crucial that we carefully consider these points as we work on revising the paper for ICRA. I will briefly outline the relevant comments so that we can discuss the actions we can take to address them in the ICRA paper.
Reviewer 1
Reviewer 2
Looking forward to hearing your opinions on the reviewers' comments.