Open shamilmamedov opened 10 months ago
Hi Shamil, I have the following comments:
R1.1: This is an easy and important point. I would describe the NN architecture better and experiment with algorithms such as GAIL.

R1.2: We could use the safety filter during RL training, but without backpropagation through time. That would be too time-consuming to set up.

R1.3: I think this comment is of minor importance, since we generalize over our randomized environment. We do not intend to apply this to another setup.
R2.1: I would not set up another robot experiment. In the long run, we could think of an environment that resembles your new lab robot.

R2.2: I suspect this reviewer has never implemented SQP-based NMPC. I have never seen a successful application of an LSTM model in a real-time setup. All I know of are Gaussian processes within Zeilinger's ETH group, but they are far from real-time capable. Beyond that, I know of Fabio Bonassi from Milano, who used LSTMs with CasADi and IPOPT for simple systems, and of groups in the US that use sampling-based MPC (MPPI) with learned models. Summing up, I would not yet consider "learned models in SQP NMPC" state of the art.
Hi both,
R1.1 (Hyper-parameters): Yes, we can describe our training pipeline, the network architectures, and the hyper-parameters (it is common to do so in the RL literature, since they influence the results quite a lot). We can run ablation studies on different network sizes and other hyper-parameters if need be; some are more straightforward than others, but they are all doable. The default parameters are usually based on the network architecture and hyper-parameters used in the original papers (GAIL, AIRL, ...), so I am not expecting huge gains from moving away from the default values, but we should at least report them.
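For the ablations, something as simple as a grid over network sizes and learning rates, repeated over a few seeds, would already give us reportable numbers. A minimal sketch (here `train_and_evaluate` and the grid values are hypothetical stand-ins for our actual pipeline):

```python
import itertools

# Stand-in for the real training/evaluation pipeline; in practice this
# would train e.g. GAIL/AIRL with the given config and return the mean return.
def train_and_evaluate(hidden_sizes, lr, seed):
    return {"hidden_sizes": hidden_sizes, "lr": lr, "seed": seed, "return": 0.0}

grid = {
    "hidden_sizes": [(64, 64), (256, 256)],  # default MLP vs. a larger one
    "lr": [3e-4, 1e-3],                      # defaults from the original papers
}
seeds = [0, 1, 2]

results = [
    train_and_evaluate(h, lr, s)
    for h, lr in itertools.product(grid["hidden_sizes"], grid["lr"])
    for s in seeds
]
print(len(results))  # prints 12: one entry per (config, seed) pair
```

Aggregating `results` per config (mean and std over seeds) is then exactly the kind of table reviewers ask for.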
R1.3 (Generalization): In RL the environment is usually fixed, so there is not much emphasis on generalization in that sense, but the algorithm is typically trained and evaluated with multiple seeds to show that its performance does not depend on the randomization. For generalization/robustness, I have seen people change the start distribution (e.g., the starting point in the LunarLander env). More generally, we would like the policy to perform well under perturbations in 1) the start distribution, 2) the transition dynamics (change the flexibility of the arms?), and 3) the reward distribution (add noise to the reward?), but I haven't seen anyone do all of these in one paper.
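To make the multi-seed plus perturbed-start-distribution evaluation concrete, here is a toy sketch on a 1-D "reach the origin" task; the environment, the proportional-controller "policy", and the start-scale values are all hypothetical stand-ins for our setup:

```python
import random
import statistics

def rollout(seed, start_scale):
    rng = random.Random(seed)
    x = rng.uniform(-start_scale, start_scale)  # perturbed start distribution
    total_reward = 0.0
    for _ in range(50):
        action = -0.5 * x        # stand-in "policy": proportional controller
        x += action
        total_reward += -abs(x)  # reward: negative distance to the origin
    return total_reward

for start_scale in (1.0, 2.0):   # nominal vs. widened start distribution
    returns = [rollout(seed, start_scale) for seed in range(5)]
    print(start_scale, statistics.mean(returns))
```

Reporting mean and spread of the per-seed returns, for the nominal and perturbed distributions, is exactly the evidence the reviewer is asking for.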
R2.1 (Evaluation environments): Here I think we are limited to environments whose dynamics model is known, since the MPC needs it, no? For which environments do we know the dynamics model? From the RL/IRL side this is not a problem, since those methods are model-free (BC, DAgger, GAIL, AIRL, PPO, SAC, ...).
I would like to discuss several comments and recommendations provided by the reviewers of L4DC. I believe it's crucial that we carefully consider these points as we work on revising the paper for ICRA. I will briefly outline the relevant comments so that we can discuss the actions we can take to address them in the ICRA paper.
Reviewer 1
Reviewer 2
Looking forward to hearing your opinions on the reviewers' comments.