araffin opened this issue 5 years ago
Hi, thank you for your interest, and sorry about the delayed reply. The notebook is a great idea! You can make a pull request and add it to the markdown, or however you see fit.
Hovering control of the flapping-wing robot is still an open problem, so I only have a feedback controller for the demo, which is already not easy to achieve. The system is extremely unstable, so it is very difficult to control.
The maneuvering task is trained for 5 million steps, using default hyperparameters with a reward scaling of 0.05. Yes, the inverse creates a huge reward at the target position and pose, which helps attract the robot and makes it converge better.
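To give a rough sense of the magnitudes (toy numbers only, not the env's actual cost function):

```python
# Toy comparison of inverse-cost vs. negative-cost rewards (illustrative numbers only).
for cost in [10.0, 1.0, 0.1, 0.001]:
    inverse_reward = 1.0 / cost   # grows very large as the cost approaches zero at the target
    negative_reward = -cost       # stays bounded and goes to zero at the target
    print(f"cost={cost:7}: 1/cost={inverse_reward:10.1f}  -cost={negative_reward:8}")
```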
I'll fix the reward to be a float instead of an array.
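Something along these lines (a minimal sketch of the change; the actual variable names inside the env's step() may differ):

```python
import numpy as np

def to_scalar_reward(reward):
    """Convert a reward that may come out as a 1-element numpy array
    into the plain Python float expected by the gym interface."""
    return float(np.asarray(reward).reshape(-1)[0])

# In step(), just before returning:
# reward = to_scalar_reward(reward)
```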
I'll post the training performance and some demo clips in the next update.
@araffin Hi, thanks for setting up the colab notebook. As I explained in #2, pydart2 is not compatible with the latest dartsim version, v6.9. To run it successfully, one needs to install dartsim<=6.8.2 from source, as indicated in my last comment in #2. Could you update the notebook to reflect this change so that it runs successfully? Thank you.
> Could you update the notebook to reflect this change so that it runs successfully?
Well, you can copy and update the notebook yourself (and post the link here afterward ;)). I don't have the time to do that now.
It seems that the reward-type error still exists: the reward is an np.ndarray most of the time. BTW, the maneuver env seems to train the model to correct the output of the ARC controller. I'm wondering whether there is a successful example of training without a feedback controller, or is it just too difficult to achieve that kind of control?
Hello,
I set up a colab notebook, so you can train your agents online on flappy envs ;) : https://colab.research.google.com/drive/13mJ1bU2tKVurG9chNhM0U7ivgVKlzPu7
Also, I have some questions about the training:
It seems that your maneuver env does not follow the gym interface: the reward must be a float, but it is currently a numpy array (I had to use a reward wrapper to work around the error, see the sketch at the end of this comment).
I would also compute the reward as the opposite of the cost instead of its inverse (otherwise the reward magnitude is really huge), and maybe add a "life bonus" (+1 for each timestep) for the hover env, see here for an example ;)
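For reference, here is a minimal sketch of the kind of reward wrapper I mean (the env id below is just a placeholder, use whatever id the repo actually registers):

```python
import gym
import numpy as np


class ScalarRewardWrapper(gym.RewardWrapper):
    """Cast the reward returned by the env (a 1-element numpy array) to a plain Python float."""

    def reward(self, reward):
        if isinstance(reward, np.ndarray):
            return float(reward.reshape(-1)[0])
        return float(reward)


# Usage (placeholder env id):
# env = ScalarRewardWrapper(gym.make("fwmav_maneuver-v0"))
```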