feat: rec iql (OUTDATED)

lbeyers commented 7 months ago

What?

(Draft) Adding a first version of recurrent IQL that performs very well on smax.

Why?

The first version of Q-learning must be added to MAVA so that the rest of the versions can be built off it.

How?

The recurrent IQL file and its full config structure now run on a base that is periodically kept up to date with "develop".

Extra: I am still working on...

The IQL implementation file still needs:

- A commenting clean-up
- Pmapped updated steps vs rollout length
- Logging clean-up
- Evaluator implementation

I will be working on this while the PR is a draft.

Extra: Questions for reviewers

1. Should I do action masking in the network? Maybe have a discussion about what should and shouldn't be in the network.
2. What level of detail should I put into docstrings? If it depends, can you please point out which level of function you reckon needs argument explanations?
3. (More technical question) The exploration rate is currently incremented quite coarsely because of the parallel execution and env vmapping: the environment step counter t is updated per environment, but we have x devices and y envs on each device, so we actually get x times y transitions per concurrent env step. How much effort should I put into feeding more granular exploration rates to the agents? How much effort should I put into discovering at what point it makes an algorithmic difference?
4. Shared reward: some environments share rewards, some have individual rewards. Currently, whatever rewards are output by the env, we take the mean i.e. currently we enforce shared rewards. How do you want to interface with differences in reward systems?
5. Do we standardise which metrics we care about for logging, and do we decide that dynamically (i.e. have it in the config)? Should I investigate other implementations for how to handle logging? It seems like a question for the whole measure set team. - Question answered by a conversation with Liam, where he told me more about the wrapper and how metrics are calculated.
6. Do you want to see working runs and if so about how many on which envs? - Partially answered by a recent contribution by Wiem, for a standard baseline-check-config.
7. Let's have another discussion on dones. For now, all I can say is I can't see where it's super detrimental to handle truncated transitions differently.
8. What do we want total timesteps to actually mean? To me, it should mean the total number of env steps taken from which we can learn. In other words, if total timesteps is 100 and we have 5 devices and 5 envs per device (meaning 25 transitions generated per single pmapped-vmapped step) and a rollout length of 2, then we can only run the update function twice (25 x 2 x 2 = 100).
9. For the evaluator: can we not pass an apply function to the evaluator straight-up? Because inside make_eval_fns, it takes network.apply, and I would rather not pass a whole other network just to make a list of actions into a categorical distribution. This preference is probably influenced by the fact that I haven't worked with the discrete action head before and it just seems easier to pass a general function instead of deciding inside make_eval_fns that the apply_fn is always network.apply.
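
To make the counting in questions 3 and 8 concrete, here is a rough plain-Python sketch. The function names and the `eps_start`, `eps_end` and `decay_transitions` parameters are hypothetical and are not taken from the Mava codebase. Linear decay is only one assumed schedule, used here to show how coarse the exploration rate becomes when each per-env step t actually covers x times y transitions.

```python
def num_updates(total_timesteps, num_devices, envs_per_device, rollout_length):
    """How many update calls fit if total_timesteps counts every env transition."""
    # One pmapped-vmapped step yields num_devices * envs_per_device transitions,
    # and one update consumes rollout_length of those steps.
    transitions_per_update = num_devices * envs_per_device * rollout_length
    return total_timesteps // transitions_per_update


def epsilon_at(global_transitions, eps_start=1.0, eps_end=0.05, decay_transitions=10_000):
    """Linear epsilon decay over global transitions (assumed schedule, hypothetical defaults)."""
    fraction = min(global_transitions / decay_transitions, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_from_env_step(t, num_devices, envs_per_device, **schedule):
    """Map the per-environment step counter t to a global transition count first."""
    return epsilon_at(t * num_devices * envs_per_device, **schedule)


# Worked example from question 8: 100 timesteps, 5 devices, 5 envs, rollout length 2.
print(num_updates(100, 5, 5, 2))  # 2

# Question 3: between t and t + 1, epsilon moves by 25 transitions' worth of decay at once.
jump = epsilon_from_env_step(0, 5, 5) - epsilon_from_env_step(1, 5, 5)
print(jump)
```

Whether that per-step jump matters algorithmically depends on how it compares with the total decay horizon, which is the open question above.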

lbeyers commented 6 months ago

Thanks @WiemKhlifi! Sorry to only say this now, but this branch is out of date. A lot of the strictly Q-learning things are the same, but I will need to put out a PR from a different branch when it comes to it!

lbeyers commented 6 months ago

Updated rec IQL will be going in a different PR - tomorrow!

instadeepai / Mava