probcomp / Gen.jl

A general-purpose probabilistic programming system with programmable inference
https://gen.dev
Apache License 2.0

Proposal: Markov decision process modeling language #232

Closed femtomc closed 3 years ago

femtomc commented 4 years ago

Here's a version of the below proposal with rendered math: proposal

Background

I'm curious about the state of deep RL research in Julia at the moment, so I sat down to review some of the Markov decision process/environment libraries in Julia. It appears that the JuliaReinforcementLearning organization handles code related to deep RL in Julia. However, I am not happy with their interfaces for two reasons:

  1. The main codebases closely copy OpenAI's Gym interfaces from Python, which are focused on "flat" Markov decision processes and partially observable Markov decision processes. These are unnecessarily restrictive - a (PO)MDP can be specified by a triple <$P(O \mid S)$, $P(A \mid O)$, $P(S' \mid A, S)$> where each conditional distribution can be an arbitrary probabilistic program (subject to the spaces matching up across the triple). Expanding beyond these paradigms to specify e.g. open-universe POMDPs or hierarchical POMDPs is difficult with these APIs - they are most useful when the action/observation loop is flat and doesn't call into sub-MDPs. Extending the infrastructure to support these forms requires quite a bit of hacking.

  2. There is no language abstraction - new environments must be created with raw structures, with no helper macros.
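To make the first point concrete, here is a minimal sketch of the triple as three arbitrary generative functions in Gen. Everything here is illustrative (a scalar state space, Gaussian stand-ins for the real models) - the point is only that each component is a full probabilistic program, not an array-indexed spec:

```julia
using Gen

# Each component of the triple <P(O | S), P(A | O), P(S' | A, S)> is an
# arbitrary probabilistic program rather than a flat gym-style table.

@gen function observation_model(s::Float64)             # P(O | S)
    return @trace(normal(s, 0.5), :o)                   # noisy reading of the state
end

@gen function policy(o::Float64)                        # P(A | O)
    return @trace(normal(-o, 0.1), :a)                  # stochastic feedback policy
end

@gen function transition_model(a::Float64, s::Float64)  # P(S' | A, S)
    return @trace(normal(s + a, 0.2), :s)
end
```

Because each component is itself a generative function, any of them can internally call other generative functions - including sub-MDPs - which is exactly what the flat action/observation loop makes awkward.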

Proposal

Markov decision processes are probabilistic programs - this implies to me that implementing a Markov decision process specification language can re-use a large portion of Gen's existing infrastructure. In particular, the trace of a Markov decision process is equivalent to a probabilistic program trace with the addition of a scalar reward function evaluated on the product space $A \times S$ at each time step $t$. The structure of an MDP as a graphical model is easiest to see in the attached image (with which contributors here are likely familiar):

[figure: MDP as a graphical model (sergey_MDP)]
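That Markov structure maps naturally onto Gen's existing `Unfold` combinator. A sketch, with illustrative Gaussian choices standing in for the observation, policy, and transition components:

```julia
using Gen

# One time slice of the graphical model: S_t -> O_t -> A_t -> S_{t+1}.
# The Gaussian choices are illustrative stand-ins for arbitrary models.
@gen function step(t::Int, s::Float64)
    o = @trace(normal(s, 0.5), :o)          # observation model P(O | S)
    a = @trace(normal(-o, 0.1), :a)         # policy P(A | O)
    s′ = @trace(normal(s + a, 0.2), :s)     # transition model P(S' | A, S)
    return s′                               # threaded in as the state at t + 1
end

# Unfold feeds each step's return value into the next step's state
# argument, yielding exactly the Markov chain in the figure.
tr = simulate(Unfold(step), (10, 0.0))      # 10 steps from initial state 0.0
```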

I foresee a few interesting things occurring with the development of this sort of extension language:

  1. Interfaces to allow model-based RL to include interesting components, like probabilistic program synthesis and programmable inference over models.
  2. Flexible specification of nested MDPs or MDPs with action and observation product spaces (i.e. open-universe MDPs).
  3. Priors in deep RL - agents can incorporate arbitrary generative function structure as policy priors. With compatibility between Gen + Flux/TF and the ability to specify nested MDPs (which I believe is similar to nesting of generative functions), one can imagine policy classes which solve MDPs in a hierarchy.

I can sharpen this proposal up a bit to be clearer about some of the terminology I'm using. But the basic proposition is that having a flexible specification language for MDPs inside a general-purpose PPL is very powerful, and I think Gen is well over halfway there.

Initial work

I'm starting a few experiments with the GFI to see what sorts of constructs and interfaces are required. In gym-like environments, you typically have "specs" which interact with the environment and change the state/record observations according to the transition and observation models. One might imagine writing a separate construct @act for actions and @observe for observations, but I'm not yet convinced that any constructs beyond @trace are needed, except for visual and reasoning convenience.
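For instance, reserved address namespaces already give most of what @act/@observe would buy. A sketch (the macros are hypothetical; the kernel below just uses plain @trace with distinguished addresses, and the distributions are illustrative):

```julia
using Gen

# Plain @trace under reserved address namespaces (:obs, :act, :state).
# Hypothetical @observe/@act sugar would desugar to exactly these calls.
@gen function kernel(t::Int, s::Float64)
    o = @trace(normal(s, 0.5), :obs)     # could read: @observe(normal(s, 0.5))
    a = @trace(normal(-o, 0.1), :act)    # could read: @act(normal(-o, 0.1))
    s′ = @trace(normal(s + a, 0.2), :state)
    return s′
end

# Downstream code can then locate actions/observations by address, e.g.
# the action taken at step 3 of an unfolded rollout:
tr = simulate(Unfold(kernel), (5, 0.0))
a3 = tr[3 => :act]
```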

I'm writing a few POMDPs as monolithic generative functions to see what the right abstractions are (in particular, to record actions, observations, and rewards correctly). One big question I have is the relationship between the MDP as a whole and the policy class. One initial thought I had was to write a new type of GenerativeFunction called MDPGenerativeFunction where the args are Tuple{P, Vararg{Any}} where P <: GenerativeFunction. In this case, P is the policy (another generative function) - I'd really like to enforce interfaces on the spaces of actions and observations when a user is writing these functions.
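A rough sketch of that idea - everything here is hypothetical (MDPGenerativeFunction is not an existing Gen type, and a real implementation would need the full GFI):

```julia
using Gen

# Hypothetical wrapper: an MDP generative function whose first argument is
# the policy, itself a generative function. T and U are the return and
# trace types of the wrapped model.
struct MDPGenerativeFunction{T, U} <: GenerativeFunction{T, U}
    model::GenerativeFunction{T, U}   # monolithic (PO)MDP program
end

# Forward simulation, checking the policy argument at the boundary.
# A full version would also forward generate/update/regenerate, and
# enforce compatibility between the policy's input space and the model's
# observation addresses.
function Gen.simulate(mdp::MDPGenerativeFunction, args::Tuple)
    policy = args[1]
    policy isa GenerativeFunction ||
        error("first argument must be a policy generative function")
    return simulate(mdp.model, args)
end
```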

Reward specification and tracing

Additionally, I need to think through reward specification. This is an interesting problem - ideally you want to specify a reward function over the relevant spaces (i.e. $S$ and $A$) which gets tabulated at every @trace or @act call. This is exactly what the trace already does when tracking the log probabilities, so I think extending the tracing mechanism to record another function call is easy - but letting the user specify that function is the interesting part. I need to look at the infrastructure to see if it's possible to automate this.
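Until the tracing machinery supports this directly, one stopgap is to thread the running return through the kernel state, so the reward is tabulated once per step exactly where the log probability is. A sketch (the reward function and models are illustrative):

```julia
using Gen

reward(a, s) = -(s^2 + 0.1a^2)   # illustrative user-supplied reward on A × S

# State is (s, R): the environment state plus the running return, so the
# reward is accumulated each step alongside the trace's log probability.
@gen function step(t::Int, state::Tuple{Float64, Float64})
    s, R = state
    o = @trace(normal(s, 0.5), :obs)        # P(O | S)
    a = @trace(normal(-o, 0.1), :act)       # P(A | O)
    s′ = @trace(normal(s + a, 0.2), :state) # P(S' | A, S)
    return (s′, R + reward(a, s′))
end

tr = simulate(Unfold(step), (20, (0.0, 0.0)))
final_state, total_return = last(get_retval(tr))
```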

ztangent commented 4 years ago

Thanks for this proposal @McCoyBecker! Before responding in more detail, have you taken a look at the JuliaPOMDP organization and the POMDPs.jl package? They have a couple of nice interfaces for defining new MDPs and POMDPs that I think are quite flexible, and I was considering possibly wrapping them within a Gen interface for some agent modelling and goal inference work I'm doing.

ztangent commented 4 years ago

(Also, @McCoyBecker I see that you're working at CRA -- I've been meeting with some of your colleagues working on the COLTRANE project, and so if this open-universe MDP stuff is by chance related to that, I'd be happy to touch base in person!)

femtomc commented 4 years ago

@ztangent I did not investigate POMDPs.jl - let me put this on hold while I do so.

(This is actually not directly related to COLTRANE but is instead part of other projects we are spinning up, as well as an ongoing project under DARPA OFFSET. However, it is possible the ideas might influence work on COLTRANE, if they become sufficiently advanced. We actually dev’d some of the capabilities I hinted at in Python and I’m beginning to dev them in Julia for other research purposes.

However, if you’d like to discuss this work, feel free to reach out! I am familiar with the work on COLTRANE, and it is very thought provoking.)

femtomc commented 3 years ago

This would still be cool -- but I think it's necessarily something which would be handled outside of Gen.jl (potentially by a language package). This may also have already been handled by someone else.