ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] tensorflow models don't support object oriented treatment of variables #4134

Closed gehring closed 5 years ago

gehring commented 5 years ago

Describe the problem

With TF 2.0 on the horizon, tensorflow models in rllib should not exclusively depend on variable scopes to reuse variables. Similarly, graph collections should be avoided in favor of explicitly passing variable instances. Both variable scopes and collections will be deprecated in TF 2.0.

The new (and imo, better) way of handling variable sharing is to reuse tf.Variable instances and to organize variables as attributes of some container, i.e., modules, similar to how pytorch handles variables.

Currently, a significant portion of the TF 1.0 backend has already shifted to this approach, implementing all the necessary tools to maintain support for variable_scopes and tf.get_variable. The public API (e.g., tf.function/tf.defun, tf.wrap_function) might be sufficient for ray.rllib to support both approaches until support for TF 1.0 is dropped.

This could probably be hacked together with relatively few lines of code, but, with a proper refactoring, I suspect that a lot of the distinctions/restrictions caused by tf model vs. pytorch model can be avoided.

On top of improving pytorch/tf compatibility and future-proofing for TF 2.0, properly implementing this will likely lead to simpler and more readable code, which would benefit new users and make implementation of future algorithms less prone to bugs.

Things that would have to be done (in no particular order):

If there is any interest in this, I am happy to write up a more detailed design proposal and help with its possible implementation.

ericl commented 5 years ago

Thanks for bringing this up @gehring . I think getting TF 2.0 support right will be pretty important, and it would be great to have a proposal here.

If I understand correctly, the main benefit of getting rid of variable scopes is that models can be defined purely with tf.Variable, correct? Whereas in the current state of rllib, models need to use get_variable() for correctness.

If we switched to tf.Variable, then weight sharing in the policy losses can be handled just by calling the equivalent of .forward() on the model class, so something like this (apologies if this is inaccurate, I've only taken a cursory look at the TF 2.0 APIs):

model = Model()
action1_out = model.forward(input_placeholders_1)
action2_out = model.forward(input_placeholders_2)
sess.run(action1_out, feed_dict={input_placeholders_1: ...})

Also agree TF variable serialization could be improved.

hartikainen commented 5 years ago

I'd be happy to contribute to this as well.

gehring commented 5 years ago

@ericl That is essentially correct! The only potential issue that comes with dropping variable scopes is ensuring proper checkpointing behavior, but all the necessary code to handle that is already in tensorflow (easiest done by sub-classing the appropriate class).
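
Roughly, the object-based checkpointing I have in mind works like this (minimal sketch, assuming eager execution and a TF version that ships tf.train.Checkpoint):

import tensorflow as tf

# variables attached as object attributes/arguments are tracked as
# dependencies rather than looked up by scope name
bias = tf.Variable(tf.zeros([4]), name="bias")
ckpt = tf.train.Checkpoint(bias=bias)
save_path = ckpt.save("/tmp/oo_ckpt/model")  # object-based checkpoint
ckpt.restore(save_path)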

The concept of a Model that implements a model.forward() API is somewhat tangential to the handling of variables. Until a light API exists for handling models and their variable creation, like an updated (and therefore simplified) sonnet 1.0, the best option might be to define the simplest API that fits rllib's needs, leaving the custom model sub-class with the responsibility of creating variables only when desired.

Consider this bad custom model class, which improperly creates a new variable on every call instead of saving it as an attribute:

class MyModel(tf.Module):

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        # bug: a fresh variable is created on every call, so nothing is shared
        bias = tf.Variable(tf.zeros(self.dim))
        return x + bias

A TF 1.0 veteran might feel powerless seeing this, but for any other programmer familiar with object-oriented programming, this is completely intuitive behavior. While some helpful super-class which manages this automatically would be nice, I don't think a feature complete, complex Model API is required.
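
The fix is equally intuitive: create the variable once and store it as an attribute, roughly like this sketch:

class MyFixedModel(tf.Module):

    def __init__(self, dim):
        super().__init__()
        # created once and stored as an attribute, so every call to
        # forward() reuses the same variable
        self.bias = tf.Variable(tf.zeros(dim))

    def forward(self, x):
        return x + self.bias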

I've been swamped recently but I'll have a bit more time in the coming weeks to write up a concrete design we can iterate on. Hopefully what I just haphazardly tried to say will be clear then!

ericl commented 5 years ago

@gehring sounds good, I'm looking forward to it.

One thing I am left wondering after your explanation is how you can do weight sharing without a forward()-like API, i.e., for DQN, and for multi-gpu. Maybe I'm missing something here.

Also, switching from variable scopes to object-like reuse is necessarily a breaking API change, right?

gehring commented 5 years ago

@ericl Sorry, I was a lot less clear than I thought!

One thing I am left wondering after your explanation is how you can do weight sharing without a forward()-like API, i.e., for DQN, and for multi-gpu. Maybe I'm missing something here.

You are correct, rllib will need some API to apply the model to some input, so for that purpose a forward() method would do just fine (or apply, or even simply __call__).

When writing my last reply, I was thinking (but forgot to write) about other behaviors usually implemented by a module API (e.g., pytorch's Module and TF's Layer/Model/Network). Specifically, I was referring to the build() + call() interface commonly found in those APIs (see sonnet's AbstractModule or tensorflow's Metric (v2) and Layer).

I'm not trying to argue anything concrete, mostly just thinking out loud. What I was trying to say is that while it could be tempting to include those types of module features in an rllib Model re-design, I think it is best to keep rllib's side of the API as lean as possible.

The example class I wrote was just to make the case that even though users can mess up variable re-use when not given a module API to follow, those kinds of errors should be relatively obvious now that everything follows an object-oriented design.

Also, switching from variable scopes to object-like reuse is necessarily a breaking API change, right?

Actually, I believe we can quite easily keep full TF 1.0 compatibility while still switching to object-like reuse! The simplest approach would be to wrap the old style model constructor in a variable store context, i.e., tfe.EagerVariableStore, or to use a variable creator scope, i.e., tf.variable_creator_scope, if finer control over the bookkeeping and variable creation is needed (see how tf.defun (TF 1.0) / tf.function (TF 2.0) captures variables in function calls using a tf.variable_creator_scope-backed VariableHolder). Plus, variable creator scopes are part of both the TF 1.0 and 2.0 APIs.
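
For instance, a variable creator scope can intercept and record every variable built by the wrapped code (rough sketch):

import tensorflow as tf

def recording_creator(created):
    # records each variable created inside the scope before returning it
    def creator(next_creator, **kwargs):
        var = next_creator(**kwargs)
        created.append(var)
        return var
    return creator

created = []
with tf.variable_creator_scope(recording_creator(created)):
    w = tf.Variable(tf.zeros([4]), name="w")  # intercepted and recorded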

Here is a quick EagerVariableStore based prototype to support current rllib TF models in an object-oriented API:

import tensorflow as tf

from ray.rllib.models import ModelCatalog

tfe = tf.contrib.eager

class Modelv2(object):

    def __init__(self, num_outputs, config):
        self._num_outputs = num_outputs
        self._config = config
        self._container = tfe.EagerVariableStore()
        self._built = False

    def __call__(self, obs_dict, space):
        # a reference to every variable created is saved in self._container
        with self._container.as_default():

            # ensure proper variable scope based reuse
            scope = tf.get_variable_scope()
            with tf.variable_scope(scope, reuse=self._built):

                # old style rllib TF model
                model = ModelCatalog.get_model(
                        obs_dict, space, self._num_outputs, self._config)

        # make sure to reuse variables next time or else TF will complain
        self._built = True
        return model.outputs, model.last_layer

    def variables(self):
        return self._container.variables()
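
Usage would then look roughly like this (hypothetical inputs, just for illustration), with both calls sharing the same underlying variables:

# obs_dict_1, obs_dict_2 and obs_space stand in for real rllib inputs
model = Modelv2(num_outputs=4, config={})
out1, feat1 = model(obs_dict_1, obs_space)  # first call creates the variables
out2, feat2 = model(obs_dict_2, obs_space)  # second call reuses them
weights = model.variables()
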
ericl commented 5 years ago

Thanks for the detailed explanation, it makes a lot of sense! Also good to see there is a way to be backwards compatible.

hartikainen commented 5 years ago

I just came across this RFC for tf.Module. I think this will be super useful when making things more object-oriented.

ericl commented 5 years ago

Btw, any thoughts on eager mode? I haven't thought about this too much, but it seems like we should stick with graph based execution for TF2. Though, things like RNN handling could be easier with eager (but would still require integration work).

gehring commented 5 years ago

As far as I am aware, there is no graph API in TF 2.0. Graph execution is supported through the use of tf.function. As a result of this, everything will be eager compatible. To get the graph-based optimizations, things will have to be wrapped by tf.function. Roughly, to implement an old style class which looks like this:

class OldWay:
    def __init__(self, *args, **kwargs):
        self.session = tf.Session()
        self.my_train_op = create_train_op(*args, **kwargs)

    def train(self):
        return self.session.run(self.my_train_op)

you would write something similar to this:

class NewWay:
    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs
        self.my_train_func = tf.function(create_train_op)

    def train(self):
        # note the absence of an explicit tf.Session in tf 2.0
        return self.my_train_func(*self.args, **self.kwargs)

There are a few subtleties with tf.function but overall it is pretty straightforward. Tensorflow traces the execution of the function and caches the graph, re-tracing if called with different arguments (to allow things like is_training=True arguments).

NOTE: my example is probably an overly awkward way of implementing this. I wrote it this way to make the mapping from old to new as clear as possible. I would expect implementing everything in a numpy/pytorch style to look a bit different and cleaner.
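
To make the re-tracing behavior concrete, a rough sketch:

import tensorflow as tf

@tf.function
def scale(x, is_training=True):
    # python-level branching is baked into each traced graph
    if is_training:
        return x * 2.0
    return x

scale(tf.constant(1.0), is_training=True)   # traces and caches one graph
scale(tf.constant(1.0), is_training=False)  # new python argument -> re-trace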

@ericl I'm sorry for the delays on proposing something concrete. This move to eager and graph tracing is why I think agents can be made invariant to pytorch vs tensorflow if designed carefully. I will try to have something written that we can iterate on within a week or so.

ericl commented 5 years ago

As far as I am aware, there is no graph API in TF 2.0. Graph execution is supported through the use of tf.function. As a result of this, everything will be eager compatible. To get the graph-based optimizations, things will have to be wrapped by tf.function. Roughly, to implement an old style class which looks like this:

Right, so my question is whether we have to do anything here (or whether there is a concrete advantage to supporting eager in places). Since we need TF 1.x compat, I don't see us adopting new constructs for the loss definitions unless they also work in 1.x. However, it sounds like switching models is a no-brainer since it's cleaner and also backwards compatible.

@ericl I'm sorry for the delays on proposing something concrete. This move to eager and graph tracing is why I think agents can be made invariant to pytorch vs tensorflow if designed carefully. I will try to have something written that we can iterate on within a week or so.

No problem. I still don't quite understand what you mean by making it invariant to TF vs pytorch though. Isn't this already the case since all the TF specific code is contained in TFPolicyGraph, unless you mean the TF vs PyTorch model API in particular?

gehring commented 5 years ago

Right, so my question is whether we have to do anything here (or whether there is a concrete advantage to supporting eager in places). Since we need TF 1.x compat, I don't see us adopting new constructs for the loss definitions unless they also work in 1.x. However, it sounds like switching models is a no-brainer since it's cleaner and also backwards compatible.

In the long run, I would strongly encourage a full switch to the TF 2.0 way. It will be easy to support TF 1.x with a TF 2.0 design, but I don't see how the other way around would be possible. I agree with you that a full switch is not necessary yet, but now is probably a good time to start thinking about what ray + TF 2.0 will look like, to make sure that 1) new classes play nice together and 2) switching requires as little refactoring as possible. Over time, I would expect the code base to be simpler to maintain (eager/TF2 encourages better coding practices) and more likely to be compatible with future TF 2.x features.

All that being said, there is no need to make the jump all at once. I am happy focusing on solving this issue (i.e., OO tensorflow variables) before considering broader TF 2.0 support.

I still don't quite understand what you mean by making it invariant to TF vs pytorch though.

I'm thinking along the lines of the following. In a perfect world, we can have all of the tensor operations depend on operator overloading, allowing the agent/policy/loss code to take TF tensors (from a TF model) or PyTorch tensors (from a pytorch model) as input.

Of course, the reality is that the API differences make this impossible but with a TF 2.0 approach, these differences are going to be minor and mostly limited to different naming conventions and variable treatment. This opens up a lot of opportunities for code re-use that wouldn't normally be considered before. However, the devil is in the details. Better code re-use could be impossible or add an undesirable amount of complexity so maybe I won't have anything better to propose after diving in!

Naturally, I don't intend to implement anything in that regard until I have a design worked out that you all feel is worth the effort. (As a side note, with rllib growing in complexity and the release of TF 2.0 close by, I feel like this is a good time to make sure rllib's foundation is as flexible and easy to maintain as possible.)

ericl commented 5 years ago

I definitely see the value of thinking about 2.0 support more holistically. Longer term, there are also interesting ways this could tie in with differentiable ray actor methods (another experimental Ray+TF2 project).

I'm thinking along the lines of the following. In a perfect world, we can have all of the tensor operations depend on operator overloading, allowing the agent/policy/loss code to take TF tensors (from a TF model) or PyTorch tensors (from a pytorch model) as input.

Oh I see, I can see this being hard to pull off.

Btw, another idea I've kicked around is whether we can embed TF models in torch graphs and vice versa. This limits the scope of interactions between the frameworks and might make interop more tractable. I also think this is what most users mean when they want "PyTorch" support -- just the ability to plug in custom models written in torch.

In principle this is doable with custom TF/torch ops (one incomplete example is here: https://github.com/tensorflow/cleverhans/blob/master/cleverhans/utils_pytorch.py). I never got that far into investigating these, however; there are at least issues around handling gradients and how different TF optimizers work.

gehring commented 5 years ago

Btw, another idea I've kicked around is whether we can embed TF models in torch graphs and vice versa. This limits the scope of interactions between the frameworks and might make interop more tractable. I also think this is what most users mean when they want "PyTorch" support -- just the ability to plug in custom models written in torch.

This should be fairly easy to do. Your comment gave me the following idea: we implement a base Model class with an optional overridable gradient function. If all operations are compatible with tensorflow, then nothing special is required apart from overriding the apply function (exact names TBD). Otherwise, the custom gradient function needs to be overridden.

Additionally, we can use this to implement a pytorch model which provides the gradient implementation for pytorch functions. This should be straightforward using tensorflow's tf.custom_gradient. Chances are this can be done such that the op is closed on itself, supporting higher order derivatives, though whether or not it will make the code overly complex remains to be seen (that wouldn't affect the public API, though).
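
A rough eager-mode sketch of the idea (illustrative only; note the numpy copies at the framework boundary, which come up again below):

import tensorflow as tf
import torch

def torch_fn_as_tf(torch_fn):
    # wraps a differentiable torch function so it can be applied to TF
    # tensors; values cross the framework boundary through numpy copies
    @tf.custom_gradient
    def wrapped(x):
        x_t = torch.tensor(x.numpy(), requires_grad=True)
        y_t = torch_fn(x_t)

        def grad(dy):
            y_t.backward(torch.tensor(dy.numpy()))
            return tf.constant(x_t.grad.numpy())

        return tf.constant(y_t.detach().numpy()), grad

    return wrapped

# hypothetical usage on an eager TF tensor
double = torch_fn_as_tf(lambda t: t * 2)
y = double(tf.constant([1.0, 2.0]))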

Once we've nailed down the Model method signatures, I can go ahead and implement it. It would be good to keep the model API light, limiting the class to being a generic parametric function. That would probably be the most intuitive to users wishing to implement custom models. Here is a first go at it:

import abc


class Model(abc.ABC):

    def __init__(self, num_outputs, **kwargs):
        # To be overridden as needed. Model specific config arguments
        # are passed as keyword arguments.
        pass

    @abc.abstractmethod
    def apply(self, input_dict):
        # This must be overridden. The expected behavior should be the
        # same as the current `_build_layers_v2` but with a single returned tensor.
        raise NotImplementedError

    def backwards(self, dout):
        # This should be overridden if special gradient functions are needed.
        pass

    def variables(self):
        # returns all variables as tensors (this will be implemented)
        pass

    def trainable_variables(self):
        # returns all trainable variables as tensors (this will be implemented)
        pass

The current API seems to have a lot of different methods and returns two tensors from _build_layers_v2. Do we need all the other stuff, or can it be worked in through other means? For example, agents that expect models which provide values and features could simply be provided with a subclass with two call functions, e.g., value_function and features. This would make it very clear what the agent expects without cluttering the API for agents that don't need it.
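
i.e., something along these lines, building on the Model sketch above (hypothetical names):

class ValueModel(Model):
    # for agents that expect a value estimate and shared features

    @abc.abstractmethod
    def features(self, input_dict):
        # shared hidden representation
        raise NotImplementedError

    @abc.abstractmethod
    def value_function(self, input_dict):
        # state-value estimate expected by e.g. actor-critic agents
        raise NotImplementedError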

Otherwise, the biggest difficulty I anticipate will be integrating OO models into the current agents. We might be able to cache calls to the model catalog based on the scope in which it was called and use that as a mechanism to support scope-based agents. This would allow agents to be switched over one at a time. What do you think?

ericl commented 5 years ago

Once we've nailed down the Model method signatures, I can go ahead and implement it.

Looks promising. I would try implementing a couple cases to see if there are any issues:

For example, agents that expect models which provide values and features could simply be provided with a subclass with two call functions, e.g., value_function and features

You'd need to handle layer sharing somehow, right? I don't know how that would work exactly. Would you save the outputs of parts of apply() internally? It would be awkward to pass the output of features() to e.g. value_function() and loss() and policy_output(). Maybe a single apply method that returns a dict of outputs makes sense.
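
i.e., roughly something like this hypothetical sketch (the layer attributes are placeholders), where the shared layers live inside a single call:

def apply(self, input_dict):
    features = self._hidden_layers(input_dict["obs"])
    return {
        "features": features,
        "policy_logits": self._policy_head(features),
        "value": self._value_head(features),
    }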

We might be able to cache calls to the model catalog based on the scope in which it was called and use that as a mechanism to support scope-based agents. This would allow agents to be switched over one at a time. What do you think?

That makes sense. Another option is to not support "new-type" models until the agent has been ported. So during the transition period some agents would support both new/old, others only old.

gehring commented 5 years ago

@ericl I'm making a bit of progress here, but before I can continue I need to know which versions of tensorflow Ray and RLlib are trying to support. Several significant additions to the API and backend were made along the way from 1.11 to 1.13 which would make some parts of the code cleaner but aren't necessarily mandatory.

RE: pytorch in tensorflow idea

I was able to write out a prototype for embedding pytorch functions within a TF graph, but we might not be able to properly support tensorflow's "implicit" variable based API (i.e., stateful dependency vs. function arguments) without some extra work. It is definitely possible but the only elegant solution I can think of would require implementing a new custom tf.Variable class for the TF optimizers to behave as expected.

On the surface, implementing a custom TF variable class doesn't seem too difficult, but I am worried there could be a lot of caveats since tf.Variable was never intended to be an interface (also, since the addition of resource variables in TF, the backend code is fairly messy and complex). However, there is a good chance that the tensorflow devs might be interested in enabling embedded pytorch modules, so we might be able to get someone familiar with the tf.Variable code to check out any prototype we have.

The good news is that TF 2.0 is supposed to introduce a proper public tf.Variable interface. If we run into weird corner cases in TF 1.0, we might be able to keep the feature experimental until TF 2.0 is released, making it a proper TF 2.0 exclusive feature then.

Given this would require a bigger effort than implementing a method or two, it probably merits its own issue, which we can address once we've committed to the new model API. Do you think there is strong enough interest in this feature from users, or from you and the rest of the ray/rllib devs? Let me know what you think!

ericl commented 5 years ago

@gehring for a new feature, I think it's fine if it only works on the newest versions. However we shouldn't break compatibility overall for older models / older TF versions, since some RLlib users are stuck on quite old TF versions.

re: pytorch in TF

I think this would be useful for many users, and we would be interested in adding this to RLlib (and I suspect it would be valuable outside of RLlib as well). However, it sounds like there are a bunch of potential issues like the Variable handling. Also, how about GPU acceleration? Seems like you would be potentially copying data CPU<->GPU a lot.

Agree on making a separate issue for this.

gehring commented 5 years ago

some RLlib users are stuck on quite old TF versions.

How old are we talking?

The reason this is important is that a lot of tensorflow's function tracing functionality received a considerable amount of polish in the last two releases, as well as fixes for bugs and inconsistencies. Using tensorflow's function wrapping methods would allow for a concise, clean implementation and would spare us from re-implementing lots of little things, e.g., variable lifting and graph re-use, with the added bonus of supporting eager style code.

If you give me a number, I can check what TF API calls are available and make an educated guess as to what would be the easiest way to organize and implement model 2.0 with that version. Maybe, if needed, we can "negotiate" the version up until we find a good balance between supporting older versions and a more elegant implementation. If I had my way, I would force 1.13.1 but that is probably not reasonable! ;)

Also, how about GPU acceleration?

In my mind, any solution requiring memory copy of variables or gradients, regardless of CPU vs GPU, is a non-starter since I anticipate the performance hit would simply be too big to be practical. Though, I guess if the opportunity arises, it might be worth benchmarking to confirm its magnitude.

The solution I have in mind wouldn't require variable copies but I am not quite sure what is the best way to avoid it for the gradients. I haven't really looked into it yet but it is possible that a custom tf.Tensor could do the trick in the same way a custom tf.Variable solves the variable handling. Otherwise, we would probably have to create a "native" tf.Tensor which shares its data array with pytorch in such a way that everything plays nice with any caching mechanisms.

ericl commented 5 years ago

How old are we talking?

Well, we have stubs to support 1.5.0... I think it depends on how bad it is to support older versions. I would prefer not to depend on eager / function tracing right now since it's quite new and we can get the majority of the benefits without fully porting to that model.

The solution I have in mind wouldn't require variable copies but I am not quite sure what is the best way to avoid it for the gradients.

Yeah, perhaps copying intermediate values isn't that bad. What you pass into custom_gradient() isn't that large, iirc; mostly it would just add some latency. We already have to pay the copy cost once for ingesting observations, so this would be at worst 3x that.

ericl commented 5 years ago

Btw, I had an old issue for the embedding torch models: https://github.com/ray-project/ray/issues/3885

gehring commented 5 years ago

@ericl I apologize for the one step forward, two steps back that I've been doing, but how much pushback would there be against adding Sonnet as a dependency for RLlib and using that as a back-end for models? (This would also require the minimum version of TF to be 1.5.)

The reason I ask is that 1) it offers all the tools necessary for us to implement OO models while still fully supporting graph mode and variable scopes, 2) Sonnet 2.0 is currently in development, which is likely to make switching to TF 2.0 very easy once the time comes, and 3) we could easily support the same (or at least a very similar) model API for backwards compatibility by implementing a custom module.

Otherwise, what's special about TF 1.5? In this version, they have some helpful tools for re-using graphs and the very powerful tf.make_template function (added in 1.4). Even if we don't want to use Sonnet, it might be worth considering a version that has these tools rather than reimplementing a lot of existing functionality.
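
For reference, tf.make_template gives scope-based reuse almost for free (quick sketch):

import tensorflow as tf

def bias_layer(x):
    # variables created via tf.get_variable inside a template are built on
    # the first call and transparently reused on later calls
    b = tf.get_variable("bias", shape=x.shape[-1:],
                        initializer=tf.zeros_initializer())
    return x + b

shared_bias = tf.make_template("shared_bias", bias_layer)
out1 = shared_bias(tf.placeholder(tf.float32, [None, 4]))
out2 = shared_bias(tf.placeholder(tf.float32, [None, 4]))  # same "bias" reused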

ericl commented 5 years ago

Hm, I don't think we should adopt a dependency on any particular library (e.g., Sonnet) for something as core as models. The issues are that we would need to keep up with that project's development, and that users have different choices of third-party model libs. I would rather bump the TF dependency requirement if needed.

I think some ideal compatibility story would be as follows, assuming older versions can't provide full compatibility:

In addition we can say only certain algorithms support v2 models initially.

gehring commented 5 years ago

I understand being reluctant to have a third-party dependency in models, but let me try to make a slightly stronger case for Sonnet. First, for context, Sonnet is the in-house library developed by DeepMind and is widely used internally. The reason Sonnet was created rather than simply using other TF solutions (and the reason why Sonnet 2.0 is being developed) is that other solutions were too "intrusive" and limiting.

The core of Sonnet is relatively small and lightweight. The only component of it that we would need to touch is the AbstractModule; the tf.Module referenced earlier is its TF 2.0 spiritual successor. The AbstractModule class does essentially exactly what we would like to do with the new model class.

I've used Sonnet before and, from my experience, the buy-in is almost zero. It is compatible with all of tensorflow and doesn't enforce any restrictions on its inputs or outputs so it can be integrated anywhere. On top of that, we should easily be able to hide it away from the users if we want to, simply using it to wrap the old style calls.
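
Roughly, the wrapper could look like this (sketch, assuming Sonnet 1.x and rllib's existing ModelCatalog):

import sonnet as snt

from ray.rllib.models import ModelCatalog

class WrappedModel(snt.AbstractModule):

    def __init__(self, num_outputs, config, name="rllib_model"):
        super(WrappedModel, self).__init__(name=name)
        self._num_outputs = num_outputs
        self._config = config

    def _build(self, obs_dict, obs_space):
        # AbstractModule handles variable scoping and reuse, so the
        # old-style model construction can be reused as-is
        model = ModelCatalog.get_model(
            obs_dict, obs_space, self._num_outputs, self._config)
        return model.outputs, model.last_layer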

The reason why I consider this a good move is that it would allow us to focus our efforts on adapting the core code and agents to using OO models right away instead of first having to implement, debug and maintain an in-house solution for the TF OO models. Once everything is OO, we will have a better idea about what the model class should be. We could then implement our own model backend to satisfy all our use cases if Sonnet is limiting us in some way.

You could see Sonnet more as a medium term fix than a full commitment. Once the time comes to jump to TF 2.0, we can drop Sonnet if we want to. The way TF 2.0 will work means that most of what we need is provided by tf.Module.

Again, I completely understand not wanting to go that route, but I would recommend seriously considering it. If you are willing to entertain the idea, we could pick an agent (maybe DQN?) and I can write a prototype DQN using an OO model backed by Sonnet, and go from there. What do you think?

ericl commented 5 years ago

If you are willing to entertain the idea, we could pick an agent (maybe DQN?) and I can write a prototype DQN using an OO model backed by Sonnet, and go from there. What do you think?

Yeah, I think having two concrete prototypes to choose from would be the way to go here. That would let us weigh the complexity introduced by supporting it ourselves vs. using a third-party library.

I think the bar would have to be pretty high to add a new dependency (something like saving 500+ lines of code).

By the way, Sonnet is Apache 2 licensed, so we could potentially also "borrow" just the necessary compatibility code for 1.x without acquiring a hard dependency.

joneswong commented 5 years ago

Hi Eric, in our company, users rely on diverse TFs ranging from v112 to v180. I am not familiar with the new features, so I am sorry that I cannot give any constructive comment right now. However, I had a discussion with a TF developer and he suggested we do a comprehensive survey and avoid committing until a TF 2.0 RC version emerges. I will continue to learn about TF 2.0 in the following days.

ericl commented 5 years ago

@joneswong what do you think about dropping support for TF < 1.13?

joneswong commented 5 years ago

@joneswong what do you think about dropping support for TF < 1.13?

I prefer to support that because there are many old-fashioned docs, code, and users in my company. BTW, scoping based on tf.make_template, like Sonnet and TensorForce do, may raise the bar of development according to my experience (at least in my company, most users do not have such a capability).

ericl commented 5 years ago

FYI, https://github.com/ray-project/ray/pull/4795 adds an "eager-style" definition of losses. This is done by auto-generating placeholders at runtime. Presumably with TF eager we could have the same style of code but wouldn't even need auto-placeholder generation.

ericl commented 5 years ago

This has been combined into https://github.com/ray-project/ray/issues/4134