tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Possibility to compile python code when @tf.function is used in TensorFlow 2.x #43800

Open soroosh-rz opened 3 years ago

soroosh-rz commented 3 years ago

Describe the feature and the current behavior/state.

With TensorFlow 2.x, eager mode is enabled by default, so using @tf.function is essential for good performance. However, the tf.function decorator (via autograph) always needs access to the original Python source code, which makes it impossible to compile our Python code into an executable build (binary). This in turn makes it impossible to obfuscate the code and can create a security concern in some situations.

Will this change the current API? How?

Probably yes, but if there is any workaround with the current API, please let me know!

Who will benefit with this feature?

Anyone who deploys a deep learning solution on a customer site or in the cloud, or anyone who wants to obfuscate their code.

mdanatg commented 3 years ago

The recommended way to deploy learning solutions is to use tf.saved_model. That generates a GraphDef protobuf which you can then load outside a Python runtime, for example from a C++ binary. The graph is a lowering of the original Python code, and is generally not possible to reverse.

That said, tf.function doesn't directly support ahead-of-time compilation into an executable binary, and that's a feature to consider.
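For reference, a minimal sketch of the tf.saved_model flow described above under TF 2.x; the toy module and the path are only placeholders:

import tensorflow as tf

class Scaler(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(2.0)

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return self.w * x

module = Scaler()
# Export: the traced graph (not the Python source) is what ends up on disk.
tf.saved_model.save(module, "/tmp/exported_model")

# Load it back without needing the original Python class
# (a C++ binary would use LoadSavedModel on the same artifact).
restored = tf.saved_model.load("/tmp/exported_model")
print(restored(tf.constant([1.0, 2.0])))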

soroosh-rz commented 3 years ago

Thanks for the comment. tf.saved_model can definitely help once the training stage is done, but when we have to do on-premise training we cannot obfuscate the code, and once the code has already been exposed, tf.saved_model does not help much. I am not sure whether there is a workaround at the moment, but a new feature (dealing with @tf.function) for the purpose of code obfuscation (or compiling) would be really useful, especially for security reasons!

mdanatg commented 3 years ago

@soroosh-rz could you clarify the bit about on-premise training? Did you mean that you cannot use saved_model before training is complete?

soroosh-rz commented 3 years ago

@mdanatg Sorry I did not explain it well, but the answer is yes! To elaborate: consider the case where we have spent time developing a good general model for a specific purpose or platform. Now we have to go to one or more customer sites where the data is stored on-prem, in order to first fit our model to their data. In this case you may not want to expose your code, but only adjust (train) your model on the new data. In tech startups and consulting companies this is a common scenario: you agree to deploy your models (and the downstream application) but not to hand over all your code and the mechanics behind it. With TensorFlow 1 and the session-based approach it is possible to compile the Python code and create a binary. But if we use the tf.function decorator in TensorFlow 2 it is impossible to compile or obfuscate the code, and if we don't use tf.function, eager mode is slow.

mdanatg commented 3 years ago

@ccrusius to confirm whether resuming training is supported after loading from a saved model

ccrusius commented 3 years ago

In the general case I think the answer is "no," as we don't support saving optimizers and their state, for example.

bhack commented 3 years ago

In the general case I think the answer is "no," as we don't support saving optimizers and their state, for example.

Yes we are still waiting to add a warning with https://github.com/tensorflow/tensorflow/pull/42846

limhj23 commented 2 years ago

Has this been solved by any method? I am in exactly the same situation as @soroosh-rz and am looking for a way to work around it.

mdanatg commented 2 years ago

I've been able to successfully retrain a saved Keras model, so I think that is a valid alternative at the moment. @tomerk for more info; I think there may be some tutorials around.

I think this works at a lower level with tf.function as well, as long as you only save the model itself, without any of the training scaffolding, and re-create any optimizers/gradients after loading.
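A rough sketch of what that could look like with a saved Keras model under TF 2.x (the toy model, data, optimizer choice and paths are all made-up placeholders):

import numpy as np
import tensorflow as tf

# Pretend this is the model shipped to the customer (architecture + weights, no training scaffolding).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.save("/tmp/base_model")  # SavedModel format

# On the customer side: load it, re-create the optimizer/loss, and keep training.
restored = tf.keras.models.load_model("/tmp/base_model")
restored.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")
restored.fit(x, y, epochs=1)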

limhj23 commented 2 years ago

I've been able to successfully retrain a saved Keras model, so I think that is a valid alternative at the moment. @tomerk for more info; I think there may be some tutorials around.

I think this works at a lower level with tf.function as well, as long as you only save the model itself, without any of the training scaffolding, and re-create any optimizers/gradients after loading.

I don't understand. How would you train further from the saved Keras model without having tf.function in the training step?

Also, this issue occurs in the same way when you map a parse function while creating a tf.data.Dataset, so a way to work around tf.function is really needed.

mdanatg commented 2 years ago

Ah, you're right, your use case is different.

We might be able to resolve the part about autograph: at runtime, autograph saves a temporary file with the transformed Python, and that could be used by the compiler instead.

Are you using open-source tools for the compilation? If so we could look into adding support for it.

limhj23 commented 2 years ago

I am using pyarmor to obfuscate my Python script. A brief version of my code is below:

import tensorflow as tf


class Trainer():
    def __init__(self, x):
        """ skipped for brevity: builds self.model, self.loss_fn, self.train_dataset
        and the optimizer / metric / checkpoint state referenced below """

    def train_step(self):
        @tf.function
        def default_grad(x, y, optimizer, train_metric):
            with tf.GradientTape() as tape:
                logits = self.model(x, training=True)
                loss_value = self.loss_fn(y, logits)
            grads = tape.gradient(loss_value, self.model.trainable_weights)
            optimizer.apply_gradients((grad, var) for (grad, var) in zip(grads, self.model.trainable_weights) if grad is not None)
            train_metric.update_state(y, logits)
            return -loss_value

        return default_grad

    def test_step(self):
        @tf.function
        def default_test(x, y, val_metric, val_acc):    
            val_logits = self.model(x, training=False)
            loss_value = self.loss_fn(y, val_logits)
            val_metric.update_state(y, val_logits)
            val_acc.update_state(y, val_logits)
            return -loss_value

        return default_test

    def do_train(self):
        train_fn = self.train_step()
        test_fn = self.test_step()
        for i, (x_batch_train, y_batch_train, _) in enumerate(self.train_dataset):
            if i == num_steps:
                break
            train_loss = train_fn(x_batch_train, y_batch_train, optimizer, train_metric)
            if checkpoint:
                for j, (x_batch_val, y_batch_val, _) in enumerate(val_dataset):
                    val_loss = test_fn(x_batch_val, y_batch_val, val_metric, val_acc)

and there are other scripts that form a component together with the script above. These scripts are obfuscated together through pyarmor. At runtime there is a parent-process .py file that holds every deep learning computation, and this parent process calls the Trainer above as a subprocess. This parent file is what the pyarmor package decodes at runtime.

I would really appreciate your help and advice

mdanatg commented 2 years ago

Thanks. Basically this would require autograph to work in offline mode. It doesn't support that at the moment, so it would need some improvements to work. I don't think it can easily work with the current API.

The only workaround would be to turn autograph off (i.e. @tf.function(autograph=False)), but that may require rewriting your control flow by hand (i.e. you'd need to write tf.while_loop instead of for i in tf.range(...)). It's worth a try.
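For example, here is a made-up data-dependent loop written without autograph; the Python while loop it replaces is shown in the comment:

import tensorflow as tf

# With autograph you could simply write:
#     while tf.reduce_sum(x) > 1:
#         x = tf.tanh(x)
# With autograph=False the same loop has to be spelled out with tf.while_loop.
@tf.function(autograph=False)
def tanh_loop(x):
    return tf.while_loop(
        cond=lambda v: tf.reduce_sum(v) > 1.0,
        body=lambda v: (tf.tanh(v),),
        loop_vars=(x,),
    )[0]

print(tanh_loop(tf.constant([2.0, 2.0])))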

For adding the necessary autograph support --

Currently, autograph reads the python function, and generates a temporary Python file with the transformed code, which it then loads. What we'd need is to be able to tell autograph to run in two modes: (1) generate those files in your sources directory, (2) run in a mode which just imports those sources instead of transforming dynamically again.

Basically, in mode (1), this call would need to save the Python code in your working directory. In mode (2), the same code would need to import and then call the function saved in mode (1). That should work with the pyarmor limitations.

limhj23 commented 2 years ago

Thank you for your explanation. This sounds like a tough job to me, but let me try to make it clear. What I understood is that one option is turning off autograph and rewriting the control flow, and the other option is keeping autograph by doing modes 1 and 2. In mode 1, would the Python code to be transformed through the call be the entire script with the trainer, or just the tf.function part?

mdanatg commented 2 years ago

Yes, sorry you don't have any good options right now.

The second option would likely need us to implement some improvements, otherwise it would take quite a bit of work for you to complete.

For mode 1, it would be just the tf.function parts, as before. The way I think it could work is that you'd run the Python script once (without obfuscation) to trigger the transformations (e.g. a fake training run or something along those lines); after that run, a bunch of new Python source files would appear next to your code, so that pyarmor can pick them up and compile them into the binary. For example, if you had a foo/bar.py with a @tf.function around a def fn() function, you'd see a new foo/bar_fn.py file or something along those lines.

limhj23 commented 2 years ago

Hi @mdanatg, back again to ask for your help. I was away from this issue for a while due to other work, and I recently started on it again. For modes 1 and 2 explained above, I managed to get the code converted by the transformation call; the result looks like the snippet below. For context, it is about generating anchor boxes, and here I am trying to map images with their anchors into a tf.data.Dataset.

# coding=utf-8
def outer_factory():

    def inner_factory(ag__):

        def tf__encode_batch(self, batch_images, gt_boxes, cls_ids, idx_info):
            with ag__.FunctionScope('encode_batch', 'fscope', ag__.STD) as fscope:
                do_return = False
                retval_ = ag__.UndefinedReturnValue()
                images_shape = ag__.converted_call(ag__.ld(tf).shape, (ag__.ld(batch_images),), None, fscope)
                batch_size = ag__.ld(images_shape)[0]
                labels = ag__.converted_call(ag__.ld(tf).TensorArray, (), dict(dtype=ag__.ld(tf).float32, size=ag__.ld(batch_size), dynamic_size=True), fscope)

                def get_state():
                    return (labels,)

                def set_state(vars_):
                    nonlocal labels
                    (labels,) = vars_

                def loop_body(itr):
                    """ something long inside """
                try:
                    do_return = True
                    retval_ = (ag__.ld(batch_images), ag__.converted_call(ag__.ld(labels).stack, (), None, fscope), ag__.ld(idx_info))
                except:
                    do_return = False
                    raise
                return fscope.ret(retval_, do_return)
        return tf__encode_batch
    return inner_factory

I found that ag__ is a reference to the autograph module, but I am not sure what I should do to replace the original py_function with the code above.

I also tried the first method, converting the function into TF-specific control flow, and I am struggling with that too. But I am not sure it is okay to ask about it here, because it is more specifically about using TensorArray in graph execution than about autograph.

mdanatg commented 2 years ago

If it's a one-off, I might be able to help translate the code. If you're looking for a more general solution, read on -

To replace the original py_function function, the most common method is to use a decorator. That's what tf.function does, for example.

So technically you could write a wrapper function like:

def cached_autograph(f):
  if <mode 1>:
    ret_fn = tf.autograph.to_graph(f)
    <save ret_fn to persistent file>
  else:
    ret_fn = <import persistent file>
    (re-attach ag__, closure, globals to ret_fn)
  return ret_fn

@tf.function(autograph=False)
@cached_autograph
def encode_batch(...)
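A very rough, untested sketch of what such a wrapper could look like, using tf.autograph.to_code for the generation step. The cache directory is made up, the ag__ lookup relies on TF internals that may move between versions, and closures are not re-attached here:

import importlib.util
import os

import tensorflow as tf
from tensorflow.python.autograph.impl import api as autograph_api  # internal, not a stable API


def cached_autograph(f, cache_dir="ag_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f.__name__ + "_ag.py")

    # Mode 1: generate the transformed source once and persist it.
    if not os.path.exists(path):
        with open(path, "w") as out:
            out.write(tf.autograph.to_code(f))

    # Mode 2: import the persisted source instead of transforming again.
    spec = importlib.util.spec_from_file_location(f.__name__ + "_ag", path)
    module = importlib.util.module_from_spec(spec)
    # The generated code expects the original function's globals (tf, helpers, ...)
    # plus the dynamic ag__ namespace.
    module.__dict__.update(f.__globals__)
    module.__dict__["ag__"] = autograph_api.PyToTF().get_extra_locals()["ag__"]
    spec.loader.exec_module(module)
    # Depending on the TF version, to_code emits either a bare tf__<name> function
    # or outer_factory()/inner_factory(ag__) wrappers; adjust the lookup accordingly.
    return getattr(module, "tf__" + f.__name__)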

Side note: if you'd be willing to try rebuilding TF from source, it might be easier to do everything inside the existing infra, because it already does things like attaching the closure and globals to the new function. The integration point would be somewhere around here:

https://github.com/tensorflow/tensorflow/blob/52e8fcf5b5db8a830e4dcde1322ae01ca53dd42c/tensorflow/python/autograph/pyct/transpiler.py#L464

def transform_function(self, fn, user_context, precompilation=None):

# ... existing code ...
logging.log(1, '%s is not cached for subkey %s', fn, cache_subkey)

# >>> begin new code
if precompilation == 'use':
  nodes, ctx = <import from persistent file>
else:
# <<< end new code

  # TODO(mdan): Confusing overloading pattern. Fix.
  nodes, ctx = super(PyToPy, self).transform_function(fn, user_context)

# >>> begin new code
  if precompilation == 'generate':
    <save nodes and ctx to cache file>
# <<< end new code

parser.unparse might be of help transforming nodes into source code.
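For instance, the nodes part of the <save ...> and <import ...> placeholders above could be filled in roughly like this (parser is tensorflow.python.autograph.pyct.parser; cache_path is a made-up variable, and serializing ctx is a separate question discussed further down the thread):

from tensorflow.python.autograph.pyct import parser

# precompilation == 'generate': persist the transformed AST as source text.
with open(cache_path, "w") as f:
    f.write(parser.unparse(nodes))

# precompilation == 'use': read it back into a gast node.
with open(cache_path) as f:
    nodes = parser.parse(f.read())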

ricvo commented 2 years ago

@mdanatg Thanks for the explanations. When you suggest rebuilding TF, how would one pass precompilation to transform_function? Is the tf.function decorator somehow calling transform_function under the hood, or is this another decorator to call additionally, similar to the cached_autograph you suggest above?

A clarification: AFAIK, when (for example) a different numpy array or float value is passed to a tf.function, the function is retraced (potentially generating a different graph). Does this mean that only the traces explicitly done in the precompilation='generate' phase from source code will be available later in the compiled library in precompilation='use' mode? Or is tracing not influenced by this, so it can still be done later by the compiled code? Could you clarify this point?

Is there any plan to increase support for the compilation of tf.functions in the future?

Thanks

mdanatg commented 2 years ago

is the tf.function decorator somehow calling transform_function under the hood

Yes, tf.function calls autograph, from here, and autograph eventually calls transform_function.

does this mean that only the traces explicitly done in the precompilation='generate' phase from source code will be available later in the compiled library in precompilation='use' mode

It's true that autograph is called once for each retrace. However, it only really needs to run once, because the transformed code is reused for all future traces. There is a strong assumption that the precompiled code is invariant to the arguments of the function. For that reason, transform_function can afford to employ a cache, which means that as long as you trace once, the source code should be available for all future retraces.
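As a made-up illustration of that caching behavior:

import tensorflow as tf

@tf.function
def scale(x):
    return x * 2.0

scale(tf.constant(1.0))         # first call: autograph transforms the source, then traces
scale(tf.constant([1.0, 2.0]))  # new input signature: tf.function retraces the graph,
                                # but reuses the cached transformed source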

Is there any plan to increase support for the compilation of tf.functions in the future?

Yes, we hope to offer a more powerful interface for compilation in the future, though at the moment we're focused more on resolving some of the semantic inconsistencies of tf.function.

ricvo commented 2 years ago

Thanks a lot for the very clear answers. I would like to attempt what you suggest. How would you proceed to implement the following steps: <save nodes and ctx to cache file> and nodes, ctx = <import from persistent file>? Are there already some utils for saving those that I could use?

I was also trying an alternative route: using tf.autograph.to_code to output interpretable source code and then compiling that with tf.function(autograph=False). But I have a problem in that I do not know how to define ag__, e.g.

ag__ = tf.autograph

@tf.function(autograph=False)
def tf__tanh_loop(x):
    with ag__.FunctionScope('tanh_loop', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
        do_return = False
        retval_ = ag__.UndefinedReturnValue()

        def get_state():
            return (x,)

        def set_state(vars_):
            nonlocal x
            (x,) = vars_

        def loop_body():
            nonlocal x
            ag__.converted_call(ag__.ld(tf).print, (ag__.ld(x),), None, fscope)
            x = ag__.converted_call(ag__.ld(tf).tanh, (ag__.ld(x),), None, fscope)

        def loop_test():
            return (ag__.converted_call(ag__.ld(tf).reduce_sum, (ag__.ld(x),), None, fscope) > 1)
        ag__.while_stmt(loop_test, loop_body, get_state, set_state, ('x',), {})
        try:
            do_return = True
            retval_ = ag__.ld(x)
        except:
            do_return = False
            raise
        return fscope.ret(retval_, do_return)

AttributeError: module 'tensorflow._api.v2.autograph' has no attribute 'FunctionScope'

I tried also with

ag__ = tf.autograph.operators

but got an error as well

AttributeError: module 'tensorflow.python.autograph.operators' has no attribute 'FunctionScope'                                               

I'd be happy to contribute a pull request if it could be useful.

ricvo commented 2 years ago

@mdanatg some further clarifications regarding my previous message. I noticed that:

  1. nodes is a gast.gast.FunctionDef object; as you mentioned above, it can be transformed to source code with parser.unparse(nodes). In the specific case of the tanh_loop function above, it leads to the same code that tf.autograph.to_code produced, with the only difference being the function name (tanh_loop instead of tf__tanh_loop). Output of parser.unparse(nodes) below:

    # coding=utf-8
    def tanh_loop(x):
        with ag__.FunctionScope('tanh_loop', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
            do_return = False
            retval_ = ag__.UndefinedReturnValue()

            def get_state():
                return (x,)

            def set_state(vars_):
                nonlocal x
                (x,) = vars_

            def loop_body():
                nonlocal x
                ag__.converted_call(ag__.ld(tf).print, (ag__.ld(x),), None, fscope)
                x = ag__.converted_call(ag__.ld(tf).tanh, (ag__.ld(x),), None, fscope)

            def loop_test():
                return (ag__.converted_call(ag__.ld(tf).reduce_sum, (ag__.ld(x),), None, fscope) > 1)
            ag__.while_stmt(loop_test, loop_body, get_state, set_state, ('x',), {})
            try:
                do_return = True
                retval_ = ag__.ld(x)
            except:
                do_return = False
                raise
            return fscope.ret(retval_, do_return)

    The string can also be converted back with parser.parse, which means that nodes could be saved in a text file and then restored. This seems OK.

  2. ctx is instead a tensorflow.python.autograph.pyct.transformer.Context object, and I saw that internally it holds references to other objects such as EntityInfo, naming.Namer and converter.ProgramContext. So, in the absence of ready-made utils, I guess I should go to the end of the chain of objects, save all the parameters needed to create a ctx object, and then load them with a function that recreates the objects pointed to by ctx from these parameters and finally creates the ctx object.

  3. Could you give some insights on how to pass precompilation to transform_function (passing it through tf.function, maybe just as an extra kwarg)?

  4. What do you think about the other approach I mentioned above: spelling out the tf.functions with autograph before compilation and then compiling without autograph, with @tf.function(autograph=False)? Related to this, would you happen to know what ag__ should be for the Python interpreter to parse the code?

Please let me know your thoughts about this. Thank you

mdanatg commented 2 years ago

<save nodes and ctx to cache file> and nodes, ctx = <import from persistent file>: are there already some utils for saving those that I could use?

There are some functions in parser.py that might be of help.

how to define ag__

The module is dynamic at the moment, which has been a source of headaches. It's a bit of a runaround (there is no dedicated API), but you could obtain it by calling api.PyToTF().get_extra_locals()['ag__'] (defined here).
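In other words, something along these lines (the internal import path is an assumption and may move between TF versions):

from tensorflow.python.autograph.impl import api as autograph_api  # TF internals

# The dynamic namespace that the generated code refers to as ag__.
ag__ = autograph_api.PyToTF().get_extra_locals()['ag__']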

(tanh_loop instead of tf__tanh_loop)

I think the function is renamed later, so that should be ok. Not 100% sure though, you'd have to verify.

save all the parameters needed to create a ctx object, and then load them with a function that recreates the objects pointed to by ctx from these parameters and finally creates the ctx object

That's about right. The context object was not designed for serialization, but I think it should be straightforward to add a few utilities for doing it; it's a PODO. You may want to add a couple of "encode/decode" methods to avoid leaking its contents. The user field might be a bit tricky: we'd probably need to define an interface and require users to implement that interface for serialization.

how to pass precompilation to transform_function (passing it through tf.function, maybe just as an extra kwarg)?

Yea, threading an extra kwarg sounds like the best option. It's not immediately clear which functions should receive it, so we'd have to prototype and iterate a bit.

What do you think about the other approach I mentioned above: spelling out the tf.functions with autograph before compilation and then compiling without autograph

I think that's worth a try as well. If we resolve the ag__ issue, then it would be fairly clean.

stanislav-xalgo-io commented 3 months ago

Have there been any updates on this? It would be a very useful feature!