luke-who / Federated-Learning-Project

A project that investigated, designed, and evaluated different methods to reduce overall uplink communication (client -> server) during federated learning
MIT License

Performance issue about tf.function #1

Open · DLPerf opened 1 year ago

DLPerf commented 1 year ago

Hello! Our static bug checker has found a performance issue in tff_tutorials/custom_federated_algorithms,_part_2_implementing_federated_averaging.py: `batch_train` is repeatedly called in a for loop, but a tf.function-decorated function `_train_on_batch` is defined and called inside `batch_train`.

In that case, when `batch_train` is called in a loop, the function `_train_on_batch` will create a new graph every time, which can trigger the tf.function retracing warning.

Here is the TensorFlow documentation that supports this.
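
For illustration, a minimal sketch that demonstrates the per-call tracing; the names here are made up and not from the tutorial. The print statement executes only while a graph is being traced, so it fires on every iteration:

import tensorflow as tf

def outer_step(x):
    # A brand-new tf.function object is created on every call to outer_step,
    # so TensorFlow traces a fresh graph each time.
    @tf.function
    def inner(y):
        print("Tracing!")  # executes only during tracing
        return y * 2
    return inner(x)

for i in range(5):
    outer_step(tf.constant(i))  # prints "Tracing!" on every iteration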

Briefly, for better efficiency, it is preferable to use:

import tensorflow as tf

@tf.function
def inner():
    pass  # traced once; the cached graph is reused on later calls

def outer():
    inner()

rather than:

def outer():
    # A new tf.function object is created on every call to outer,
    # so a fresh graph is traced each time.
    @tf.function
    def inner():
        pass
    inner()
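
To see the difference, a quick sketch (again with illustrative names): with the decorator applied at module level, the tf.function object is created once, so repeated calls with the same input signature reuse the cached graph and the trace-time print fires only once:

import tensorflow as tf

@tf.function
def inner(y):
    print("Tracing!")  # executes only during tracing
    return y * 2

def outer(x):
    return inner(x)

for i in range(5):
    outer(tf.constant(i))  # "Tracing!" prints only on the first call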

Looking forward to your reply.

DLPerf commented 1 year ago

But some variables depend on the outer function, so the code may become more complex if this change is made. Is it necessary to make the change, or do you have any other ideas? @luke-who

luke-who commented 1 year ago

Hey! Thank you for raising this. As you mentioned, there are variables that depend on the outer function in this case. Here's an idea that keeps the original inner and outer functions the same:

import tensorflow as tf
import tensorflow_federated as tff

# batch_train, batch_loss, initial_model and sample_batch
# are defined in the tutorial.

# Compile the function once, outside the loop
batch_train_comp = tff.tf_computation(batch_train)

# Wrap the compiled function so it is traced only once
train_on_batch = tf.function(batch_train_comp.fn, autograph=False)

# Run the loop
model = initial_model
losses = []
for _ in range(5):
  model = train_on_batch(model, sample_batch, 0.1)
  losses.append(batch_loss(model, sample_batch))

In this updated version, batch_train_comp is the compiled TFF computation, and train_on_batch is the callable tf.function that wraps it. Because train_on_batch is created once, outside the loop, its traced graph is reused across iterations instead of being rebuilt each time; the autograph=False argument simply disables AutoGraph's conversion of Python control flow inside the wrapped function. Give it a try and let me know if this works!
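
If you do want to refactor eventually, one option is to pass the values the inner function would otherwise close over (model weights, learning rate) as explicit arguments, so the tf.function can be defined once at module level. Below is a minimal sketch with hypothetical names and a toy squared-error model, not the tutorial's actual code:

import tensorflow as tf

@tf.function
def train_step(weights, batch_x, batch_y, learning_rate):
    # One gradient-descent step; everything the old closure captured
    # is now an explicit argument.
    with tf.GradientTape() as tape:
        tape.watch(weights)
        loss = tf.reduce_mean(tf.square(tf.matmul(batch_x, weights) - batch_y))
    grad = tape.gradient(loss, weights)
    return weights - learning_rate * grad

weights = tf.zeros([3, 1])
batch_x = tf.random.normal([8, 3])
batch_y = tf.random.normal([8, 1])
for _ in range(5):
    weights = train_step(weights, batch_x, batch_y, 0.1)  # traced only once

Since the tensor arguments keep the same shape and dtype on every iteration (and 0.1 is the same Python value each time), the cached trace is reused and no retracing occurs.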