tf-encrypted / tf-encrypted

A Framework for Encrypted Machine Learning in TensorFlow
https://tf-encrypted.io/
Apache License 2.0

Federated Learning and TF Encrypted #670

Open mortendahl opened 5 years ago

mortendahl commented 5 years ago

Goals

Deliverables:

Background

TF Distribute strategies

From the docs, distribute strategies are about state & compute distribution policy on a list of devices.

From the guide:

The only things that need to change in a user's program are: (1) Create an instance of the appropriate tf.distribute.Strategy and (2) Move the creation and compiling of Keras model inside strategy.scope.

strategy.scope() indicates which parts of the code to run distributed. Creating a model inside this scope allows us to create mirrored variables instead of regular variables. Compiling under the scope allows us to know that the user intends to train this model using this strategy. Once this is set up, you can fit your model like you would normally [i.e. outside the scope].

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)

TF Federated

From the docs:

Currently, TensorFlow does not fully support serializing and deserializing eager-mode TensorFlow. Thus, serialization in TFF currently follows the TF 1.0 pattern, where all code must be constructed inside a tf.Graph that TFF controls. This means currently TFF cannot consume an already-constructed model; instead, the model definition logic is packaged in a no-arg function that returns a tff.learning.Model. This function is then called by TFF to ensure all components of the model are serialized.

On TF distribution strategies vs Federated Core, from the docs:

the stated goal of tf.distribute is to allow users to use existing models and training code with minimal changes to enable distributed training, and much focus is on how to take advantage of distributed infrastructure to make existing training code more efficient. The goal of TFF's Federated Core is to give researchers and practitioners explicit control over the specific patterns of distributed communication they will use in their systems. The focus in FC is on providing a flexible and extensible language for expressing distributed data flow algorithms, rather than a concrete set of implemented distributed training capabilities.

One of the primary target audiences for TFF's FC API is researchers and practitioners who might want to experiment with new federated learning algorithms and evaluate the consequences of subtle design choices that affect the manner in which the flow of data in the distributed system is orchestrated, yet without getting bogged down by system implementation details. The level of abstraction that FC API is aiming for roughly corresponds to pseudocode one could use to describe the mechanics of a federated learning algorithm in a research publication - what data exists in the system and how it is transformed, but without dropping to the level of individual point-to-point network message exchanges.

From the tutorial on text:

def create_tff_model():
  ...
  # `compile` is a helper defined earlier in the tutorial that compiles the Keras model
  keras_model_clone = compile(tf.keras.models.clone_model(keras_model))
  return tff.learning.from_compiled_keras_model(
      keras_model_clone, dummy_batch=dummy_batch)

# This command builds all the TensorFlow graphs and serializes them
fed_avg = tff.learning.build_federated_averaging_process(model_fn=create_tff_model)

# Perform federated training steps
state = fed_avg.initialize()
state, metrics = fed_avg.next(state, [example_dataset.take(1)])
print(metrics)

Note that state can be used to update a local clone of the model for evaluation after each iteration:

state = fed_avg.initialize()

state = tff.learning.state_with_new_model_weights(
    state,
    trainable_weights=[v.numpy() for v in keras_model.trainable_weights],
    non_trainable_weights=[
        v.numpy() for v in keras_model.non_trainable_weights
    ])

def keras_evaluate(state, round_num):
  tff.learning.assign_weights_to_keras_model(keras_model, state.model)
  print('Evaluating before training round', round_num)
  keras_model.evaluate(example_dataset, steps=2)

for round_num in range(NUM_ROUNDS):
  keras_evaluate(state, round_num)
  state, metrics = fed_avg.next(state, train_datasets)
  print('Training metrics: ', metrics)

keras_evaluate(state, NUM_ROUNDS + 1)

Terminology

Functionality and Protocol: the former is basically used as a function from and to local tensors, while the latter is a more general means of specifying how functionalities are to be computed, e.g. using cryptographic techniques. As such, protocols are the only one of the two used as context managers. This roughly follows UC terminology, although functionalities are not intended to be used as sub-protocols.
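For intuition, the intended distinction might look roughly as follows (a hypothetical sketch reusing names from the suggested API below; none of this is existing tfe code):

# a functionality is used like a function from and to local tensors
aggregated_grad = tfe.functionalities.AdditiveSecureAverage(
    compute_players=data_owners,
    output_receiver=model_owner)(plaintext_grads)

# a protocol is used as a context manager that specifies *how* the
# computations inside it are carried out, e.g. with which cryptographic scheme
with tfe.protocol.Pond(data_owners):
  aggregated_grad = tfe.add_n(grads) / len(grads)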

Suggested API

# specify the players involved
model_owner = tfe.Player('model_owner')
data_owners = [
    tfe.Player('data_owner_0'),
    tfe.Player('data_owner_1'),
    tfe.Player('data_owner_2'),
]

# build a data pipeline on each data owner;
# this would likely be a unique function per owner (see the sketch after this example)
data_sources = [
    build_data_pipeline(data_owner)
    for data_owner in data_owners
]

# use fast, non-resilient, secure aggregation based on additive secret sharing
aggregation = tfe.functionalities.AdditiveSecureAverage

# ... alternatively we could have instantiated it explicitly,
# resulting in exactly the same thing
aggregation = tfe.functionalities.AdditiveSecureAverage(
    compute_players=data_owners,
    output_receiver=model_owner)

# ... or we could have unrolled its (simplified) implementation
def aggregation(plaintext_grads):
  pond = tfe.protocol.Pond(data_owners)
  with pond:
    grads = [
        tfe.define_private_input(grad, owner)
        for grad, owner in zip(plaintext_grads, data_owners)
    ]
    aggregated_grad = tfe.add_n(grads) / len(grads)
    return tfe.reveal(aggregated_grad, model_owner)

# initialising the federated protocol with the model owner in order
# to specify where the reference weights of the model should live
# and where updates should happen
federated = tfe.protocol.FederatedLearning(model_owner, aggregation)

# use it as a context to essentially trace the model-building code to be
# executed on both model and data owners, with all variables controlled
# by the protocol
with federated:

  # creating outside the context would cause an error due to wrong locality
  # between inputs and weights
  model = tfe.keras.Sequential()
  model.add(tfe.keras.Dense())
  model.add(tfe.keras.ReLU())

  # compiling outside the context would cause an error due to wrong locality
  # between weights and weight updates
  model.compile(
      optimizer=tf.train.AdamOptimizer(0.001),
      loss='categorical_crossentropy',
      metrics=['accuracy'])

# ... alternatively, `model_fn` functions can be passed to the protocol for
# compilation, allowing e.g. for different models to be run on the players
def model_fn():
  model = tfe.keras.Sequential()
  model.add(tfe.keras.Dense())
  model.add(tfe.keras.ReLU())

  model.compile(
      optimizer=tf.train.AdamOptimizer(0.001),
      loss='categorical_crossentropy',
      metrics=['accuracy'])

  return model

model = federated.compile({
    player: model_fn
    for player in [model_owner] + data_owners
})

# fitting can be done anywhere (following ordinary TF) yet the locality of
# the training data must match with the data owners
model.fit(data_sources, epochs=10)
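The build_data_pipeline helper above is left undefined; a minimal sketch of what it might look like, assuming each tfe.Player exposes a device string as device_name, and with the TFRecord path and parser as placeholders:

def build_data_pipeline(data_owner, batch_size=32):
  # pin the input pipeline to the data owner's device so the raw
  # training data never leaves that player
  with tf.device(data_owner.device_name):
    dataset = tf.data.TFRecordDataset('data.tfrecord')  # placeholder path
    dataset = dataset.map(parse_example)                # placeholder parser
    dataset = dataset.repeat().batch(batch_size)
  return dataset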
jvmncs commented 5 years ago

Spoke with @mortendahl briefly and we agreed that much of the code beneath this API can be written as a custom tf.distribute.Strategy. Roughly speaking, the aggregation function above could be written in the language of these Strategies. This should simplify how this interface works with the Keras and Estimator APIs.

mortendahl commented 5 years ago

Note to self: find a way where the federated protocol simply wraps an existing model on the model owner?

mortendahl commented 5 years ago

Design and discussion has moved into a RFC: https://github.com/tf-encrypted/rfcs/pull/1

mortendahl commented 5 years ago

@yanndupis @jvmancuso the more I look into this the more I think we should simply stick with the FL example we have, but upgrade it to use TF Keras and TF 2.0.

I'll keep working on the details of integrating with TF Federated and distribution strategies, but I'm worried that time will be short and it's not worth rushing.

WDYT?

yanndupis commented 5 years ago

@mortendahl - happy to explore other alternatives. In general, I think it's important that it's easy to use and that we can support arbitrary Keras models. So maybe we just need to improve the abstraction and automate some of the steps in the example. For example [here], we would like to avoid manually defining the model weights.

It would be nice to kick off the training process with model.fit(data_sources, epochs=10) and watch the training progress. Also, I think we would like to have some utils to distribute the data among several data owners.

jvmncs commented 5 years ago

@yanndupis @jvmancuso the more I look into this the more I think we should simply stick with the FL example we have, but upgrade it to use TF Keras and TF 2.0.

Yeah, I agree. Let's stick with the current abstraction and add some syntactic sugar, like what @yanndupis referred to.

justin1121 commented 5 years ago

It would be nice to kick off the training process with model.fit(data_sources, epochs=10) and watch the training progress.

@yanndupis how do you think we could accomplish the above? Would we create a keras model wrapper? Or be able to make use of tfe.keras somehow?

Also, I think we would like to have some utils to distribute the data among several data owners.

Can you provide some more details about this also? Not 100% sure what you mean.

yanndupis commented 5 years ago

Can you provide some more details about this also? Not 100% sure what you mean.

In TF Federated, they have this concept of tff.simulation.datasets.emnist.load_data. It constructs a tf.data.Dataset object for each data owner. Currently here, the number of data owners and the tfrecord files are hard-coded. If we want to be able to easily experiment, it would be nice to have a method that takes data in NumPy format (or even tfrecord) and a number of players. Then the data would be distributed evenly among those players.
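Roughly, something like this simulation-only helper is what I have in mind (a minimal sketch; the helper name and the x_train / y_train arrays are placeholders):

import numpy as np
import tensorflow as tf

def split_among_players(x, y, num_players):
  # split the arrays evenly and build one tf.data.Dataset per player
  x_splits = np.array_split(x, num_players)
  y_splits = np.array_split(y, num_players)
  return [
      tf.data.Dataset.from_tensor_slices((x_part, y_part))
      for x_part, y_part in zip(x_splits, y_splits)
  ]

# e.g. one dataset per data owner, for simulation only
player_datasets = split_among_players(x_train, y_train, num_players=3)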

For model.fit, I will put more thought into it tomorrow. It could be as simple as wrapping this section in a .fit. I also like the API suggested above by @mortendahl; we just need to assess whether it's too much work for our timeline.
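Roughly, the kind of wrapping I mean (purely a sketch; the class and method names below are placeholders standing in for what the current example does):

class FederatedModel:
  def __init__(self, model_owner, data_owners):
    self.model_owner = model_owner
    self.data_owners = data_owners

  def fit(self, epochs=1):
    for _ in range(epochs):
      # each data owner computes a gradient on its local data (placeholder method)
      grads = [owner.compute_gradient() for owner in self.data_owners]
      # the model owner securely aggregates the gradients and updates
      # the reference weights (placeholder method)
      self.model_owner.update_model(*grads)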

mortendahl commented 5 years ago

For example [here], we would like to avoid manually defining the model weights.

Yes, fully agree.

It would be nice to kick off the training process with model.fit(data_sources, epochs=10) and watch the training progress.

Following the current abstractions in the example, it may make more sense to take the data owners as input instead of the data sources.

it would be nice to have a method that takes data in NumPy format (or even tfrecord) and a number of players. Then the data would be distributed evenly among those players.

This seems useful indeed, but let's make sure to clearly mark it as something that's for simulation use only.

justin1121 commented 5 years ago

This is being worked on here. Feedback welcome!

justin1121 commented 5 years ago

In https://github.com/tf-encrypted/tf-encrypted/pull/695 we added a flag to support simulating splitting the data across the parties. It currently only works with local computations; we should see if we can get this to work with remote computations, using either tf.device or some other way. See this thread: https://github.com/tf-encrypted/tf-encrypted/pull/695#discussion_r335050263
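For the remote case, the tf.device direction could look roughly like this (a rough sketch; the job names are illustrative and would have to match the cluster config, and x_splits / y_splits stand for the per-party slices):

import tensorflow as tf

# assign each party's slice of the data to that party's device so the
# per-party datasets are placed on the remote workers rather than locally
datasets = []
for i, (x_part, y_part) in enumerate(zip(x_splits, y_splits)):
  with tf.device('/job:data_owner_%d/task:0' % i):
    datasets.append(tf.data.Dataset.from_tensor_slices((x_part, y_part)))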