neuroailab / tfutils

Utilities for working with tensorflow

future things to do #69

Closed yamins81 closed 6 years ago

yamins81 commented 7 years ago

overall interface

class interface instead of functional interface

.get_data

.get_model

.get_loss

.get_optimizer

.train --> .train_loop

.train_from_params default order of calls and some glue

---> issue of how logging is done?

---> or there's the "callable" method
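
A minimal sketch of what such a class interface might look like (the class name, constructor, and default method bodies here are hypothetical; only the method names come from the notes above):

class Trainer(object):
    """Hypothetical class-based replacement for the current functional interface."""

    def __init__(self, params):
        self.params = params

    def get_data(self):
        raise NotImplementedError        # return input tensors / data provider

    def get_model(self, inputs):
        raise NotImplementedError        # build the model graph on top of `inputs`

    def get_loss(self, inputs, outputs):
        raise NotImplementedError        # return a scalar loss op

    def get_optimizer(self, loss):
        raise NotImplementedError        # return the training op

    def train_loop(self, sess, train_targets):
        return sess.run(train_targets)   # innermost loop; one sess.run per step by default

    def train_from_params(self):
        # default order of calls plus glue; logging/saving would hook in here
        inputs = self.get_data()
        outputs = self.get_model(inputs)
        loss = self.get_loss(inputs, outputs)
        train_op = self.get_optimizer(loss)
        return {'loss': loss, 'optimizer': train_op, 'outputs': outputs}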


multi-gpu stuff

distribute over nodes

rationalize and simplify the way within-node multi-gpu handling of a single model is done in the base functions

validation w/o creating a new copy of the model? on a different thread or not?

training multiple models on different GPUs with a common data input setup --> relieves network congestion


loading

mapping the names for partial loading

combining multiple sources for loading? (low priority)


misc

actually get the per-variable learning rates and store them

allow the restarting with specifiable/controllable per-variable learning rates

modify the standard optimizer to return a summary of the gradients and log them

optionally, make the seed in the standard data provider chooser depend on the saved global batch

put a less terrible version of the graph abstraction into the standard saver for saving the beginning of training.

convenience loader for stuff saved to GFS; the interface should be nice, and it should handle the too-big-to-load problem in a standard way

make sure that threads that fail to write to the DB shut themselves down by emitting a coordinator stop

in general, deal with threads that fail to emit coordinator stop signals when appropriate (see the sketch at the end of this list)

don't allow specifying the cache directory manually; instead, create subfolders in the cache directory every time a new model is run

cache directory loading issues in test_from_params (ask Chengxu)
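
Regarding the coordinator-stop items above, a minimal sketch of the intended behavior (the db_writer_loop body and the dbinterface.save call are hypothetical; tf.train.Coordinator.request_stop/should_stop are the standard TF mechanisms):

import tensorflow as tf

def db_writer_loop(coord, dbinterface):
    # Hypothetical body of a DB-writing thread: instead of dying silently,
    # any failure triggers a coordinator stop so the whole run shuts down.
    try:
        while not coord.should_stop():
            dbinterface.save()      # hypothetical write to the database
    except Exception as e:
        coord.request_stop(e)       # exception is re-raised by coord.join() in the main thread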


fix bugs

problem with the untar'ing if the cache directory is on the NAS

alexandonian commented 7 years ago

For bookkeeping purposes, and for anyone interested in the status of this issue, here is a brief summary of the current plans to address the overall interface and the multi-gpu items.

Description of requested improvements

For a given tfutils experiment running on a single node, optimize the behavior of get_data, get_model, get_loss, get_learning_rate, get_optimizer, and train with respect to:

  1. evaluating the model in a particular 'phase'/'mode' (e.g. train vs. test/validation)
  2. deploying multiple (potentially different) single-gpu models and/or one multi-gpu model
  3. distributing potentially common dataproviders to all models running on a node
  4. simplifying access to arbitrary graph 'Ops' during an experiment
  5. providing greater control over the innermost training loop

Summary of potential solutions:

First suggested change: To address requests (1) and (4), it was suggested that lines 989 - 1003 in master/tfutils/base.py, which contain the calls to get_model, get_loss, get_learning_rate and get_optimizer, be replaced with a single, optional, user-specified function (potentially called get_targets) that accepts train_inputs (during training) or vinputs (during validation), as provided by a call to get_data, and returns the final train_targets dictionary, which in this case could be any desired set of valid graph Ops. Users needn't explicitly specify this function, since it should default to the current behavior; if they do specify it, it needn't contain calls to loss or optimizer ops (if users just want to make a simple forward pass, for example).

An additional suggestion was to pass this get_targets function a flag indicating the experiment's 'phase' or 'mode' (e.g. whether you are training or testing), which would alter the contents of the target dict it returns.
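
To make this concrete, a user-supplied get_targets might look something like the following sketch (the signature, the mode values, and the toy model inside are assumptions for illustration, not an agreed-upon API):

import tensorflow as tf

def get_targets(inputs, mode='train'):
    # Hypothetical replacement for the fixed get_model -> get_loss ->
    # get_learning_rate -> get_optimizer sequence. `inputs` is the dict
    # produced by get_data; the 'images'/'labels' keys are assumptions.
    flat = tf.layers.flatten(inputs['images'])           # toy stand-in for the real model
    logits = tf.layers.dense(flat, 10)
    targets = {'logits': logits}
    if mode == 'train':
        loss = tf.losses.sparse_softmax_cross_entropy(labels=inputs['labels'],
                                                      logits=logits)
        optimizer = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)
        targets.update({'loss': loss, 'optimizer': optimizer})
    return targets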

Analysis of first suggested change: This change is more subtle than it appears. Initially, it seemed appropriate given the requests it aimed to address: users would certainly have tight control over which Ops are to be evaluated (since they would have to be specified manually). However, after looking through master/tfutils/tests/test.py, it seemed like this was already a feature of the current tfutils. Both train_params['targets'] and validation_params['targets'] accept a function func whose purpose is to extract and return particular Ops from the model to be evaluated. Thus, this change seemed to address only request (1), and if the underlying motivation for request (1) is to eliminate redundancies between the training and validation routines, then there may be other changes better suited for that (e.g. consolidating base.train() and base.test() into one function, since they have almost identical structure).

It turns out constructing the entire graph, from input data to train/val target Ops, all at once may help to facilitate requests (2) and (3): it is much easier to share dataproviders and distribute model ops across the GPUs of a node when the whole graph is built in one place. Currently, you cannot retrieve the final target ops without accessing both model_params and train_params/validation_params, because the full model definition is stored in the former and the target ops are specified in the latter. In the single-gpu, single-model, single-dataprovider case, calling get_data -> get_model -> get_loss etc. sequentially is doable, because get_data instantiates a single provider, which produces a single set of inputs to be fed into a single model, for which there is a single function that extracts the appropriate target ops, and so on. Calling these functions sequentially and independently in the distributed case requires that global information about the configuration of the entire node be passed from one function to the next... This just seems cumbersome and slightly annoying.

Second suggested change: To address request (5), replace the call to sess.run(train_targets) on line 795 with train_results = train_loop(sess, train_targets), where train_loop is some user-defined callable that describes a custom training loop.

Analysis of second suggested change: This is a very worthwhile change and is included in the 'Proposed plan of attack' section below.

Proposed plan of attack:

Package and repurpose get_data, get_model, get_loss, get_optimizer, etc.

The functionality offered by these routines obviously needs to stay, and packaging all of them into a single function call seems to be the most convenient way to also address requests (2) and (3) (i.e., incorporate the distribution of a model or collection of models over the GPUs into tfutils.base.py and appropriately distribute dataproviders to individual models or GPUs to maximize data sharing and reduce network congestion).

Here is an example that demonstrates how this might work:

Step 1: For each model you are interested in running, write a function that describes the single-gpu graph. This step is the same as before. Note: you needn't specify how the model is distributed across GPUs inside the model function itself; instead, specify the distribution in the model_params dictionary.

For example, suppose you would like to train two different models, say model.mnist_tfutils and model.alexnet_tfutils, each distributed across two GPUs. You might specify this configuration like so:

params['model_params'] = {'gpu0': {'func': model.mnist_tfutils},
                          'gpu1': {'func': model.mnist_tfutils},
                          'gpu2': {'func': model.alexnet_tfutils},
                          'gpu3': {'func': model.alexnet_tfutils},
                          }

Or, perhaps like this:

params['model_params'] = {'mnist': {'func': model.mnist_tfutils,
                                    'devices': [0, 1]},
                          'alexnet': {'func': model.alexnet_tfutils,
                                      'devices': [2, 3]},
                          }

Step 2: Next, specify how data is distributed to each model and each node. TBD: this can be done in many different ways. For the sake of the example, imagine you distribute by model:

params['train_params'] = {'data_params': {'mnist': {'func': data.MNIST,
                                                    'batch_size': 100,
                                                    'group': 'train',
                                                    'n_threads': 4,
                                                    'devices': [0, 1]},  # Devices 0 and 1 share the MNist provider
                                          'alexnet': {'func': data.ImageNet,
                                                      'batch_size': 100,
                                                      'group': 'train',
                                                      'n_threads': 4,
                                                      'devices': [2, 3]}},  # Devices 2 and 3 share the ImageNet provider
                          'queue_params': {'queue_type': 'fifo',
                                           'batch_size': 100},
                          'num_steps': 500
                          }

The pseudo-code for a private function that distributes models and dataproviders depends heavily on the structure of the dicts above, but it might look something like this:

def _distribute(model_params, data_params):

    for name, model in model_params.items():                 # For each distinct model (e.g. model.mnist_tfutils and model.alexnet_tfutils)
        dataprovider = generate_provider(data_params[name])  # Generate a single provider for each model type
        for gpu in model['devices']:
            deploy(model, gpu)                                # Deploy an instance of the model type on this gpu
            dataprovider.assign(gpu)                          # Assign the current provider to feed data to the model housed on this gpu

The output of this pipeline should be a final targets_dict (e.g. train_targets) that can be passed to the user-specified train_loop callable. Since the user is involved in specifying the final targets, they can request any Op needed to execute the custom training loop.

Lastly, the custom training loop would be called like so:

        if train_loop is not None:
            train_results = train_loop(sess, train_targets)
        else:
            train_results = sess.run(train_targets)

defaulting to its old behavior if train_loop is left as None.
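
For illustration, a user-defined train_loop might look something like this sketch (the number of inner steps and the 'train_loop' entry in train_params are assumptions about how it could be wired up):

def my_train_loop(sess, train_targets):
    # Hypothetical custom inner loop: run several steps per call and keep the
    # last result, e.g. for multi-pass inputs like the LSTM example below.
    results = None
    for _ in range(4):
        results = sess.run(train_targets)
    return results

params['train_params']['train_loop'] = {'func': my_train_loop}  # assumed config format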

Concrete example where proposed solution alleviates issues.

    # imports assumed by this example: numpy as np, tensorflow as tf, tqdm's trange (Python 2, hence xrange)
    params = base.get_params()
    sess, queues, dbinterface, valid_targets_dict = base.test_from_params(**params)
    # start queue runners
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)
    """
    FIRST ISSUE: Getting handles to network parts
        Above, base.test_from_params() is called to retrieve handles on the sess,
        queues, dbinterface and valid_targets_dict; the queue runners are then
        started a second time (since they are already started once inside
        base.test_from_params()).

        Below, specific ops and placeholders are extracted from valid_targets_dict.

    REQUEST:
        "I would prefer not to have to use valid_targets_dict to get handles to
        arbitrary nodes in the graph but specify a separate test function
        for testing (or train function for training) that outputs me a 
        dict of the handles I want…"
    SOLUTION:
        By implementing the packaged `get_targets` function, you should be
        able to retrieve all of the ops and placeholders during the innermost
        training loop!
    """
    # Ops
    get_images = valid_targets_dict['valid0']['targets']['img']
    encode = valid_targets_dict['valid0']['targets']['encode']
    run_lstm = valid_targets_dict['valid0']['targets']['run_lstm']
    decode = valid_targets_dict['valid0']['targets']['decode']

    # placeholders
    ph_enc_inp = valid_targets_dict['valid0']['targets']['ph_enc_inp']
    ph_lstm_inp = valid_targets_dict['valid0']['targets']['ph_lstm_inp']
    ph_dec_inp = valid_targets_dict['valid0']['targets']['ph_dec_inp']
    ph_dec_cond = valid_targets_dict['valid0']['targets']['ph_dec_cond']
    # unroll across time
    n_context = 2
    """
    SECOND ISSUE:
        For a single input, sess.run must be called separately multiple times.
        (see below).
    SOLUTION:
        You will be able to implement the routine below in train_loop(sess, targets),
        where you can repeatedly call sess.run on the target ops you have specified
        in `targets`. train_loop will be called from within base.test_from_params,
        instead of outside like you have here.
    """
    for ex in xrange(valid_targets_dict['valid0']['num_steps']):
        # get input images
        images = sess.run(get_images)[0].astype(np.float32) / 255.0
        context_images = np.zeros(list(images.shape[:-1]) + list([256]))
        # encode context images
        print('Encoding context:')
        for im in trange(n_context, desc='timestep'):
            context_image = np.expand_dims(images[:,im,:,:,:], 1)
            context_images[:,im,:,:,:] = np.squeeze(sess.run(encode, 
                    feed_dict={ph_enc_inp: context_image})[0])
        # predict images pixel by pixel, one after another
        print('Generating images pixel by pixel:')
        predicted_images = []
        for im in trange(n_context, images.shape[1], desc='timestep'):
            encoded_images = sess.run(run_lstm,
                    feed_dict={ph_lstm_inp: context_images})[0]
            image = np.zeros(images[:,im,:,:,:].shape)
            image = np.expand_dims(image, 1)
            context = encoded_images[:,im-1,:,:,:]
            context = np.expand_dims(context, 1)
            for i in trange(images.shape[-3], desc='height', leave=False):
                for j in trange(images.shape[-2], desc='width'):
                    #for k in xrange(images.shape[-1]): # predict all channels at once
                        image[:,0,i,j,:] = sess.run(decode,
                                feed_dict={ph_dec_inp: image,
                                    ph_dec_cond: context})[0][:,0,i,j,:]
            context_images[:,im,:,:,:] = np.squeeze(sess.run(encode,
                feed_dict={ph_enc_inp: image})[0])
            predicted_images.append(np.squeeze(image))
        predicted_images = np.stack(predicted_images, axis=1)
        np.save('predicted_images.npy', predicted_images)
        np.save('gt_images.npy', images)

General observations:

In order to better understand how tfutils operates, I generated the 'skeleton' for its primary functions:

def test_from_params():
    tf.Session()
    DBInterface()
    get_valid_targets_dict()
        get_data()
        get_model()
        check_model_equivalence()
        get_validation_target()
    DBInterface()
    test()

def test():
    start_queues()
    dbinterface.start_time_step = time.time()
    run_target_dict()
        run_targets()
            sess.run()
            dbinterface.save()
            online_agg_func()
            agg_func()
            dbinterface.sync_with_host()
    stop_queues(sess, queues, coord, threads)
    dbinterface.sync_with_host()

def train_from_params():
    get_data()
    get_model()
    get_learning_rate()
    get_optimizer()
    get_valid_target_dicts()
        get_data()
        get_model()
        check_model_equivalence()
        get_validation_targets()
    tf.Session()
    DBInterface()
    train()

def train():
    start_queues()
    dbinterface.start_time_step = time.time()
    run_target_dict()
        run_targets()
            sess.run()
            dbinterface.save()
            online_agg_func()
            agg_func()
            dbinterface.sync_with_host()
    dbinterface.start_time_step = time.time()
    sess.run()
    global_step_eval()
    run_targets_dict()
    dbinterface.save()
    stop_queues(sess, queues, coord, threads)
    dbinterface.sync_with_host()

My overall impression here is that the change to base.get_targets may be a good opportunity to bundle a lot of base.train and base.test into a single function.
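
As a rough, purely hypothetical sketch of that consolidation (names are taken from the skeletons above; the exact signatures are assumptions), train() and test() could delegate to one shared core:

import time

def run_phase(sess, queues, dbinterface, target_dicts, train_loop=None):
    # Hypothetical shared core for base.train() and base.test(): both start
    # queues, repeatedly run their target ops, save through dbinterface,
    # and stop queues / sync with the host at the end.
    coord, threads = start_queues(sess, queues)          # signature assumed
    try:
        for targets in target_dicts:
            dbinterface.start_time_step = time.time()
            if train_loop is not None:
                results = train_loop(sess, targets)
            else:
                results = sess.run(targets)
            dbinterface.save(results)                    # save signature assumed
    finally:
        stop_queues(sess, queues, coord, threads)
        dbinterface.sync_with_host()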