ixxi-dante / an2vec

Bringing node2vec and word2vec together for cool stuff
GNU General Public License v3.0

Validate mini-batch with convolutionless feature-only reconstruction #19

Closed: wehlutyk closed this issue 6 years ago

wehlutyk commented 6 years ago

It should work at least as well as full batch training in this case.

wehlutyk commented 6 years ago

So mimicking what @jaklevab is doing for #6, here we train for 20000 epochs, saving the total final loss, the final q_mulogDu_flat_loss, and model_2_loss (i.e. the feature reconstruction loss).

Parameters we include (indicative ranges, though this makes for too long a computation):

Repeat each combination 4 times.
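
As a minimal sketch of the sweep structure (the parameter names and values below are placeholders, not the actual ranges from the list above):

import itertools

# Placeholder grid: substitute the actual parameter names and ranges listed above.
param_grid = {
    'batch_size': [50, 200],
    'max_walk_length': [10, 40],
}
n_repeats = 4

names = sorted(param_grid)
for values in itertools.product(*(param_grid[name] for name in names)):
    params = dict(zip(names, values))
    for repeat in range(n_repeats):
        # Train for 20000 epochs with `params`, then save the total final loss,
        # the final q_mulogDu_flat_loss, and model_2_loss.
        pass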

wehlutyk commented 6 years ago

Mini-batching seems to be mighty slow on bigger graphs, though it also depends a lot on the characteristics of the graph. I tried a few things, and making the batching faster isn't obvious. Just jotting down the few things I tried (and discarded for now):


For jumpy_walks(): cache the graph generated from the adjacency matrix, and copy the cached graph each time:

import joblib
import networkx as nx
import numpy as np

def get_nx_from_numpy_array():
    # Closure holding a cache of graphs already built from adjacency matrices,
    # keyed by a hash of the matrix contents
    cache = {}

    def _nx_from_numpy_array(adj):
        assert isinstance(adj, np.ndarray)
        key = joblib.hash(adj)
        if key not in cache:
            cache[key] = nx.from_numpy_array(adj)
        return cache[key]

    return _nx_from_numpy_array

nx_from_numpy_array = get_nx_from_numpy_array()

def jumpy_walks(adj, batch_size, max_walk_length):
    # Check the adjacency matrix is:
    # ... an ndarray...
    assert isinstance(adj, np.ndarray)
    # ... undirected...
    assert (adj.T == adj).all()
    # ... unweighted...
    assert ((adj == 0) | (adj == 1)).all()
    # ... with no diagonal elements.
    assert np.trace(adj) == 0

    # Copy the cached graph, since we remove nodes from it as we go
    g = nx_from_numpy_array(adj).copy()
    while len(g.nodes) > 0:
        sample = jumpy_distinct_random_walk(g, batch_size, max_walk_length)
        yield sample
        g.remove_nodes_from(sample)

Doesn't seem great, since nx.Graph.copy() has about the same overhead as the nx.from_numpy_array() call it replaces. This might vary with the sparsity of the graph.
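
A quick way to check that claim, assuming a small random test matrix (sizes are placeholders; the numbers will depend on the graph's size and sparsity):

import timeit

import networkx as nx
import numpy as np

adj = np.triu((np.random.rand(1000, 1000) < 0.01).astype(int), 1)
adj = adj + adj.T  # symmetric, 0/1, zero diagonal
g_cached = nx.from_numpy_array(adj)

t_build = timeit.timeit(lambda: nx.from_numpy_array(adj), number=10)
t_copy = timeit.timeit(lambda: g_cached.copy(), number=10)
print('from_numpy_array: {:.3f}s, copy: {:.3f}s'.format(t_build, t_copy))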


For epoch_batches(), run the jumpy_walks() generator in parallel (as keras does for batches):

import numpy as np
from keras.utils.data_utils import GeneratorEnqueuer

def epoch_batches(model, adj, batch_size, max_walk_length, neighbour_samples=None,
                  workers=1, use_multiprocessing=False, wait_time=0.05, max_queue_size=10):
    # Run jumpy_walks() in background threads/processes so that the next sample
    # of walks is prepared while the current batch is being consumed.
    enqueuer = None
    try:
        enqueuer = GeneratorEnqueuer(jumpy_walks(adj, batch_size, max_walk_length),
                                     use_multiprocessing=use_multiprocessing,
                                     wait_time=wait_time)
        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
        walks = enqueuer.get()

        for final_nodes in walks:
            required_nodes, feeds = _compute_batch(model, adj, final_nodes,
                                                   neighbour_samples=neighbour_samples)
            yield required_nodes, np.array(sorted(final_nodes)), feeds

    finally:
        if enqueuer is not None:
            enqueuer.stop()

Also not super useful; I'm not sure why (overhead of the threading?).
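
A rough sketch to isolate that threading overhead, reusing jumpy_walks() from above on a small random test graph (sizes and parameters are placeholders):

import itertools
import time

import numpy as np
from keras.utils.data_utils import GeneratorEnqueuer

adj = np.triu((np.random.rand(500, 500) < 0.02).astype(int), 1)
adj = adj + adj.T  # symmetric, 0/1, zero diagonal

def drain(walks, limit=10000):
    # Iterate through a generator of walk samples and time it
    start = time.time()
    n_batches = sum(1 for _ in itertools.islice(walks, limit))
    return n_batches, time.time() - start

# Plain, single-threaded iteration
n_plain, t_plain = drain(jumpy_walks(adj, batch_size=50, max_walk_length=10))

# The same generator behind a GeneratorEnqueuer
enqueuer = GeneratorEnqueuer(jumpy_walks(adj, batch_size=50, max_walk_length=10),
                             use_multiprocessing=False, wait_time=0.05)
enqueuer.start(workers=1, max_queue_size=10)
n_enq, t_enq = drain(enqueuer.get(), limit=n_plain)
enqueuer.stop()

print('plain: {} batches in {:.3f}s, enqueued: {} batches in {:.3f}s'
      .format(n_plain, t_plain, n_enq, t_enq))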


Am now moving on to using real datasets (#8) to see where things are limited in practice.

wehlutyk commented 6 years ago

Also reducing wait_time in Model.fit_generator_feed() has some effect.

jaklevab commented 6 years ago

Seeing as the trends given by #6 are monotonic for p_out, I think there are some parameter combinations we can ignore (to discuss). Also, given the large number of models to be tried on the GPUs, it might be interesting to move to TF on CPU to take advantage of the LIP/PSMN servers.

wehlutyk commented 6 years ago

Seeing as the trends given by #6 are monotonic for p_out, I think there are some parameter combinations we can ignore (to discuss).

Yep indeed. We can talk about that whenever.

Also, given the large number of models to be tried on the GPUs, it might be interesting to move to TF on CPU to take advantage of the LIP/PSMN servers.

Ooh, good point. Sam tells me the PSMN has 5000-10000 cores; that could be worth trying for this kind of embarrassingly parallel parameter exploration.
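
For reference, a minimal sketch of what forcing TF onto CPU could look like with the TF 1.x / Keras 2.x APIs in use here (the core count is a placeholder, to be matched to the job's allocation):

import os

# Hide the GPUs so TensorFlow falls back to CPU
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import tensorflow as tf
from keras import backend as K

n_cores = 8  # placeholder: match the number of cores allocated on the cluster
config = tf.ConfigProto(device_count={'GPU': 0},
                        intra_op_parallelism_threads=n_cores,
                        inter_op_parallelism_threads=n_cores)
K.set_session(tf.Session(config=config))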

wehlutyk commented 6 years ago

So let's get these parameter ranges sorted out.

Last time I looked at this I was aiming for:

Here we have 2 × 2 × 4 × 4 × 3 = 192 trainings.

The problem is that the training time for an epoch really depends on the mini-batch parameters. I saw training times of up to 2 hours (20000 epochs) when I launched this (which is why I killed it).

Do you think this is reasonable? I can try again once #21 is done, might be a lot faster.

wehlutyk commented 6 years ago

In fact, the main 'transitions' I imagine happen when the mini-batch size and the random-walk (RW) size are below vs. above the average community size. That's when the samples that the model sees really change.

First, setting the RW size aside: what happens if the mini-batch size is smaller than the average community (and in that case, what happens to the largest communities, which are always above the mini-batch size)? I'm guessing the model doesn't see many borders between communities (i.e. samples that include significant portions of 2 distinct communities). What does that change in the results of learning?

Second, what happens if the mini-batch size is larger than the average community, but the RW size is smaller? It looks like the same situation as before, except that the model is exposed to samples with disconnected sub-groups (the RWs), and might conclude that the average community size corresponds to the size of a RW.

How do these settings influence what is learned? Are there other nasty cases/combinations we're not seeing?
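
For concreteness, here is a toy sketch of the two regimes described above, assuming the community labels of the synthetic graph are available (function and variable names are placeholders):

import numpy as np

def batching_regime(community_labels, batch_size, max_walk_length):
    # Compare a (batch_size, max_walk_length) pair to the average community size
    _, sizes = np.unique(community_labels, return_counts=True)
    avg_community_size = sizes.mean()
    if batch_size < avg_community_size:
        return 'mini-batch smaller than the average community'
    if max_walk_length < avg_community_size:
        return 'mini-batch larger, but RWs smaller than the average community'
    return 'mini-batch and RWs both larger than the average community'

# Example: 10 communities of 50 nodes each (placeholder values)
labels = np.repeat(np.arange(10), 50)
print(batching_regime(labels, batch_size=30, max_walk_length=10))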

jaklevab commented 6 years ago

192 trainings seems reasonable. For the case in #6 I believe I was doing around 240 trainings, which took around a week. As you say, some of the trainings take a really long time, and you only see it once you're actually training the model. For the PSMN, I'm not sure how easy it will be to install the TF/keras libraries. Maybe we can try running some instances on phrunch/brunch first to see.

wehlutyk commented 6 years ago

This issue mixes two questions: minibatch validation (i.e. checking that it works as expected when there are no convolutions), which is now #28, and minibatch sensitivity analysis (i.e. exploring how it behaves with different parameter values), which is now #29.

Closing in favour of those two issues.