So mimicking what @jaklevab is doing for #6, here we train to 20000 epochs, saving the total final loss, the final `q_mulogDu_flat_loss`, and `model_2_loss` (i.e. the feature reconstruction loss).
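For reference, a minimal sketch of how those three final values could be dumped per run. The `save_final_losses` helper and the loss-dict keys are hypothetical illustrations (not the actual code used here); it only assumes a dict of loss curves such as a Keras `History.history`:

```python
import csv
from pathlib import Path


def save_final_losses(run_params, history, path="final_losses.csv"):
    # `history` is assumed to be a dict of loss curves (e.g. History.history);
    # we only keep the last value of each tracked loss.
    row = dict(run_params)
    for name in ["loss", "q_mulogDu_flat_loss", "model_2_loss"]:
        row[name] = history[name][-1]
    write_header = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Example with a fake two-epoch history:
save_final_losses(
    {"p_out": 1e-3, "batch_size": 50},
    {"loss": [1.2, 0.8], "q_mulogDu_flat_loss": [0.5, 0.3], "model_2_loss": [0.7, 0.5]},
)
```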
Parameters we include (indicative ranges, though this makes for too long a computation):

- `[10, 25, 50]`
- `[10, 25, 50]`
- `p_out`: `np.logspace` from `1e-6` to `p_in`, with 5 steps
- `[10, 50, 100, 200]`, maxed out at current network size
- `[2, 10, 25, 50, 100]`, maxed out at current mini-batch size

Repeat each combination 4 times.
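For concreteness, a sketch of how such a grid could be enumerated. The parameter names here are placeholders (the labels of the first two parameters were lost above, and `batch_size` / `max_walk_length` are borrowed from the `jumpy_walks()` signature below); only `p_out` spans a log scale up to `p_in`:

```python
import itertools

import numpy as np

p_in = 0.5  # placeholder value for the within-community probability
grid = {
    "param_a": [10, 25, 50],                  # first (unlabelled) parameter above
    "param_b": [10, 25, 50],                  # second (unlabelled) parameter above
    "p_out": np.logspace(np.log10(1e-6), np.log10(p_in), 5),
    "batch_size": [10, 50, 100, 200],         # capped at the network size
    "max_walk_length": [2, 10, 25, 50, 100],  # capped at the mini-batch size
}
n_repeats = 4

combinations = [
    dict(zip(grid.keys(), values))
    for values in itertools.product(*grid.values())
]
print(len(combinations) * n_repeats, "trainings")
```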
Mini-batching seems to be mighty slow on bigger graphs, but it also depends a lot on the characteristics of the graph. I tried a few things, but making the batching faster isn't obvious. Just jotting down the few things I tried (and discarded for now):

For `jumpy_walks()`: cache the graph generated from the adjacency matrix, and copy the cached graph each time:
```python
import joblib
import networkx as nx
import numpy as np


def get_nx_from_numpy_array():
    # Memoize graph construction: building an nx.Graph from a large adjacency
    # matrix is expensive, so key the cache on a hash of the matrix.
    cache = {}

    def _nx_from_numpy_array(adj):
        assert isinstance(adj, np.ndarray)
        key = joblib.hash(adj)
        if key not in cache:
            cache[key] = nx.from_numpy_array(adj)
        return cache[key]

    return _nx_from_numpy_array


nx_from_numpy_array = get_nx_from_numpy_array()


def jumpy_walks(adj, batch_size, max_walk_length):
    # Check the adjacency matrix is:
    # ... an ndarray...
    assert isinstance(adj, np.ndarray)
    # ... undirected...
    assert (adj.T == adj).all()
    # ... unweighted...
    assert ((adj == 0) | (adj == 1)).all()
    # ... with no diagonal elements.
    assert np.trace(adj) == 0

    # Copy the cached graph, since we remove nodes from it below.
    g = nx_from_numpy_array(adj).copy()
    while len(g.nodes) > 0:
        sample = jumpy_distinct_random_walk(g, batch_size, max_walk_length)
        yield sample
        g.remove_nodes_from(sample)
```
Doesn't seem great, since `nx.Graph.copy()` has about the same overhead as the previous `nx.from_numpy_array()`. Might vary with the sparsity of the graph.
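A quick way to check that on a given graph, as a rough timing sketch (the random graph and numbers are illustrative only; results will depend on graph size and density):

```python
import timeit

import networkx as nx
import numpy as np

# Random undirected, unweighted adjacency matrix with an empty diagonal.
rng = np.random.default_rng(0)
adj = (rng.random((500, 500)) < 0.05).astype(int)
adj = np.triu(adj, k=1)
adj = adj + adj.T

g = nx.from_numpy_array(adj)

t_build = timeit.timeit(lambda: nx.from_numpy_array(adj), number=20)
t_copy = timeit.timeit(lambda: g.copy(), number=20)
print(f"from_numpy_array: {t_build / 20:.4f}s, copy: {t_copy / 20:.4f}s")
```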
For `epoch_batches()`: run the `jumpy_walks()` generator in parallel (as keras does for batches):
```python
import numpy as np
from keras.utils.data_utils import GeneratorEnqueuer


def epoch_batches(model, adj, batch_size, max_walk_length, neighbour_samples=None,
                  workers=1, use_multiprocessing=False, wait_time=0.05,
                  max_queue_size=10):
    enqueuer = None
    try:
        # Fill a queue of walk samples in background workers, the way keras
        # does for fit_generator batches.
        enqueuer = GeneratorEnqueuer(jumpy_walks(adj, batch_size, max_walk_length),
                                     use_multiprocessing=use_multiprocessing,
                                     wait_time=wait_time)
        enqueuer.start(workers=workers, max_queue_size=max_queue_size)
        walks = enqueuer.get()
        for final_nodes in walks:
            required_nodes, feeds = _compute_batch(model, adj, final_nodes,
                                                   neighbour_samples=neighbour_samples)
            yield required_nodes, np.array(sorted(final_nodes)), feeds
    finally:
        if enqueuer is not None:
            enqueuer.stop()
```
Also not super useful, not sure why (overhead of the threading?).
Am now moving on to using real datasets (#8) to see where things are limited in practice.
Also, reducing `wait_time` in `Model.fit_generator_feed()` has some effect.
Seeing as the trends given by #6 are monotonic for `p_out`, I think there are some parameter combinations we can ignore (to discuss). Also, given the large number of models to be tried on the GPUs, it might be interesting to move to TF on CPU to take advantage of the LIP/PSMN servers.
> Seeing as the trends given by #6 are monotonic for `p_out`, I think there are some parameter combinations we can ignore (to discuss).

Yep indeed. We can talk about that whenever.
> Also, given the large number of models to be tried on the GPUs, it might be interesting to move to TF on CPU to take advantage of the LIP/PSMN servers.

Ooh good point. Sam tells me the PSMN has 5000-10000 cores, which could be worth trying for this kind of embarrassingly parallel parameter exploration.
So let's get these parameter ranges sorted out.
Last time I looked at this I was aiming for:
- `[20, 50]`
- `[20, 50]`
- `p_out`: `np.logspace` from `1e-6` to `p_in`, with 4 steps
- `[10, 50, 100, 200]`, maxed out at current network size
- `[10, 50, 100]`, maxed out at current mini-batch size

Here we have 2 × 2 × 4 × 4 × 3 = 192 trainings.
The problem is that the training time for an epoch really depends on the mini-batch parameters. I saw training times of up to 2 hours (20000 epochs) when I launched this (which is why I killed it).
Do you think this is reasonable? I can try again once #21 is done, might be a lot faster.
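Back-of-the-envelope, here is the serial time budget under the pessimistic 2 h per training seen above (the numbers are purely illustrative, and `n_workers` is just a hypothetical parallelism factor):

```python
n_trainings = 192
hours_per_training = 2.0  # worst case observed so far
n_workers = 1             # e.g. number of GPUs or CPU jobs running in parallel

total_hours = n_trainings * hours_per_training / n_workers
print(f"{total_hours:.0f} h = {total_hours / 24:.1f} days")  # 384 h = 16 days serially
```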
In fact, the main 'transitions' I imagine are when the mini-batch and RW size are below vs. above the average community size. That's when the samples that the model sees really change.
First, setting RW aside: what happens if the mini-batch size is smaller than the average community (and in that case, what happens to the largest communities that are always above mini-batch size)? I'm guessing the model doesn't see many borders between communities (i.e. samples that include significant portions of 2 distinct communities). What does that change in the results of learning?
Second, what happens if the MB size is larger than the average community, but the RW size is smaller? Looks like the same situation as previously, except that the model is exposed to samples with disconnected sub-groups (the RWs), and might think that the average community size corresponds to the size of a RW.
How do these settings influence what is learned? Are there other nasty cases/combinations we're not seeing?
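To make the two cases above concrete, a small sketch classifying a parameter combination against the average community size. It assumes an SBM-style benchmark where the number of nodes and communities is known; the function and variable names are placeholders, not code from this repo:

```python
def minibatch_regime(batch_size, max_walk_length, n_nodes, n_communities):
    """Rough classification of which sampling regime a setting falls into."""
    avg_community_size = n_nodes / n_communities
    if batch_size < avg_community_size:
        # Mini-batches mostly fall inside a single community: few samples
        # show borders between communities.
        return "mini-batch below average community size"
    if max_walk_length < avg_community_size:
        # Batches span communities, but each walk stays inside one, so a batch
        # is a set of disconnected sub-groups of roughly walk-length size.
        return "RW below average community size (disconnected sub-groups)"
    return "both above average community size"


# Example: 500 nodes split into 10 communities (average size 50).
for bs, wl in [(25, 10), (100, 10), (100, 100)]:
    print(bs, wl, "->", minibatch_regime(bs, wl, 500, 10))
```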
192 trainings seems reasonable. For the case in #6, I believe I was doing around 240 trainings, which took around a week. As you say, some of the trainings take a really long time, and you only see it once you're training the model. For the PSMN, I'm not sure how easy it will be to install the TF/keras libraries. Maybe we can try running some instances on phrunch/brunch first, to see.
This issue is mixing two questions: minibatch validation (i.e. checking it works as expected when there are no convolutions), which is now #28, and minibatch sensitivity analysis (i.e. exploring how it behaves with different parameter values), which is now #29.
Closing in favour of those two issues.
It should work at least as well as full batch training in this case.