idiap / importance-sampling

Code for experiments regarding importance sampling for training neural networks

Support of networks with multiple outputs #26

Closed MaximilianBoemer closed 4 years ago

MaximilianBoemer commented 4 years ago

Hi! First of all, thanks for this nice package; I really enjoyed the paper too. Is there an extension planned so the library can be used with multi-output networks? I would also like to contribute and have been thinking about how to approach it. I'd be happy to discuss it with you. Thanks and best regards, Max

angeloskath commented 4 years ago

Hi Max,

Sorry for the late reply and thanks for the good words.

So, it would be quite difficult to use all the machinery with multi-output networks because we actually need to be able to compute the loss and take gradients with respect to all outputs.

The simplest solution I can think of is to create a loss layer that computes all the losses for all the outputs and use it as the last layer of the network. Then we can set the loss to be the identity function and compute the gradient of the loss wrt the inputs of the loss layer.

If you do want to contribute, though, maybe helping out with the port to TF 2 would be more useful, since I suspect the library will stop working for many people if we don't update it.

In any case, thanks for your interest in the work and let me know if I can help in any way.

Cheers, Angelos

MaximilianBoemer commented 4 years ago

Hi Angelos,

Thank you for your answer. Do you know of any example code where such an additional loss layer was introduced? I hope to find time in the next few weeks to try your idea on my fully convolutional network with multiple outputs.

Cheers, Max

angeloskath commented 4 years ago

Feel free to reopen the issue (or open a new one) if you need more help regarding this matter.

MasterScrat commented 4 years ago

@MaximilianBoemer any news on this? I would really need it too.

MaximilianBoemer commented 4 years ago

Hi Florian! I had a working approach, but sadly I can't share it with you because it was for a project at my company. The loss-layer approach that Angelos proposed works. I basically concatenated all output tensors after upsampling them to the same shape and computed the loss on the resulting tensor (roughly as in the sketch below). Most of the work was adapting the input format of my model to the one expected by the importance-sampling framework. In my case (8 output maps, a single-shot detector with 4 heads) I could not get a speedup like the ones described in the paper for classification tasks, because the variance of the gradient norm (GN) did not surpass the threshold.
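For illustration, the wiring was roughly like the following (layer names and shapes here are placeholders, not my actual model):

from keras.layers import Input, UpSampling2D, concatenate

# Hypothetical head outputs at different resolutions (shapes are placeholders)
head1 = Input(shape=(32, 32, 2))
head2 = Input(shape=(16, 16, 2))
head3 = Input(shape=(64, 64, 2))

# Upsample everything to a common spatial shape, then concatenate
# along the channel axis so a single loss layer sees one tensor
o1 = UpSampling2D(size=(2, 2))(head1)  # 32x32 -> 64x64
o2 = UpSampling2D(size=(4, 4))(head2)  # 16x16 -> 64x64
y = concatenate([o1, o2, head3], axis=-1)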

One observation I made is that the speedup (surpassing the threshold) works better for tasks with a sparse loss (e.g. classification) than for dense supervision: for example, the speedup worked for MNIST classification but not for MNIST autoencoders. @angeloskath do you have any more insight on this? My assumption is that dense prediction/target pairs lead to a reduction in the variance of the gradient norm.

Hope this helps!

MasterScrat commented 4 years ago

Hey Maximilian, thanks a lot for the fast answer :D

I have adopted a similar method for now, where I concatenate all the output tensors. I lose some control, e.g. I can't assign a weight to each loss independently anymore, but it should give me an idea of whether I can expect some speedup or not.

My task is a multi-label, multi-class classification problem, so your observation about the loss needing to be sparse gives me hope ;-)

My main problem right now is GPU memory usage, as I reported in #28. I basically have to use a much smaller model (EfficientNetB0 instead of EfficientNetB3) and much smaller images (23x13 instead of 165x95) to make it fit within 8GB. That should let me get an idea of whether this approach speeds things up on my dataset, but it is too limiting for training a model I could use in practice. Did you also have this problem?

angeloskath commented 4 years ago

Hi,

@MaximilianBoemer I assume that for autoencoders you would be using an MSE loss, which does not suffer from vanishing gradients and is hence much less susceptible to variance (though more analysis is needed for that claim). Furthermore, the loss would be almost as good as the upper bound proposed in the paper, because the gradient is simply the difference and the norm of the difference is the loss (although squared). The important part is that using the threshold we can estimate whether IS would help, and in many cases it wouldn't, even with access to the full gradient norms; it would just be marginally better, which doesn't justify spending extra computation time on IS.
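As a quick sanity check of that relationship, assuming a plain sum-of-squares per-sample loss:

import numpy as np

# For L = ||y - t||^2 the gradient wrt y is 2 * (y - t),
# so the gradient norm is exactly 2 * sqrt(L)
y = np.random.randn(10)
t = np.random.randn(10)
loss = np.sum((y - t) ** 2)
grad_norm = np.linalg.norm(2 * (y - t))
assert np.isclose(grad_norm, 2 * np.sqrt(loss))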

@MasterScrat Using a loss layer doesn't really take away any flexibility; it just changes the way you define those weights. For instance, the loss layer could use per-output weights to sum up the losses. Also make sure that you are computing the upper bound correctly: it works better if all the final activations are performed inside the loss layer, so that the gradient norm is computed wrt the output of the last layer with weights. I will look into the memory usage in the corresponding issue.

Feel free to reopen the issue or ask further questions.

Cheers, Angelos

MasterScrat commented 4 years ago

I haven't looked in detail at what a "loss layer" is, as using a final keras.layers.Concatenate seemed to do the trick.

Do you mean that since the Concatenate layer doesn't have weights, the gradient norm used for the upper bound will always be 0?

angeloskath commented 4 years ago

I haven't looked in detail at what a "loss layer" is, as using a final keras.layers.Concatenate seemed to do the trick.

Usually when coding in Keras you would have a model definition similar to the following:

x = Input(...)
y = Layer1(...)(x)
...
...
y1 = FinalLayer1(...)(y)  # the heads consume the hidden features, not the raw input
y2 = FinalLayer2(...)(y)
model = Model(inputs=[x], outputs=[y1, y2])
model.compile(loss=[func1, func2])

Now what I mean by a loss layer is that you could instead define your model as follows:

x = Input(...)
targets = Input(...)
y = Layer1(...)(x)
...
...
y1 = FinalLayer1(...)(y)
y2 = FinalLayer2(...)(y)
y = concatenate([y1, y2])
l = LossLayer(...)([y, targets])
model = Model(inputs=[x, targets], outputs=[l])
model.compile(loss=lambda y_true, y_pred: y_pred)  # identity loss: LossLayer already computed it
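Note that with this setup the targets become a model input and the compiled loss is the identity, so fit only needs a dummy array for y_true (X and Y here stand for your inputs and targets):

import numpy as np

# y_true is ignored by the identity loss; LossLayer already produced
# the per-sample loss, so a dummy array of the right shape suffices
model.fit([X, Y], np.zeros((len(X), 1)), epochs=10)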

Do you mean that since the Concatenate layer doesn't have weights, the gradient norm used for the upper bound will always be 0?

No, because the gradient norm is not taken wrt the weights. What I mean is that, given the following structure

y1 = FinalLayer1(...)(x)  # x: the shared features feeding the heads
y2 = FinalLayer2(...)(x)
y = concatenate([y1, y2])
l = LossLayer(...)([y, targets])

the more work that is done inside the loss layer, the better the upper bound will be. For instance, if you are using a sigmoid or softmax, apply it inside the loss layer.
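To make that concrete, here is a minimal sketch of what such a loss layer could look like (purely illustrative: two sigmoid heads of 10 logits each with binary cross-entropy; adapt the splits, losses and weights to your heads):

import keras.backend as K
from keras.layers import Layer

class LossLayer(Layer):
    # Applies the final activations and computes a weighted per-sample
    # loss inside the model, so the gradient-norm upper bound is taken
    # wrt the pre-activation outputs of the last layer with weights
    def __init__(self, loss_weights=(1.0, 1.0), **kwargs):
        self.loss_weights = loss_weights
        super(LossLayer, self).__init__(**kwargs)

    def call(self, inputs):
        y, targets = inputs
        y1, y2 = y[:, :10], y[:, 10:]            # undo the concatenation
        t1, t2 = targets[:, :10], targets[:, 10:]
        w1, w2 = self.loss_weights
        # the sigmoid is applied *inside* the loss layer
        l1 = K.sum(K.binary_crossentropy(t1, K.sigmoid(y1)), axis=-1)
        l2 = K.sum(K.binary_crossentropy(t2, K.sigmoid(y2)), axis=-1)
        return K.expand_dims(w1 * l1 + w2 * l2)

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], 1)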

Angelos

MaximilianBoemer commented 4 years ago

@angeloskath Thanks for your explanation!

@MasterScrat That is also how I did it. In the loss layer, compute the loss for each of the concatenated tensors, optionally weight them, and in the end aggregate them into the final loss.