thomasverelst / dynconv

Code for Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference (CVPR 2020)
https://arxiv.org/abs/1912.03203

Questions about mask generation #12

Open · magehrig opened this issue 2 years ago

magehrig commented 2 years ago

Hi @thomasverelst

Congrats, nice work! I have two questions out of curiosity:

1) Forward pass: Why did you choose to sample from the Bernoulli distribution instead of the Gumbel-Softmax? To my knowledge, sampling from the Bernoulli distribution introduces a bias into the gradient estimation, which could make optimization trickier. I understand that you would not be able to use sparse convolutions during training, but I wonder if there is another reason (see the sketch after these questions for the kind of sampling I mean).

2) Have you tried annealing the temperature parameter to less than 1?
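
A minimal sketch of the binary (two-class) Gumbel-Softmax sampling being discussed, assuming a PyTorch setup where a mask unit produces per-position logits. The `hard` flag switches between the straight-through hard mask and the soft relaxation, and `tau` is the temperature from question 2. The helper name and shapes are illustrative, not the repo's actual code:

```python
import torch

def gumbel_binary_mask(logits, tau=1.0, hard=True):
    """Straight-through Gumbel-Softmax sampling of a binary spatial mask
    (hypothetical helper for illustration only).

    logits: (B, 1, H, W) unnormalized execute-vs-skip scores from the mask unit.
    tau:    Gumbel-Softmax temperature (question 2 asks about annealing it).
    hard:   if True, the forward pass uses a hard 0/1 mask while gradients
            flow through the soft probabilities (straight-through estimator).
    """
    # Sample two independent Gumbel(0, 1) noises, one per class (execute / skip).
    g1 = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    g2 = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    # Two-class Gumbel-Softmax (binary concrete) relaxation.
    soft = torch.sigmoid((logits + g1 - g2) / tau)
    if not hard:
        return soft  # soft mask: no hard decisions, so no sparse convolution
    # Straight-through: hard 0/1 values in the forward pass, soft gradients backward.
    hard_mask = (soft > 0.5).float()
    return hard_mask - soft.detach() + soft
```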

thomasverelst commented 2 years ago

Hi! I think you mean that it uses the straight-through version of the Gumbel-Softmax trick (the hard version). I did not thoroughly ablate this, but my initial results indicated slightly better performance for the hard straight-through version. The straight-through estimator is indeed biased, but the network's weights then directly optimize for the sparse convolutions. I agree, though, that the soft Gumbel-Softmax with the temperature annealed towards 0 might improve training stability.

The best solution might be to weight spatial positions by their probabilities (i.e. soft attention), e.g. by using the soft Gumbel-Softmax and multiplying the executed positions (those where prob_exec > 0.5) by (prob_exec - 0.5) * 2, both at training and inference time. As I have more compute available nowadays, I might explore this over the summer while writing my PhD thesis.
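
A rough sketch of the soft-attention weighting described above, assuming soft Gumbel-Softmax probabilities `prob_exec` in [0, 1] per spatial position; the function name is hypothetical and not part of the repo:

```python
import torch

def soft_weighted_mask(prob_exec):
    """Weight executed positions by their execution probability.

    prob_exec: (B, 1, H, W) soft Gumbel-Softmax probabilities of executing
               each spatial position.
    Returns a mask that is zero at skipped positions (prob_exec <= 0.5) and
    rescales executed positions by (prob_exec - 0.5) * 2, so the weight ramps
    smoothly from 0 to 1; the same weighting would be applied at both
    training and inference time.
    """
    executed = (prob_exec > 0.5).float()   # positions where the conv is executed
    weight = (prob_exec - 0.5) * 2.0       # in (0, 1] for executed positions
    return executed * weight
```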

magehrig commented 2 years ago

Congrats on your upcoming PhD! Yes, I would be interested in knowing whether you are able to get the non-straight-through version to work well.

Good luck