tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

question about dropconnect #494

Open CoinCheung opened 5 years ago

CoinCheung commented 5 years ago

Hi,

Thanks for bringing this awesome work to the community!!

While reading the code I found a drop-connect operation which seems to set the activations of some of the samples in the input batch to zero. I have just read the drop-connect paper, which says that drop-connect randomly drops some of the weights for each sample in the input batch. I think the operation introduced in the paper would not cause the whole activation of a sample to become zero, yet the drop-connect operation in this repo seems to drop whole conv outputs rather than some of the weights in them. Could you please explain why you implemented it like this?

ijkilchenko commented 5 years ago

More precisely, drop-connect should randomly set the connections (hence the name) between neurons to zero. It's a generalization of dropout: in dropout ALL of the connections to a specific neuron are turned off, but in drop-connect only SOME of the connections might be off, so the neuron can stay partially turned on. Here's another explanation.
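
Here's a toy sketch of the difference (just for illustration, not code from this repo):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # activations coming out of the previous layer
W = rng.normal(size=(3, 4))      # weights of the current layer
p = 0.5                          # keep probability (the 1/p rescaling is omitted for clarity)

# Dropout: drop whole neurons, i.e. every connection of a dropped neuron is off.
neuron_mask = rng.random(size=(4,)) < p
y_dropout = W @ (x * neuron_mask)

# Drop-connect: drop individual connections (weights); a downstream neuron can
# stay partially active because only some of its incoming weights are zeroed.
weight_mask = rng.random(size=W.shape) < p
y_dropconnect = (W * weight_mask) @ x
```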

Do you have a specific line in some model that you're confused about? I could try to explain further.

CoinCheung commented 5 years ago

Hi, thanks for the explanation!!

I am confused about this line: https://github.com/tensorflow/tpu/blob/56cd4cd324ad84145112bafa4aa83d418436951d/models/official/efficientnet/utils.py#L157.

According to the paper, dropout randomly ignores some of the outputs of a layer, while drop-connect ignores connections: it does not force any output to be 0, but the contribution of certain previous neurons to an output is ignored. I think that is where the names drop 'out' and drop 'connect' come from. What confuses me is that line 157 creates a mask of shape (batch_size, 1, 1, 1), which would drop some of the samples from the batch. Is this the precise meaning of drop-connect proposed in the paper?
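
For example, as far as I understand, a mask of that shape broadcasts over whole samples (my own toy example, not code from the repo):

```python
import numpy as np

x = np.ones((2, 3, 3, 4))                        # (batch, height, width, channels)
mask = np.array([1.0, 0.0]).reshape(2, 1, 1, 1)  # one keep/drop flag per sample

out = x * mask
print(out[0].min(), out[0].max())  # 1.0 1.0 -> first sample is untouched
print(out[1].min(), out[1].max())  # 0.0 0.0 -> second sample is zeroed entirely
```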

ijkilchenko commented 5 years ago

This line doesn't drop samples from the batch. It drops connections.

CoinCheung commented 5 years ago

Hi, I am still confused. When batch_size=1, random_tensor can be a tensor like [[[[0]]]]. When we multiply this zero tensor with the activation from the previous layer, the input activation to the following layer becomes all zeros. That looks more like dropping a sample than dropping connections, doesn't it?

I believe drop-connect means dropping some of the weight values in the convolution kernels; am I correct to understand it like this?

ijkilchenko commented 5 years ago

In your example, you've simply reduced drop-connect to dropout, that's all.

mingxingtan commented 4 years ago

Hi @CoinCheung, good catch on this issue. As cited in the paper, we are using "stochastic depth": here the name "drop_connect" means dropping the entire conv branch and keeping only the residual. The mechanism is more like the "survival probabilities" of the original paper, but for historical reasons we use a drop_connect ratio (which we found more intuitive).
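
Roughly, the helper in utils.py does something like this (a simplified sketch with renamed arguments, not the exact code in the repo):

```python
import tensorflow as tf

def drop_connect(inputs, is_training, survival_prob):
  """Per-sample stochastic depth (simplified sketch of the utils.py helper)."""
  if not is_training:
    return inputs  # no-op in the eval graph

  batch_size = tf.shape(inputs)[0]
  # One uniform random value per sample; adding survival_prob and flooring
  # yields a 0/1 keep flag that is 1 with probability survival_prob.
  random_tensor = survival_prob + tf.random.uniform(
      [batch_size, 1, 1, 1], dtype=inputs.dtype)
  binary_tensor = tf.floor(random_tensor)
  # Rescale the kept samples so the training-time expectation matches the
  # eval graph, then zero out the whole residual branch of dropped samples.
  return inputs / survival_prob * binary_tensor
```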

Does it make sense to you?

CoinCheung commented 4 years ago

Yes, thanks for explaining.

jnyborg commented 4 years ago

Just as a follow-up to this question: can anyone explain why the input is always divided by the constant keep_prob in this function? It's this line: https://github.com/tensorflow/tpu/blob/56cd4cd324ad84145112bafa4aa83d418436951d/models/official/efficientnet/utils.py#L159

Is this supposed to be part of the drop connect that happens if the layer is not set to zero by stochastic depth? Or is this function only supposed to implement stochastic depth, in which case the division should be unnecessary, right?

Renthal commented 4 years ago

To my understanding, DropConnect and Stochastic Depth are two separate concepts. While the former is a form of regularization that operates on the weights of a particular layer (or neuron), the latter drops entire layers altogether.

In the original stochastic depth paper, the decision whether to drop a layer or not is made on a mini-batch basis (either the entire batch is affected or none of it). In the EfficientNet implementation, however, it appears that the entire output of a layer is dropped on a per-sample basis (since the binary tensor has a batch dimension equal to the mini-batch size). Is there a specific reason for this change? Moreover, this behavior is an implementation of stochastic depth and not really drop-connect, so I think that using the term "drop_connect" is not more intuitive but rather the opposite: more confusing.

Regarding dividing the input by keep_prob, it is also not 100% clear to me why it is done, but I came up with a tentative explanation. The original stochastic depth paper shows in Equation (5) how to account for the model-ensemble perspective at test time: by multiplying the input by keep_prob (not dividing it), and this should happen at test time only, not at train time. Is it possible that dividing at train time is a trick to avoid having to multiply at test time? It seems to me that the effect would be similar, but maybe I am wrong. If that is the case, I do not fully understand the advantage of doing so rather than faithfully following the original implementation.
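
A quick numerical check of this idea (my own toy calculation, not code from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0                                    # some activation of the residual branch
keep_prob = 0.8
keep = rng.random(1_000_000) < keep_prob   # per-sample survival flags

# Original paper: no scaling at train time, multiply by keep_prob at test time.
unscaled_train = np.where(keep, x, 0.0)
print(unscaled_train.mean(), keep_prob * x)   # ~0.8 vs 0.8 -> they match

# EfficientNet-style: divide by keep_prob at train time, do nothing at test time.
scaled_train = np.where(keep, x / keep_prob, 0.0)
print(scaled_train.mean(), x)                 # ~1.0 vs 1.0 -> they match too
```

In expectation both schemes give the same train/test relationship, so dividing during training lets the test-time graph stay a plain identity.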

jnyborg commented 4 years ago

Nicely spotted! I agree, the division by keep_prob does indeed seem like a trick to avoid having to multiply by it at test time.

Renthal commented 4 years ago

Thank you! Apparently the terminology issue has been addressed here: https://github.com/tensorflow/tpu/commit/cd433314cc6f38c10a23f1d607a35ba422c8f967. Since it was not mentioned in this thread, I hope I linked the commit correctly.

The computation of the survival tensor on a per-sample basis vs. a per-batch basis, however, remains unaddressed.

mingxingtan commented 4 years ago

Yes, I have renamed "drop_connect_ratio" to "survival_prob", as shown in @Renthal's comment.

We use a per-sample basis and perform the scaling at training time in order to simplify the eval graph. At test time, we simply ignore stochastic depth. This should be equivalent to re-scaling the activations at test time.

baynaa7 commented 3 years ago

> Hi @CoinCheung, good catch on this issue. As cited in the paper, we are using "stochastic depth": here the name "drop_connect" means dropping the entire conv branch and keeping only the residual. The mechanism is more like the "survival probabilities" of the original paper, but for historical reasons we use a drop_connect ratio (which we found more intuitive).
>
> Does it make sense to you?

Hi @mingxingtan, you said that you are "dropping the entire conv". However, in the EfficientNet code (git tensorflow/tpu),

[tensorboard screenshot: issueEff]

a certain case seems to be valid. More specifically, as can be seen in the tensorboard output image above, after passing through each layer of block_2 (for example), block_1's output tensor is added to the batchnorm's output tensor (which is divided by survival_prob and multiplied by the binary tensor), so the resulting output tensor (say output_block_2) is just a sparse tensor. That tensor is then used as one input of the add layer. If all values of output_block_2 were zero, it could be seen as the stochastic depth procedure. However, output_block_2 always seems to be a sparse tensor. If output_block_2 is sparse, then the code seems more like an implementation of dropout, not stochastic depth.

Can you clarify it? Thanks,