lishen / end2end-all-conv

Deep Learning to Improve Breast Cancer Detection on Screening Mammography

Our MICCAI 2017 paper also works for whole mammogram classification. #1

Closed wentaozhu closed 6 years ago

wentaozhu commented 7 years ago

Hi Shen,

I got your email about your paper. Our MICCAI 2017 paper proposes several schemes for whole mammogram classification.

Zhu, Wentao, Qi Lou, Yeeleng Scott Vang, and Xiaohui Xie. "Deep multi-instance networks with sparse label assignment for whole mammogram classification." MICCAI (2017).

Thanks! Wentao

lishen commented 6 years ago

Dear @wentaozhu ,

Thank you for forwarding me this paper! I have read through it. I can definitely see the innovation of imposing sparsity on the patches' probabilities. However, this sort of constraint is actually already implied in a convolutional network when L1 or L2 weight decay is used. I do not use L1 decay because it often makes training rough, and it does not improve the final score much; I only use the L1 norm when I am interested in feature selection. What's more, you have introduced an additional hyperparameter in your framework, which can be annoying to tune in practice.
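For reference, a minimal Keras sketch of what is meant by weight decay here; the filter count and penalty strength are illustrative values, not taken from this repo's code:

```python
from keras.layers import Conv2D
from keras import regularizers

# L1/L2 weight decay in Keras: the penalty is attached to the layer's kernel
# (its weights), not to its outputs.
conv_with_weight_decay = Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),  # L2 weight decay on the kernel
)
```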

I also want to point out that you have used two layers of softmax activations. That may impede gradient flow in a deep net.

Since this issue is not related to a bug or a new feature, I'm going to close it now.

Li

wentaozhu commented 6 years ago

Dear Shen,

The sparsity of the responses is not related to the sparsity of the weights. Using L1 on the last layer's weights, or something like that, is not appropriate. We force the mass probability map to be sparse. Also, we treat the patch labels as latent variables and try to assign labels to those patches.

Thanks a lot!

Wentao

lishen commented 6 years ago

@wentaozhu , My bad for not reading it thoroughly. But I disagree that activity sparsity is not related to weight sparsity; they are actually highly related. For the sake of argument, assume a neuron's activity is forced to zero by the L1 regularizer. Then all weights connected to that neuron receive zero gradients during training, which is equivalent to setting those weights to zero because they make no contribution to the final decision.
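To make that argument concrete, here is a minimal sketch, assuming a single ReLU unit and TensorFlow 2 (the weights and input are arbitrary illustrative values, not code from this repo): a unit whose activity is zero passes no gradient to its incoming weights.

```python
import tensorflow as tf

# A single ReLU unit whose activity has been pushed to zero: the pre-activation
# is negative, so the output is 0 and the gradient w.r.t. the incoming weights
# vanishes -- the unit behaves as if its weights were zero.
w = tf.Variable([[0.5], [-2.0]])            # incoming weights (illustrative values)
x = tf.constant([[1.0, 1.0]])               # one input example
with tf.GradientTape() as tape:
    activity = tf.nn.relu(tf.matmul(x, w))  # pre-activation = -1.5 -> activity = 0
    loss = tf.reduce_sum(activity)          # any loss that depends on the activity
print(tape.gradient(loss, w).numpy())       # [[0.], [0.]]
```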

Activity regularization is common practice, so it is available as an off-the-shelf feature in some deep learning frameworks. For example, check out this Keras document: https://keras.io/layers/convolutional/#conv2d. All you need to do is pass an L1 regularizer as the activity_regularizer argument of a layer.
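A minimal sketch of that Keras usage (the filter count and penalty strength are illustrative):

```python
from keras.layers import Conv2D
from keras import regularizers

# Activity regularization in Keras: the L1 penalty is applied to the layer's
# outputs, pushing the activations (rather than the weights) toward zero.
sparse_activity_conv = Conv2D(
    filters=32,
    kernel_size=3,
    activation="relu",
    activity_regularizer=regularizers.l1(1e-5),  # sparsity on activations
)
```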

Anyway, thank you for getting back to me! Your paper was an interesting read.

wentaozhu commented 6 years ago

Dear Prof. Shen,

Thank you very much for your expertise!

Sparse weights are independent of your data. If we force the weights of the last layer, or the last two layers, to be sparse, it means our model is biased toward some positions or some categories.

Sparse activity means the responses of our model, given our data, are sparse, which is typically independent of the position in the map. It forces the model to learn intrinsic features from the data.

Thank you very much for your interest and the discussion!

Thanks, Wentao

lishen commented 6 years ago

@wentaozhu ,

Your method can actually be replicated in my code like this: add a heatmap with softmax activation, an FC layer, and an output layer on top of the patch classifier. The key idea in your paper is to rank the patches first and focus on getting the top-1 correctly classified. Because of this ranking, the decision becomes independent of position.

However, this can be implemented through an FC layer. To see that, let's perform a thought experiment. Assume the flattened softmax layer has a size of 100 (i.e. 100 patches) and the FC layer also has a size of 100. All I need to do is make the weight matrix of the FC layer diagonal:

weight matrix = diag(W_1,1, W_2,2, ..., W_100,100)

and let W_1,1, W_2,2, ..., W_100,100 have large values. Then if one of the patches has a value close to 1, one of the elements of the FC layer will be activated. The entire FC layer is inactive iff all patches have values close to 0. This is just like the multi-instance learning assumption you made in your paper, right? And because the FC layer is fully connected to the output layer, if one element is activated it will drive the final output to produce a probability close to 1, assuming all weights are properly learned.
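A small numeric sketch of this thought experiment, assuming a ReLU activation, no bias, and arbitrary illustrative values:

```python
import numpy as np

n = 100
W = np.diag(np.full(n, 10.0))         # diag(W_1,1, ..., W_100,100) with large values

patch_probs = np.zeros(n)
patch_probs[42] = 0.99                # one patch close to 1 (index chosen arbitrarily)

fc_one_hot = np.maximum(0.0, patch_probs @ W)   # ReLU of the FC pre-activation
fc_all_zero = np.maximum(0.0, np.zeros(n) @ W)

print(fc_one_hot.max())               # ~9.9: a single strongly activated FC element
print(fc_all_zero.max())              # 0.0: the FC layer stays inactive
```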

To make it sparse, all you need to do is add an L1 activity regularizer to the FC layer. That's why I said at the very beginning that your approach is already implied in a regular neural network; your method just expresses the idea explicitly.
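Putting the pieces together, a hypothetical Keras sketch of that replication (the shapes, layer sizes, and regularization strength are illustrative, not the repository's actual model):

```python
from keras.layers import Input, Flatten, Dense
from keras.models import Model
from keras import regularizers

# On top of a patch classifier: a 10x10 heatmap of patch softmax probabilities,
# flattened into 100 values, followed by a sparsely activated FC layer and a
# single sigmoid output for the whole-image probability.
heatmap = Input(shape=(10, 10, 1), name="patch_softmax_heatmap")
x = Flatten()(heatmap)                                        # 100 patch probabilities
x = Dense(100, activation="relu",
          activity_regularizer=regularizers.l1(1e-4))(x)      # sparse FC activities
output = Dense(1, activation="sigmoid")(x)                    # whole-image probability
model = Model(inputs=heatmap, outputs=output)
model.summary()
```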

A large neural network looks like a black box; you just have to look inside it to understand what's going on.