RSKothari opened this issue 4 years ago
What do you use the two remaining output channels for? Maybe the loss is bumpy because of whatever loss function you apply to those?
Hi @javiribera, those two channels remain unbounded, i.e., I don't attach them to any loss function. I think I should provide a detailed report on the analysis I've done.
First observation: the larger the batch size, the more stable the training. I needed a minimum batch size of 16 to keep convergence stable for a longer time. Learning rate: 5e-5.
Case A: only wHauss. When used without any other loss functions, wHauss works only when the activation is a sigmoid on channel 1 of a 3-channel segmentation output (the other 2 channels are unbounded and free to assume whatever values they want); see the sketch after Case C below. Sigmoid works until a minimum is reached, and then training crashes. Softmax across all 3 channels fails spectacularly.
Case B: wHauss in a multi-task paradigm. When combined with other loss functions (if interested, please see https://arxiv.org/pdf/1910.00694.pdf), it remains stable at the minimum. Interestingly, it works well with small batch sizes and softmax.
Case C: wHauss with pretrained stable weights. When training is initialized from pretrained weights, wHauss remains stable for a considerable amount of time, although it eventually crashes after reaching the minimum.
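For reference, this is roughly what the Case A setup looks like. It is a minimal sketch: `model` stands in for my U-Net with a 3-channel output, and `whd_loss` stands in for however the weighted Hausdorff distance from this repository is invoked; neither name comes from this repo.

```python
import torch

# Sketch of "Case A": WHD on channel 1 only, channels 2-3 left unbounded.
# `model` and `whd_loss` are placeholders, not names from this repository.

def case_a_step(model, whd_loss, optimizer, images, gt_points):
    """One training step with the WHD attached to channel 1 only."""
    optimizer.zero_grad()

    out = model(images)                   # (B, 3, H, W), raw logits
    prob_map = torch.sigmoid(out[:, 0])   # sigmoid applied to channel 1 only
    # out[:, 1] and out[:, 2] are left untouched: no activation, no loss term.

    loss = whd_loss(prob_map, gt_points)  # the WHD only sees channel 1
    loss.backward()
    optimizer.step()
    return loss.item()
```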
Regarding "the larger the batch size, the more stable the training": well, this is true for any mini-batch SGD-based optimization.
Maybe this discussion helps: https://github.com/javiribera/locating-objects-without-bboxes/issues/2
I cannot help with segmentation tasks since I have never applied the WHD to that purpose and it was not the intention of the paper. This repository is the implementation of that paper and is not intended to be an all-in-one codebase for other tasks.
So let's focus on your original case (0). The problem of interest is that when using it by itself, you see the WHD loss decrease in a very noisy manner. You mention it converges within 1 epoch, which seems very fast. I do remember that the WHD is noisy but never found it a huge problem. Do you see the same with SGD?
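If you try that comparison, the only thing that should change is the optimizer; something like the sketch below, where the SGD learning rate and momentum are placeholders you would need to tune.

```python
import torch

def make_optimizer(model, use_sgd=False):
    # Sketch of the comparison suggested above: keep the model and loss fixed
    # and only swap Adam for plain SGD. The SGD hyperparameters are placeholders.
    if use_sgd:
        return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    return torch.optim.Adam(model.parameters(), lr=5e-5)
```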
Would you happen to have any intuition on this? I'm using a U-Net-style network (with skip connections) whose output has 3 channels. The centre of mass (in my case, the pupil centre) is regressed from channel 1.
I apply torch.sigmoid to channel 1 before passing it to the weighted Hausdorff loss, and I use a sufficiently small learning rate (5e-5) with Adam.
I observe that the loss drops from 0.03 to 0.009 and the output of channel 1 starts to look as expected, i.e., we start seeing the expected blob. After converging to a minimum (which happens within 1 epoch), the loss jumps to its maximum (0.1 in my case) and stays there. I checked the gradient norms and found a lot of fluctuation in their values. Furthermore, the loss is jumpy on every iteration.
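For context, this is roughly how I inspect the gradient norm after loss.backward(); it is only a sketch, and the commented-out clipping line is a possible mitigation I am mentioning, not something from this repo.

```python
import torch

def grad_global_norm(model):
    # Global L2 norm over all parameter gradients; call after loss.backward().
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# Inside the training loop (sketch):
#   loss.backward()
#   print(f"iter {it}: loss={loss.item():.4f} grad_norm={grad_global_norm(model):.2f}")
#   # Optional mitigation, not verified here: clip before optimizer.step()
#   # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()
```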
Would you have an intuition about this?