Loss function
categorical_crossentropy
Because we use softmax $\to$ the outputs sum to 1, so if the prediction is correct (i.e. argmax of yhat coincides with argmax of y), the remaining positions are driven to 0 (or approximately 0) $\to$ ok
Target vector
one-hot encoding
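A small numeric sketch (values are illustrative, not from the article) of how a softmax output and a one-hot target combine under categorical cross-entropy:

```python
import numpy as np

def softmax(scores):
    # Exponentiate and normalize, so the outputs are positive and sum to 1.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

y_true = np.array([0., 1., 0., 0.])           # one-hot target: class 1 is the correct label
y_hat = softmax(np.array([5., 7., 4., 6.]))   # final-layer scores -> probabilities

# Categorical cross-entropy only looks at the probability of the true class;
# since softmax outputs sum to 1, pushing that probability toward 1
# automatically pushes all other positions toward 0.
loss = -np.sum(y_true * np.log(y_hat))
print(y_hat, y_hat.sum(), loss)
```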
Multi-label classification
Activation function
Sigmoid
Sigmoid maps each score of the final layer to a value between 0 and 1, independently of what the other scores are.
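A minimal sketch of that independence (scores are illustrative): the sigmoid outputs do not sum to 1, unlike softmax.

```python
import numpy as np

def sigmoid(scores):
    # Each score is squashed to (0, 1) on its own; no normalization across classes.
    return 1. / (1. + np.exp(-scores))

probs = sigmoid(np.array([2., -1., .15, 3.]))
print(probs, probs.sum())   # approx [0.88, 0.27, 0.54, 0.95]; the sum is not 1
```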
Because we use sigmoid $\to$ the matching loss is binary_crossentropy
Binary Cross-Entropy Loss is also called Sigmoid Cross-Entropy loss. It is a Sigmoid activation plus a Cross-Entropy loss. Unlike Softmax loss, it is independent for each vector component (class), meaning that the loss computed for every vector component is not affected by the other component values.
If we used categorical cross-entropy here, we would only penalize missing labels (true classes given low probability), but not residual labels (absent classes wrongly predicted as present).
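A sketch of binary cross-entropy computed per class on a multi-hot target (values are illustrative, not from the article); note that it penalizes both missing and residual labels:

```python
import numpy as np

def binary_crossentropy(y_true, y_hat, eps=1e-7):
    # One binary cross-entropy term per class, averaged; each class is penalized
    # both when a true label is missed and when an absent label is predicted.
    y_hat = np.clip(y_hat, eps, 1. - eps)
    return -np.mean(y_true * np.log(y_hat) + (1. - y_true) * np.log(1. - y_hat))

y_true = np.array([1., 0., 0., 1.])                         # multi-hot target: classes 0 and 3 present
y_hat = 1. / (1. + np.exp(-np.array([2., -1., .15, 3.])))   # sigmoid outputs
print(binary_crossentropy(y_true, y_hat))
```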
Target vector
Like one-hot encoding, but it may have multiple ones (a multi-hot vector)
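To make the two setups concrete, a hedged Keras sketch (layer sizes, input shape, and optimizer are placeholders, not taken from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 4  # placeholder

# Multi-class: softmax head + categorical_crossentropy, trained on one-hot targets.
multi_class_model = keras.Sequential([
    keras.Input(shape=(128,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
multi_class_model.compile(optimizer="adam", loss="categorical_crossentropy")

# Multi-label: sigmoid head + binary_crossentropy, trained on multi-hot targets.
multi_label_model = keras.Sequential([
    keras.Input(shape=(128,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="sigmoid"),
])
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```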
Handling data imbalance
Upsampling
But we cannot simply drop the data samples with majority labels, because these samples could be associated with other (minority) labels as well; dropping them would lose those labels too.
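A rough sketch of upsampling instead (assuming the labels live as 0/1 columns in a pandas DataFrame; the column name and factor are hypothetical, not from the article):

```python
import pandas as pd

def upsample_rare_label(df, label_col, factor=3, seed=42):
    # Duplicate (with replacement) rows that carry the rare label rather than
    # dropping majority rows, so labels that co-occur on those rows are preserved.
    rare = df[df[label_col] == 1]
    extra = rare.sample(n=len(rare) * (factor - 1), replace=True, random_state=seed)
    return pd.concat([df, extra]).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical usage, with one binary column per label:
# df = upsample_rare_label(df, "rare_label", factor=3)
```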
TL;DR
Some differences between multi-class and multi-label classification
Article link
https://towardsdatascience.com/multi-label-image-classification-with-neural-network-keras-ddc1ab1afede
Key Takeaways
Multi-class classification
softmax([5, 7, 4, 6])
Loss function
categorical_crossentropy
Because we use softmax $\to$ we make sure that, if the prediction is correct (i.e. argmax of yhat coincides with argmax of y), then the remaining positions are 0 (or approximately 0) $\to$ ok
Target vector
one-hot encoding
Multi-label classification
sigmoid([2, -1, .15, 3])
Loss function
binary_crossentropy
Target vector
Like one-hot encoding but it may have multiple ones
Handling data imbalance