lyst / lightfm

A Python implementation of LightFM, a hybrid recommendation algorithm.
Apache License 2.0

user bias term for WARP loss #545

Open tianpunchh opened 4 years ago

tianpunchh commented 4 years ago

Hi, this is not a question about this particular codebase; I am just wondering how you achieve the following.

Recently I noticed that if you train a collaborative filtering model with any kind of WARP-like loss and the final output layer is just a linear score, the user bias term receives no updates. This is not hard to understand after some thought and experimentation: the prediction is value = Eu * Ei + bu + bi, and because the negative samples in a pairwise comparison come from the same user, the user bias bu cancels exactly between the positive and negative scores.

However, for WARP loss it is usually suggested not to use a sigmoid or any other rescaling activation, because we are not really fitting 0/1 classes but trying to maximize the penalty on wrongly ranked pairs. I tried a sigmoid prediction activation with WARP and the gradient diminishes quite fast; most scores end up close to 1, so the rank penalty is too small.
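To make the cancellation concrete, here is a minimal sketch (not LightFM's actual code, just illustrative names) showing that with a plain linear score the pairwise ranking margin a WARP-style loss looks at does not depend on the user bias at all:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

e_u = rng.normal(size=dim)          # user embedding
e_pos = rng.normal(size=dim)        # positive item embedding
e_neg = rng.normal(size=dim)        # sampled negative item embedding
b_pos, b_neg = 0.1, -0.2            # item biases

def score(e_item, b_item, b_user):
    # linear prediction: value = E_u * E_i + b_u + b_i
    return e_u @ e_item + b_user + b_item

# A pairwise/WARP loss only sees the margin (s_pos - s_neg).
# The margin is identical for any value of the user bias,
# so no gradient ever reaches b_u through this loss.
for b_user in (0.0, 5.0):
    s_pos = score(e_pos, b_pos, b_user)
    s_neg = score(e_neg, b_neg, b_user)
    print(b_user, s_pos - s_neg)    # same margin regardless of b_user
```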

However, when I inspect the LightFM model's user bias terms, they clearly do get updated under WARP, and there is obviously no sigmoid-like activation on the final prediction (the scores are not bounded by 1). So my question is: what activation function, or what other trick, makes this happen?
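For reference, this is roughly how I checked that the biases move (illustrative snippet using the bundled MovieLens fetcher; any implicit-feedback dataset would do):

```python
from lightfm import LightFM
from lightfm.datasets import fetch_movielens

data = fetch_movielens(min_rating=4.0)

model = LightFM(loss='warp', no_components=16, random_state=42)
model.fit(data['train'], epochs=5)

# The user bias terms are clearly non-zero after training with WARP.
print(model.user_biases[:10])
print(abs(model.user_biases).mean())
```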