teddykoker / torchsort

Fast, differentiable sorting and ranking in PyTorch
https://pypi.org/project/torchsort/
Apache License 2.0

Understanding regularization strength #66

Closed simpsus closed 8 months ago

simpsus commented 1 year ago

I use torchsort in my loss function. My issue is that it sometimes returns NaN, depending on the regularization strength. My batches are between 1k and 5k samples and there are ~1k features.

Is there some documentation on regularization strength? Scrolling through the code, I cannot find anything.

Is there a way to estimate a good regularization strength value depending on your data?

I understand that 1 is the default value and reducing regularization strength brings the result closer to the true ordering. So, is the following a good heuristic?

teddykoker commented 8 months ago

Hey @simpsus, apologies for the very late reply I must have missed this. For more information surrounding the regularization strength, it would be best to address the original paper, which denotes the regularization strength as parameter $\varepsilon$. It essentially controls how "soft" the sort/ranking is. As $\varepsilon \to \infty$ the values all collapse to a constant, as $\varepsilon \to 0$ the values converge to the hard soft/ranking values. This also effects how smooth the function is. I don't think there's a great rule to setting the value, in the paper they perform a hyperparameter search using log-spaced values from $10^{-3}$ to $10^{4}$; your best bet is probably to do the same with some cross-validation dataset.