One problem I can see is that LipschitzLinear also needs to be given the right kind= argument.
There are two options: either use kind="one" in all layers, or use kind="one-inf" in the first layer and kind="inf" in all the others. The second option is preferred because it comes with theoretical guarantees.
self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf"))
self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf"))
self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf"))
Here is the plot I get:
Ah! That does solve it. Thanks for the fast answer! I think this is probably the part of the paper where I need to iron out my understanding. For now, I'll just use it as you said. If I may ask, would you recommend using this architecture for modelling a cumulative hazard function, given it must be monotonic with respect to one input and be 0 when that input is 0?
Yeah, you came to the right place, I think. To ensure that your output is 0 at 0, you can either constrain your network to have no biases (but that is too restrictive imo) or make your output always be (model(x) - model(0)).
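A minimal sketch of that second option (the ZeroAtZero wrapper below is illustrative only, not part of the package; it assumes the monotonically constrained input is the first feature and reads "model(0)" as the model evaluated with that input set to zero while the other covariates are kept fixed):

import torch
import torch.nn as nn

class ZeroAtZero(nn.Module):
    """Illustrative helper: forces the output to be 0 when the constrained input is 0
    by subtracting the model evaluated at that input set to zero."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        x0 = x.clone()
        x0[:, 0] = 0.0  # zero out the monotonic input (assumed to be column 0), keep covariates
        return self.model(x) - self.model(x0)

The subtracted term does not depend on the constrained input, so monotonicity in that input is preserved.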
Please close the issue if you're happy with the answer.
Thanks! Great ideas!
Hi! Sorry for reopening, but I wanted to make sure I understand why we use "one-inf" in the first layer and "inf" in the others. Is this equivalent to applying the normalization of equation (11) in the paper? And does applying this kind of normalization make the network more expressive?
On a different note, I was wondering how the lipschitz_const value of the MonotonicWrapper or of the individual LipschitzLinear layers affects the model. If it's set to 1, does it mean that the maximum gradient the model can have with respect to a particular input is 1? At first I thought it would be, but I'm able to fit functions with larger gradients.
Sorry if some of these questions are too simplistic! This is a very interesting idea that's presented in the paper, and I want to make sure I understand it correctly along with its limitations before using it.
Yes, this is eq. (11).
This normalization makes sure that df(x)/dx_i <= lipschitz_const for every input x_i, as you correctly assume.
Can you give me an example of fitting a function with df/dx_i > 1?
Note that the total Lipschitz constant is the product of all the layers' Lipschitz constants. So if you use a lipschitz_const != 1 somewhere, take that into account for the lipschitz_const of the MonotonicWrapper.
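A sketch of how that could look in practice (illustrative only; it assumes the wrapper's lipschitz_const should equal the product of the layer constants, which is how I read the comment above):

import torch.nn as nn
import monotonicnetworks as lmn

K = 4.0                   # desired overall Lipschitz bound (arbitrary choice for illustration)
per_layer = K ** (1 / 3)  # three LipschitzLinear layers, so their constants multiply to K
                          # (GroupSort is 1-Lipschitz and does not change the product)

layers = nn.Sequential(
    lmn.LipschitzLinear(2, 32, kind="one-inf", lipschitz_const=per_layer),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 32, kind="inf", lipschitz_const=per_layer),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 1, kind="inf", lipschitz_const=per_layer),
)
# Pass the overall constant to the wrapper so its skip connection is scaled consistently.
model = lmn.MonotonicWrapper(layers, monotonic_constraints=[1, 0], lipschitz_const=K)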
Hi! Thanks! So, how does the Lipschitz constant of the wrapper interact with that of the intermediate layers?
Here's an example fitting $10 x_0 + x_1$ (which I believe has a Lipschitz constant larger than 1, though I may be wrong given that I only recently learned about Lipschitz continuity):
import torch
import torch.nn as nn
import torch.nn.functional as F
import monotonicnetworks as lmn


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layers = nn.Sequential()
        self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32))
        self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32))
        self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1))
        # Monotonically increasing in the first input, unconstrained in the second
        self.monotonic_layers = lmn.MonotonicWrapper(self.layers, monotonic_constraints=[1, 0])

    def forward(self, x):
        out = self.monotonic_layers(x)
        return out


def real_function(x):
    return 10 * x[0] + x[1]


inputs = torch.randn(1000, 2)
labels = real_function(inputs.T).unsqueeze(1)
# Add some noise to the labels
labels += torch.randn_like(labels) * 0.1

model = SimpleModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(1000):
    optimizer.zero_grad()
    output = model(inputs)
    loss = F.mse_loss(output, labels)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(loss.item())

# Plot predictions vs the real function for x[1] = 0 and a range of x[0] values
import matplotlib.pyplot as plt

x = torch.randn(1000, 2)
x[:, 1] = 0
# Sort by x[0] so the curves plot cleanly
x = x[x[:, 0].argsort()]
y = real_function(x.T).unsqueeze(1)
ypred = model(x)
plt.plot(x[:, 0].detach().numpy(), y.detach().numpy(), label="Real function")
plt.plot(x[:, 0].detach().numpy(), ypred.detach().numpy(), label="Predicted function")
plt.legend()
plt.show()
You forgot the kind="one-inf" and kind="inf" arguments here:
self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf"))
self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf"))
self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf"))
Maybe we should change the default to always be "one"; then not specifying it is at least reasonable.
EDIT: changed the default to "one" everywhere, which is an acceptable strategy and works with the monotonicity.
Of course! Thanks! Same thing I forgot in the original question. All clear now. Just one final question, if that's OK: to make a model with a Lipschitz constant larger than 1 (is keeping it at 1 too restrictive?), do I need to set the Lipschitz constant of each layer to (overall_lipschitz_constant)**(1/n_layers) and then set the lipschitz_const of the MonotonicWrapper to overall_lipschitz_constant?
I guess I'm still a bit confused, because even with the "one-inf" and "inf" arguments, like here:
self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf", lipschitz_const=1))
self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf", lipschitz_const=1))
self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf", lipschitz_const=1))
self.monotonic_layers = lmn.MonotonicWrapper(self.layers, monotonic_constraints=[1, 0], lipschitz_const=1)
the model still seems to have a derivative larger than 1?
Yes, you are correct. Here we have a choice between conventions. The Lipschitz constraint makes the model (without the monotonic wrapper) have Lipschitz constant 1, i.e. df/dx_i in [-1, 1] for each i. The skip connection then "moves this up" to make it monotonic, i.e. the derivative with respect to a constrained input now lies in [0, 2]. If you look into the code, a MonotonicLayer divides the outputs by two to get back to [0, 1], but the MonotonicWrapper doesn't. So the maximum slope you should see in your code is 2.
If one divided by two, the derivatives with respect to inputs with monotonic constraint 0 would be confined to [-1/2, 1/2], which we didn't want. Which convention do you find more intuitive?
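A quick numerical sanity check of that convention (just a sketch using autograd; with the default lipschitz_const of 1 and monotonic_constraints=[1, 0], the derivative with respect to the constrained input should land in roughly [0, 2] and the other in [-1, 1]):

import torch
import torch.nn as nn
import monotonicnetworks as lmn

layers = nn.Sequential(
    lmn.LipschitzLinear(2, 32, kind="one-inf"),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 32, kind="inf"),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 1, kind="inf"),
)
model = lmn.MonotonicWrapper(layers, monotonic_constraints=[1, 0])

x = torch.randn(5000, 2, requires_grad=True)
grads, = torch.autograd.grad(model(x).sum(), x)
print("df/dx_0 range:", grads[:, 0].min().item(), grads[:, 0].max().item())  # expected within [0, 2]
print("df/dx_1 range:", grads[:, 1].min().item(), grads[:, 1].max().item())  # expected within [-1, 1]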
Ah! I think the MonotonicWrapper way is more intuitive, now that you've explained how it works. After all, it follows from equations (5) and (6) in the paper, but I failed to notice that! I think that answers all my questions, so I'll close the issue. Thanks a lot!
Hi! This may be a stupid question, owing to the fact that I just started using this package and there are still a couple of concepts from the paper I have to get clear in my head. I'm trying to use monotonic networks to model cumulative hazard functions, which need to be monotonically increasing (and 0 when a particular input is 0, but that's not relevant here). However, when experimenting with some dummy data and this package to make sure I understand how it works, I've become a bit confused.
As I understand it, one can define an MLP using the LipschitzLinear layers and GroupSort (instead of activation functions like relu), and then use the MonotonicWrapper to impose monotonicity constraints. Below I've included a minimal example of how I think this should be done: I'm defining a model that should be monotonically increasing in the first input and unconstrained in the second. I've also deliberately used a function that is not monotonic in any input, to test that the model can indeed only represent monotonic functions. My expectation was that training this model would fail, but I get a result that is non-monotonic and fits the true labels. I've included a plot of the predictions when the second input is 0, for a range of values of the first input. I imagine this is not a problem with the package but rather me misunderstanding how to use it properly. Any help is appreciated!