niklasnolte / MonotonicNetworks

MIT License

MonotonicWrapper ignoring monotonic_constraints? #8

Closed JulioHC00 closed 7 months ago

JulioHC00 commented 7 months ago

Hi! This may be a stupid question, since I've only just started using this package and there are still a couple of concepts from the paper I need to make clear in my head. I'm trying to use monotonic networks to model cumulative hazard functions, which need to be monotonically increasing (and 0 when a particular input is 0, but that's not relevant here). However, while experimenting with some dummy data to make sure I understand how the package works, I've become a bit confused.

As I understand it, one defines an MLP using LipschitzLinear layers and GroupSort (instead of activation functions like ReLU), and then uses the MonotonicWrapper to impose the monotonic constraints. Below I've included a minimal example of how I think this should be done: I define a model that should be monotonically increasing in the first input and unconstrained in the second. I've deliberately used a target function that is not monotonic in either input, to test that the model can indeed only represent monotonic functions. My expectation was that training would fail to fit it, but I get a result that is non-monotonic and fits the true labels. I've included a plot of the predictions for the second input fixed at 0 and a range of values of the first input. I imagine this is not a problem with the package but me misunderstanding how to use it properly. Any help is appreciated!


import torch
import torch.nn as nn
import torch.nn.functional as F
import monotonicnetworks as lmn

class SimpleModel(nn.Module):

    def __init__(self):
        super(SimpleModel, self).__init__()

        self.layers = nn.Sequential()

        self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32))
        self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32))
        self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1))

        self.monotonic_layers = lmn.MonotonicWrapper(self.layers, monotonic_constraints=[1, 0])

    def forward(self, x):
        out = self.monotonic_layers(x)
        return out

def real_function(x):
    return torch.exp(-x[0] ** 2) + x[1] ** 2

inputs = torch.randn(1000, 2)

labels = real_function(inputs.T).unsqueeze(1)

# Add some noise to labels
labels += torch.randn_like(labels) * 0.1

model = SimpleModel()

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(1000):
    optimizer.zero_grad()
    output = model(inputs)
    loss = F.mse_loss(output, labels)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(loss.item())

# Plot predictions vs real function for a value of x[1] of 0 and a range of x[0] values
import matplotlib.pyplot as plt

x = torch.randn(1000, 2)
x[:, 1] = 0

# Sort x values
x = x[x[:, 0].argsort()]

y = real_function(x.T).unsqueeze(1)
ypred = model(x)

plt.plot(x[:, 0].detach().numpy(), y.detach().numpy(), label="Real function")
plt.plot(x[:, 0].detach().numpy(), ypred.detach().numpy(), label="Predicted function")
plt.legend()

(attached plot: model predictions vs. the real function for the second input fixed at 0)

niklasnolte commented 7 months ago

one problem that i can see is that LipschitzLinear also needs to be given the kind= argument correctly. There are two options: either use kind="one" in all layers, or use kind="one-inf" in the first layer and kind="inf" in all others. The second option is preferred because it comes with theoretical guarantees.

    self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf"))
    self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
    self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf"))
    self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
    self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf"))

here is the plot i get: (attached image)

JulioHC00 commented 7 months ago

Ah! That does solve it. Thanks for the fast answer! I think this is probably the part of the paper where I need to iron out my understanding. For now, I'll just use it as you said. If I may ask, would you recommend using this architecture for modelling a cumulative hazard function, given it must be monotonic with respect to one input and be 0 when that input is 0?

niklasnolte commented 7 months ago

yeah, you came to the right place I think. to ensure that your output is 0 at 0, you can either constrain your network to have no biases (but that is too restrictive imo) or make your output always be (model(x) - model(0)).
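A minimal sketch of the second option (not code from this thread, just one way to read "model(x) - model(0)", assuming the first input is the one that must be monotonic and give 0 at 0; the name ShiftedModel is made up here):

# assumes: import torch.nn as nn
class ShiftedModel(nn.Module):
    """Hypothetical wrapper: subtract a baseline prediction so the output is 0
    whenever the first (monotonic) input is 0, with the other inputs unchanged."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def forward(self, x):
        x0 = x.clone()
        x0[:, 0] = 0.0  # zero out only the constrained input
        return self.base_model(x) - self.base_model(x0)

Since base_model(x0) does not depend on the first input, the difference remains monotonically increasing in it, and it is exactly 0 when that input is 0.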

please close the issue if you're happy with the answer.

JulioHC00 commented 7 months ago

Thanks! Great ideas!

JulioHC00 commented 7 months ago

Hi! Sorry for reopening, but I wanted to make sure I understand why we use "one-inf" in the first layer and "inf" in the others. Is this equivalent to applying the normalization of equation (11) in the paper? And does this kind of normalization give the network more expressiveness?

On a different note, I was wondering how the lipschitz_const value of the MonotonicWrapper, or of the individual LipschitzLinear layers, affects the model. If it's set to 1, does that mean the maximum gradient the model can have with respect to a particular input is 1? At first I thought it would, but I'm able to fit functions with larger gradients.

Sorry if some of these questions are too simplistic! This is a very interesting idea that's presented in the paper, and I want to make sure I understand it correctly along with its limitations before using it.

niklasnolte commented 7 months ago

yes, this is eq 11.

this normalization makes sure that $\partial f(x) / \partial x_i \leq$ lipschitz_const for every input $x_i$, as you correctly assume.

can you give me an example of fitting a function with $\partial f / \partial x_i > 1$?
note that the total lipschitz constant is the product of the individual lipschitz constants, so if you use a lipschitz_const != 1 in the layers, take that into account when setting the lipschitz_const of the monotonic wrapper.
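For example (just a sketch following the product rule above, not code from this thread): to target an overall Lipschitz constant of 8 with three LipschitzLinear layers, each layer could be given lipschitz_const=2 (since 2 * 2 * 2 = 8) and the wrapper would then be told lipschitz_const=8:

layers = nn.Sequential(
    lmn.LipschitzLinear(2, 32, kind="one-inf", lipschitz_const=2),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 32, kind="inf", lipschitz_const=2),
    lmn.GroupSort(2),
    lmn.LipschitzLinear(32, 1, kind="inf", lipschitz_const=2),
)
model = lmn.MonotonicWrapper(layers, monotonic_constraints=[1, 0], lipschitz_const=8)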

JulioHC00 commented 7 months ago

Hi! Thanks! So, how does the Lipschitz constant of the wrapper interact with that of the intermediate layers?

Here's an example fitting $10 x_0 + x_1$ (which I believe has a Lipschitz constant larger than 1, but I may be wrong given I only recently learned about Lipschitz continuity):

import torch
import torch.nn as nn
import torch.nn.functional as F
import monotonicnetworks as lmn

class SimpleModel(nn.Module):

    def __init__(self):
        super(SimpleModel, self).__init__()

        self.layers = nn.Sequential()

        self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32))
        self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32))
        self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1))

        self.monotonic_layers = lmn.MonotonicWrapper(self.layers, monotonic_constraints=[1, 0])

    def forward(self, x):
        out = self.monotonic_layers(x)
        return out

def real_function(x):
    return 10 * x[0] + x[1]

inputs = torch.randn(1000, 2)

labels = real_function(inputs.T).unsqueeze(1)

# Add some noise to labels
labels += torch.randn_like(labels) * 0.1

model = SimpleModel()

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(1000):
    optimizer.zero_grad()
    output = model(inputs)
    loss = F.mse_loss(output, labels)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(loss.item())

# Plot predictions vs real function for a value of x[1] of 0 and a range of x[0] values
import matplotlib.pyplot as plt

x = torch.randn(1000, 2)
x[:, 1] = 0

# Sort x values
x = x[x[:, 0].argsort()]

y = real_function(x.T).unsqueeze(1)
ypred = model(x)

plt.plot(x[:, 0].detach().numpy(), y.detach().numpy(), label="Real function")
plt.plot(x[:, 0].detach().numpy(), ypred.detach().numpy(), label="Predicted function")
plt.legend()

niklasnolte commented 7 months ago

you forgot the kind="inf" and kind="one-inf" here.

    self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf"))
    self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
    self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf"))
    self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
    self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf"))

maybe we should change the default to always be "one"; then not specifying anything is at least a reasonable choice.

EDIT: changed the defaults to "one" everywhere, which is an acceptable strategy and works with the monotonicity.

JulioHC00 commented 7 months ago

Of course, thanks! It's the same thing I forgot in the original question. All clear now. Just one final question, if that's OK: to make a model with a Lipschitz constant larger than 1 (is keeping it at 1 too restrictive?), do I need to set the Lipschitz constant of each layer to (overall_lipschitz_constant)**(1/n_layers) and then set the Lipschitz constant of the monotonic wrapper to overall_lipschitz_constant?

I guess I'm still a bit confused, because even with the "one-inf" and "inf" arguments set, like here:

        self.layers.add_module("input_layer", lmn.LipschitzLinear(2, 32, kind="one-inf", lipschitz_const=1))
        self.layers.add_module(f"group_sort_1", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_1", lmn.LipschitzLinear(32, 32, kind="inf", lipschitz_const=1))
        self.layers.add_module(f"group_sort_2", lmn.GroupSort(2))
        self.layers.add_module("hidden_layer_2", lmn.LipschitzLinear(32, 1, kind="inf", lipschitz_const=1))

        self.monotonic_layers = lmn.MonotonicWrapper(self.layers, monotonic_constraints=[1, 0], lipschitz_const=1)

It seems to still have a derivative larger than 1?

(attached plot: predictions vs. real function)

niklasnolte commented 7 months ago

yes, you are correct. here we have a choice between conventions. the lipschitz constraint makes the model (without the monotonic wrapper) have lipschitz constant 1, i.e. $\partial f / \partial x_i \in [-1, 1]$ for each $i$. the skip connection then "moves this up" so as to be monotonic, i.e. the range becomes $[0, 2]$. If you look into the code, a MonotonicLayer divides the output by two to get back to $[0, 1]$, but the MonotonicWrapper doesn't do that. So the maximum slope you should see in your code is 2.

If one divided by two, the derivatives with respect to inputs that have monotonic constraint = 0 would be confined to $[-1/2, 1/2]$, so we didn't want to do that for the wrapper. Which convention do you find more intuitive?
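Roughly, the arithmetic behind this (a sketch, writing $\lambda$ for lipschitz_const and $c_i \in \{0, 1\}$ for the monotonic_constraints entries) is

$$\tilde{f}(x) = f(x) + \lambda \sum_i c_i x_i, \qquad \frac{\partial \tilde{f}}{\partial x_i} = \frac{\partial f}{\partial x_i} + \lambda c_i \in [0, 2\lambda] \quad \text{for } c_i = 1,$$

while for unconstrained inputs ($c_i = 0$) the derivative stays in $[-\lambda, \lambda]$, which is why the wrapper does not divide by two.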

JulioHC00 commented 7 months ago

Ah! I think the MonotonicWrapper convention is more intuitive now that you've explained how it works. After all, it follows from equations (5) and (6) in the paper, but I failed to notice that! I think that answers all my questions, so I'll close the issue. Thanks a lot!