nlesc-dirac / pytorch

Improved LBFGS and LBFGS-B optimizers in PyTorch.
Apache License 2.0

optimizer loss becomes nan #4

Open MicheleBellomo opened 4 months ago

MicheleBellomo commented 4 months ago

I'm trying to use your implementation to speed up the optimization of a problem that I've already treated using different optimizers and libraries. During the first iteration of LBFGS-B, the losses in the first steps are calculated correctly (and they are correctly decreasing), but then they suddenly become NaN, and the same happens to the parameters being optimized. What can cause this behavior?

SarodYatawatta commented 4 months ago

Is this with batch_mode=True or False? Can you try both?
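
(The flag is set when the optimizer is constructed; a minimal sketch, with argument names as they appear in the tracebacks below and placeholder bounds lb/ub:)

    # illustrative only: lb and ub must match your problem
    optimizer = LBFGSB(params, lower_bound=lb, upper_bound=ub,
                       batch_mode=True)   # and once more with batch_mode=False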

MicheleBellomo commented 4 months ago

The problem described happens with batch_mode=False. Using batch_mode=True the optimization doesn't start, outputting the following error:

in __init__(self, params, lower_bound, upper_bound, max_iter, tolerance_grad, tolerance_change, history_size, batch_mode, cost_use_gradient)
     66                          batch_mode=batch_mode,
     67                          cost_use_gradient=cost_use_gradient)
---> 68         super(LBFGSB, self).__init__(params, defaults)
     69 
     70         if len(self.param_groups) != 1:

/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
    282 
    283         for param_group in param_groups:
--> 284             self.add_param_group(cast(dict, param_group))
    285 
    286         # Allows _cuda_graph_capture_health_check to rig a poor man's TORCH_WARN_ONCE in python,

/usr/local/lib/python3.10/dist-packages/torch/_compile.py in inner(*args, **kwargs)
     20     @functools.wraps(fn)
     21     def inner(*args, **kwargs):
---> 22         import torch._dynamo
     23 
     24         return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/__init__.py in <module>
      1 import torch
----> 2 from . import convert_frame, eval_frame, resume_execution
      3 from .backends.registry import list_backends, lookup_backend, register_backend
      4 from .callback import callback_handler, on_compile_end, on_compile_start
      5 from .code_context import code_context

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in <module>
     38 from torch.utils._traceback import format_traceback_short
     39 
---> 40 from . import config, exc, trace_rules
     41 from .backends.registry import CompilerFn
     42 from .bytecode_analysis import remove_dead_code, remove_pointless_jumps

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/trace_rules.py in <module>
     48 from .utils import getfile, hashable, NP_SUPPORTED_MODULES, unwrap_if_wrapper
     49 
---> 50 from .variables import (
     51     BuiltinVariable,
     52     FunctorchHigherOrderVariable,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/__init__.py in <module>
     83     UntypedStorageVariable,
     84 )
---> 85 from .torch import TorchCtxManagerClassVariable, TorchInGraphFunctionVariable
     86 from .user_defined import (
     87     RemovableHandleVariable,

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/torch.py in <module>
    110     torch.fx._symbolic_trace.is_fx_tracing: False,
    111     torch.onnx.is_in_onnx_export: False,
--> 112     torch._dynamo.external_utils.is_compiling: True,
    113     torch._utils.is_compiling: True,
    114     torch.compiler.is_compiling: True,

AttributeError: partially initialized module 'torch._dynamo' has no attribute 'external_utils' (most likely due to a circular import)

SarodYatawatta commented 4 months ago

Thanks, is it possible to have an example to reproduce this error? Also, note that ''' This optimizer doesn't support per-parameter options and parameter groups (there can be only one) '''. Could this be the issue?

MicheleBellomo commented 4 months ago

I cannot provide code to reproduce the error because it is part of a large library with many modules for training a statistical model. Anyway, the second error seems to be generated simply by the "batch_mode=True" call. As for the parameters, I haven't set any specific settings, but there are multiple parameters to be optimized. Does your implementation allow optimizing only one parameter?

SarodYatawatta commented 4 months ago

Yes, create an empty list of parameters and add what you need to solve to this list:

    params = list()
    params.extend(list(net.parameters()))
    ....
    optimizer = LBFGSB(params, ....)

SarodYatawatta commented 4 months ago

Also, I removed an obsolete file in this directory, which might have caused the circular import.

MicheleBellomo commented 4 months ago

Perhaps I wasn't clear. I have multiple parameters, but they are all contained within the same container (a PyTorch tensor, to be precise). This doesn't seem to cause any issues. The problems arise during the execution of the program after some closure evaluations.
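
Concretely, it is something like this (initial values as in the log I post below; the bound arguments are placeholders):

    θ = torch.tensor([1.0, 1.0, 1.0, 2.0, 0.4], device='cuda', requires_grad=True)
    optimizer = LBFGSB([θ], lower_bound=lb, upper_bound=ub, batch_mode=False)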

SarodYatawatta commented 4 months ago

OK, how many iterations are used before values turn to NaN, and what is the history_size?

MicheleBellomo commented 4 months ago

Here is the log of the parameters, gradients and losses obtained with batch_mode=False. Note that the gradients become NaN well before the parameters do. This is an error arising from your optimizer, since with other libraries such as scipy I didn't have such an issue. I also report the closure function:

    def closure():
        optimizer.zero_grad()
        loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
        loss.backward()
        print("Parametri: ", θ)
        print("Gradiente: ", θ.grad)
        print("Loss: ", loss)
        return loss
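
For completeness, the surrounding training loop is essentially:

    for iteration in range(max_iter):
        print(f'Starting iteration number {iteration+1}')
        loss = optimizer.step(closure)
        print(loss)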

CUDA is available. Running on GPU.
Starting iteration number 1
Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0', requires_grad=True) Gradiente: tensor([ 2954.1091, 917.6224, -725.3356, -1112.6702, 542.8693], device='cuda:0') Loss: tensor([4152.4517], device='cuda:0', grad_fn=)
Parametri: tensor([1.0014e-05, 1.0014e-05, 7.2634e+02, 1.1147e+03, 1.0014e-05], device='cuda:0', requires_grad=True) Gradiente: tensor([-1.1364e+08, 0.0000e+00, nan, nan, 0.0000e+00], device='cuda:0') Loss: tensor([13100.3057], device='cuda:0', grad_fn=)
Parametri: tensor([5.0001e-01, 5.0001e-01, 3.6367e+02, 5.5834e+02, 2.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([1324., 0., nan, nan, 0.], device='cuda:0') Loss: tensor([2588.8015], device='cuda:0', grad_fn=)
Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0', requires_grad=True) Gradiente: tensor([ 2954.1091, 917.6224, -725.3356, -1112.6702, 542.8693], device='cuda:0') Loss: tensor([4152.4517], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([5.0001e-01, 5.0001e-01, 3.6367e+02, 5.5834e+02, 2.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([1324., 0., nan, nan, 0.], device='cuda:0') Loss: tensor([2588.8015], device='cuda:0', grad_fn=)
Parametri: tensor([1.2501e-01, 1.2501e-01, 6.3567e+02, 9.7559e+02, 5.0009e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-5503.3965, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2816.3342], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([1.8751e-01, 1.8751e-01, 5.9034e+02, 9.0604e+02, 7.5008e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-2469.0508, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2579.9775], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.1876e-01, 2.1876e-01, 5.6767e+02, 8.7127e+02, 8.7508e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-1602.1205, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2517.0352], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.3438e-01, 2.3438e-01, 5.5634e+02, 8.5389e+02, 9.3758e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-1255.3254, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2494.7939], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4220e-01, 2.4220e-01, 5.5067e+02, 8.4520e+02, 9.6883e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-1098.6952, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2485.5864], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4610e-01, 2.4610e-01, 5.4784e+02, 8.4085e+02, 9.8445e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-1024.1458, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2481.4492], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4805e-01, 2.4805e-01, 5.4642e+02, 8.3868e+02, 9.9226e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-987.7484, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2479.5110], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4903e-01, 2.4903e-01, 5.4571e+02, 8.3759e+02, 9.9617e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-969.7510, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2478.5229], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4952e-01, 2.4952e-01, 5.4536e+02, 8.3705e+02, 9.9812e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-960.8049, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2478.0564], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4976e-01, 2.4976e-01, 5.4518e+02, 8.3677e+02, 9.9910e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-956.3541, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.8237], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4989e-01, 2.4989e-01, 5.4509e+02, 8.3664e+02, 9.9959e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-954.1311, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.7075], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4995e-01, 2.4995e-01, 5.4505e+02, 8.3657e+02, 9.9983e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-953.0197, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6494], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4998e-01, 2.4998e-01, 5.4502e+02, 8.3654e+02, 9.9995e-02], device='cuda:0', requires_grad=True) Gradiente: tensor([-952.4641, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6204], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.4999e-01, 2.4999e-01, 5.4501e+02, 8.3652e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-952.1862, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6301], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-952.0001, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6238], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4500e+02, 8.3651e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9699, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5859], device='cuda:0', grad_fn=)
Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4500e+02, 8.3650e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9085, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5913], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9922, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5813], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4500e+02, 8.3651e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9699, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5859], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9980, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5784], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0001e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9922, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5813], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9995, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6255], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9980, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.5784], device='cuda:0', grad_fn=)
Parametri: tensor([2.5000e-01, 2.5000e-01, 5.4501e+02, 8.3651e+02, 1.0000e-01], device='cuda:0', requires_grad=True) Gradiente: tensor([-951.9995, 0.0000, nan, nan, 0.0000], device='cuda:0') Loss: tensor([2477.6255], device='cuda:0', grad_fn=)
Parametri: tensor([nan, nan, nan, nan, nan], device='cuda:0', requires_grad=True) Gradiente: tensor([nan, nan, nan, nan, nan], device='cuda:0') Loss: tensor([nan], device='cuda:0', grad_fn=)
Parametri: tensor([nan, nan, nan, nan, nan], device='cuda:0', requires_grad=True) Gradiente: tensor([nan, nan, nan, nan, nan], device='cuda:0') Loss: tensor([nan], device='cuda:0', grad_fn=)
Parametri: tensor([nan, nan, nan, nan, nan], device='cuda:0', requires_grad=True) Gradiente: tensor([nan, nan, nan, nan, nan], device='cuda:0') Loss: tensor([nan], device='cuda:0', grad_fn=)
Parametri: tensor([nan, nan, nan, nan, nan], device='cuda:0', requires_grad=True) Gradiente: tensor([nan, nan, nan, nan, nan], device='cuda:0') Loss: tensor([nan], device='cuda:0', grad_fn=)

SarodYatawatta commented 4 months ago

It seems you are taking a log() somewhere. If the input is ~0, the gradient can be NaN, so this is outside the optimizer, somewhere within your negative_log_likelihood(). Try setting torch.autograd.set_detect_anomaly(True) and see where the invalid calculation happens. Also try gradient clipping, or adding a small value to the input of log() to make it > 0 (you can also try softplus()).
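
A minimal sketch of those suggestions (safe_log and eps are illustrative names, not part of this optimizer; θ is the parameter tensor from the closure posted earlier):

    import torch
    import torch.nn.functional as F

    eps = 1e-12  # keeps the argument of log() strictly positive

    def safe_log(x):
        # clamp the argument away from zero ...
        return torch.log(torch.clamp(x, min=eps))
        # ... or use a smooth reparameterization instead:
        # return torch.log(F.softplus(x) + eps)

    # inside the closure, after loss.backward(), gradient clipping can keep one
    # bad step from poisoning the L-BFGS history:
    # torch.nn.utils.clip_grad_norm_([θ], max_norm=1e4)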

MicheleBellomo commented 4 months ago

Yes, I use a logarithm, but with the constraints imposed by L-BFGS-B there should be no problems. As previously mentioned, I have made several implementations of this training and have never had any issues. For example, I have one that leverages the scipy implementation of L-BFGS-B, to which I pass the exact gradient calculated through automatic differentiation in PyTorch. Obviously, this solution is suboptimal, as I cannot take advantage of parallelization and I have to keep switching between PyTorch tensors and numpy arrays. This is why I need L-BFGS-B in the native PyTorch environment. At the end of the comment, I report the logs of the first iterations with the scipy solution. Before going on, I need to know whether you have thoroughly tested your algorithm and whether you are reasonably sure of its implementation.
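
For comparison, the scipy workaround described above looks roughly like this (a sketch: negative_log_likelihood stands in for the real objective, and the bounds are illustrative):

    import numpy as np
    import torch
    from scipy.optimize import minimize

    def fun_and_grad(x_np):
        # numpy -> PyTorch round trip on every evaluation
        theta = torch.tensor(x_np, dtype=torch.float64, requires_grad=True)
        loss = negative_log_likelihood(theta)   # placeholder for the real NLL
        loss.backward()
        return loss.item(), theta.grad.numpy()

    x0 = np.array([1.0, 1.0, 1.0, 2.0, 0.4])
    bounds = [(1e-5, None)] * len(x0)           # illustrative box constraints
    res = minimize(fun_and_grad, x0, jac=True, method='L-BFGS-B', bounds=bounds)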

Loss: 2060.643481812607 Parameters: tensor([0.1800, 1.0000, 1.0000, 2.0000, 0.4000], dtype=torch.float64, requires_grad=True) Gradient: tensor([1554.5775, 639.9076, -455.3975, -774.7195, 377.7018], dtype=torch.float64)
Loss: 1877.9156827566933 Parameters: tensor([0.1798, 0.9989, 1.5068, 2.8621, 0.3996], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 289.8926, -190.1614, 372.5729, 68.1309, -189.4678], dtype=torch.float64)
Loss: 2069.0895568747283 Parameters: tensor([0.0186, 1.0024, 1.2659, 2.5794, 0.4189], dtype=torch.float64, requires_grad=True) Gradient: tensor([-9632.7549, -265.7956, 383.9284, 283.5783, -194.9197], dtype=torch.float64)
Loss: 1841.669200119487 Parameters: tensor([0.1203, 1.0002, 1.4178, 2.7577, 0.4067], dtype=torch.float64, requires_grad=True) Gradient: tensor([-605.2114, -195.3442, 386.1941, 70.2817, -185.4791], dtype=torch.float64)
Loss: 1793.1294510600121 Parameters: tensor([0.1091, 1.0019, 1.2501, 2.5340, 0.4153], dtype=torch.float64, requires_grad=True) Gradient: tensor([-370.7308, -43.2679, 203.7400, -64.2272, -74.5376], dtype=torch.float64)
Loss: 1772.939724053375 Parameters: tensor([0.1229, 1.0060, 1.1699, 2.5699, 0.4397], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 6.4114, 30.0546, 119.5214, -123.5402, -17.4146], dtype=torch.float64)
Loss: 1743.154892900387 Parameters: tensor([0.1501, 1.0175, 1.0074, 2.8174, 0.5124], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 476.3682, 172.3165, -137.5379, -180.5257, 117.7273], dtype=torch.float64)
Loss: 1717.680596618049 Parameters: tensor([0.1731, 1.0219, 1.0154, 3.0818, 0.5458], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 661.7246, 114.0387, -55.1897, -138.1382, 83.7804], dtype=torch.float64)
Loss: 1669.908914956494 Parameters: tensor([0.1822, 1.0322, 0.9550, 3.6707, 0.6240], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 647.9612, 126.3050, -181.1554, -102.1170, 134.4563], dtype=torch.float64)
Loss: 1616.7065713515212 Parameters: tensor([0.1641, 1.0895, 1.0206, 4.6823, 0.6711], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 46.6102, -70.7506, 447.9215, -22.0836, -29.7677], dtype=torch.float64)
Loss: 1594.680298256697 Parameters: tensor([0.1540, 1.1410, 0.8907, 5.9716, 0.7815], dtype=torch.float64, requires_grad=True) Gradient: tensor([ -126.3959, 196.6331, -1121.6874, -38.7147, 330.9636], dtype=torch.float64)
Loss: 1584.43241800054 Parameters: tensor([0.1601, 1.1099, 0.9692, 5.1924, 0.7148], dtype=torch.float64, requires_grad=True) Gradient: tensor([-32.1064, -8.6910, 202.2057, -32.2393, 54.9032], dtype=torch.float64)
Loss: 1695.4965779039776 Parameters: tensor([0.1632, 1.1953, 0.8525, 7.2408, 0.8511], dtype=torch.float64, requires_grad=True) Gradient: tensor([ 25.5735, 437.2981, -3735.0686, -9.3982, 708.6063], dtype=torch.float64)
Loss: 1566.696388480875

SarodYatawatta commented 4 months ago

Is the parameter dtype of your LBFGS-B run also torch.float64? It seems the parameters are float32 (CUDA).

MicheleBellomo commented 4 months ago

I switched to float64 and I obtain exactly the same problem.

CUDA is available. Running on GPU.
Starting iteration number 1
torch.float64 Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0', dtype=torch.float64, requires_grad=True) Gradiente: tensor([ 2954.2419, 917.4897, -725.0009, -1112.7319, 542.9047], device='cuda:0', dtype=torch.float64) Loss: tensor([4152.0887], device='cuda:0', dtype=torch.float64, grad_fn=)
torch.float64 Parametri: tensor([1.0000e-05, 1.0000e-05, 7.2600e+02, 1.1147e+03, 1.0000e-05], device='cuda:0', dtype=torch.float64, requires_grad=True) Gradiente: tensor([-1.1380e+08, 0.0000e+00, nan, nan, 0.0000e+00], device='cuda:0', dtype=torch.float64) Loss: tensor([13101.7452], device='cuda:0', dtype=torch.float64, grad_fn=)
torch.float64 Parametri: tensor([5.0001e-01, 5.0001e-01, 3.6350e+02, 5.5837e+02, 2.0001e-01], device='cuda:0', dtype=torch.float64, requires_grad=True) Gradiente: tensor([1324.0228, 0.0000, nan, nan, 0.0000], device='cuda:0', dtype=torch.float64) Loss: tensor([2588.8081], device='cuda:0', dtype=torch.float64, grad_fn=)
torch.float64 Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0', dtype=torch.float64, requires_grad=True) Gradiente: tensor([ 2954.2419, 917.4897, -725.0009, -1112.7319, 542.9047], device='cuda:0', dtype=torch.float64) Loss: tensor([4152.0887], device='cuda:0', dtype=torch.float64, grad_fn=)
torch.float64 Parametri: tensor([2.5001e-01, 2.5001e-01, 5.4475e+02, 8.3655e+02, 1.0001e-01], device='cuda:0', dtype=torch.float64, requires_grad=True) Gradiente: tensor([-951.8634, 0.0000, nan, nan, 0.0000], device='cuda:0', dtype=torch.float64) Loss: tensor([2477.5958], device='cuda:0', dtype=torch.float64, grad_fn=)

SarodYatawatta commented 4 months ago

OK. Can you re-run this with torch.autograd.set_detect_anomaly(True) enabled?
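
(For reference, anomaly detection only needs to be switched on once, before the training loop; model.train(...) below is just a placeholder for your entry point:)

    import torch

    torch.autograd.set_detect_anomaly(True)  # every backward() now reports the op producing NaN/Inf
    # ... build the model and optimizer as before, then run the usual loop, e.g.
    # model.train(T, F_T, max_iter, tol)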

MicheleBellomo commented 4 months ago

[<ipython-input-33-c97a27ea6b0a>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
    109         for iteration in range(max_iter): #tqdm(range(max_iter)):
    110             print (f'Starting iteration number {iteration+1}')
--> 111             loss=optimizer.step(closure)
    112             print(loss)
    113 

[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    389                             )
    390 
--> 391                 out = func(*args, **kwargs)
    392                 self._optimizer_step_code()
    393 

[<ipython-input-2-b8d8d7448b6c>](https://localhost:8080/#) in step(self, closure)
    544             if (line_search_flag):
    545                 if not batch_mode:
--> 546                   alpha=self._strong_wolfe(closure,f,g,p)
    547                 else:
    548                   if not cost_use_gradient:

[<ipython-input-2-b8d8d7448b6c>](https://localhost:8080/#) in _strong_wolfe(self, closure, f0, g0, p)
    417             self._copy_params_in(x0)
    418             self._add_grad(alpha_i,p)
--> 419             f_i=float(closure())
    420             g_i=self._gather_flat_grad()
    421             if (f_i>f0+c1*dphi0) or ((i>0) and (f_i>f_im1)):

[<ipython-input-33-c97a27ea6b0a>](https://localhost:8080/#) in closure()
     96             optimizer.zero_grad()
     97             loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98             loss.backward()
     99             print(θ.dtype)
    100             print("Parametri: ", θ)

[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
    523                 inputs=inputs,
    524             )
--> 525         torch.autograd.backward(
    526             self, gradient, retain_graph, create_graph, inputs=inputs
    527         )

[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    265     # some Python versions print out the first line of a multi-line function
    266     # calls in the traceback and some print out the last line
--> 267     _engine_run_backward(
    268         tensors,
    269         grad_tensors_,

[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
    742         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743     try:
--> 744         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745             t_outputs, *args, **kwargs
    746         )  # Calls into the C++ engine to run the backward pass

RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.
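
(For what it's worth, PowBackward NaNs can appear even when the forward value is finite; a tiny self-contained illustration, unrelated to the model above:)

    import torch

    base = torch.tensor([0.0], dtype=torch.float64, requires_grad=True)
    expo = torch.tensor([0.5], dtype=torch.float64, requires_grad=True)

    loss = (base ** expo).sum()   # forward value is finite: 0.0 ** 0.5 == 0.0
    loss.backward()

    print(base.grad)  # tensor([inf]): d/d(base) = expo * base**(expo - 1) = 0.5 * 0**(-0.5)
    print(expo.grad)  # tensor([nan]): d/d(expo) = base**expo * log(base) = 0 * (-inf)
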
SarodYatawatta commented 4 months ago

It seems to be related to the update of the parameters. I will look into this and get back to you.

MicheleBellomo commented 4 months ago

Do you have any updates? In general, do you have serious intentions to develop this feature so that it can be introduced in PyTorch? It would be very important for me because I am developing an entire library to fit statistical models based on this feature. If necessary, I am willing to collaborate to help, even though optimization is not my main field of research.

SarodYatawatta commented 4 months ago

I have made some progress. It is not related to the LBFGS-B algorithm itself, but to the way the parameters are updated and the gradient is calculated, which I am not doing in the optimal way. I have not come across your problem in any of the tests I have run, so I am working on a major overhaul of this part of the code; it will appear on a branch later this week. If you can set up a smaller test case, that would be great.

SarodYatawatta commented 4 months ago

Hi, I have added a branch 'linesearch_upgrade'. Can you test your problem with the new version of the solver?

MicheleBellomo commented 4 months ago

Running with batch_mode=True I obtain this error:

7 frames
[<ipython-input-17-a19d033719e2>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
    109         for iteration in range(max_iter): #tqdm(range(max_iter)):
    110             print (f'Starting iteration number {iteration+1}')
--> 111             loss=optimizer.step(closure)
    112             print(loss)
    113 

[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    389                             )
    390 
--> 391                 out = func(*args, **kwargs)
    392                 self._optimizer_step_code()
    393 

[<ipython-input-3-137833124ec4>](https://localhost:8080/#) in step(self, closure)
    546                   if not cost_use_gradient:
    547                         torch.set_grad_enabled(False)
--> 548                   alpha=self._linesearch_backtrack(closure,f,g,p,self.alphabar)
    549                   if not cost_use_gradient:
    550                         torch.set_grad_enabled(True)

[<ipython-input-3-137833124ec4>](https://localhost:8080/#) in _linesearch_backtrack(self, closure, f_old, gk, pk, alphabar)
    370         xk=[x.clone() for x in x0list]
    371         self._add_grad(alphak,pk)
--> 372         f_new=float(closure())
    373         s=gk
    374         prodterm=c1*s.dot(pk)

[<ipython-input-17-a19d033719e2>](https://localhost:8080/#) in closure()
     96             optimizer.zero_grad()
     97             loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98             loss.backward()
     99             print(θ.dtype)
    100             print("Parametri: ", θ)

[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
    523                 inputs=inputs,
    524             )
--> 525         torch.autograd.backward(
    526             self, gradient, retain_graph, create_graph, inputs=inputs
    527         )

[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    265     # some Python versions print out the first line of a multi-line function
    266     # calls in the traceback and some print out the last line
--> 267     _engine_run_backward(
    268         tensors,
    269         grad_tensors_,

[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
    742         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743     try:
--> 744         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745             t_outputs, *args, **kwargs
    746         )  # Calls into the C++ engine to run the backward pass

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

while running with batch_mode=False I still obtain:

7 frames
[<ipython-input-3-c97a27ea6b0a>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
    109         for iteration in range(max_iter): #tqdm(range(max_iter)):
    110             print (f'Starting iteration number {iteration+1}')
--> 111             loss=optimizer.step(closure)
    112             print(loss)
    113 

[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    389                             )
    390 
--> 391                 out = func(*args, **kwargs)
    392                 self._optimizer_step_code()
    393 

[<ipython-input-2-137833124ec4>](https://localhost:8080/#) in step(self, closure)
    542             if (line_search_flag):
    543                 if not batch_mode:
--> 544                   alpha=self._strong_wolfe(closure,f,g,p)
    545                 else:
    546                   if not cost_use_gradient:

[<ipython-input-2-137833124ec4>](https://localhost:8080/#) in _strong_wolfe(self, closure, f0, g0, p)
    415             self._copy_params_in(x0)
    416             self._add_grad(alpha_i,p)
--> 417             f_i=float(closure())
    418             if (f_i>f0+c1*dphi0) or ((i>1) and (f_i>f_im1)):
    419                 alpha=self._alpha_zoom(closure,x0,f0,g0,p,alpha_im1,alpha_i)

[<ipython-input-3-c97a27ea6b0a>](https://localhost:8080/#) in closure()
     96             optimizer.zero_grad()
     97             loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98             loss.backward()
     99             print(θ.dtype)
    100             print("Parametri: ", θ)

[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
    523                 inputs=inputs,
    524             )
--> 525         torch.autograd.backward(
    526             self, gradient, retain_graph, create_graph, inputs=inputs
    527         )

[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    265     # some Python versions print out the first line of a multi-line function
    266     # calls in the traceback and some print out the last line
--> 267     _engine_run_backward(
    268         tensors,
    269         grad_tensors_,

[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
    742         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    743     try:
--> 744         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    745             t_outputs, *args, **kwargs
    746         )  # Calls into the C++ engine to run the backward pass

RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.

SarodYatawatta commented 4 months ago

Did you pass cost_use_gradient=True to LBFGSB creation?
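
(In the new line-search code pasted above, gradients are disabled around the backtracking search unless this flag is set, which is why the closure's loss.backward() then fails. Sketch of the constructor call, with the other arguments elided:)

    optimizer = LBFGSB(params, lower_bound=lb, upper_bound=ub,
                       batch_mode=True,
                       cost_use_gradient=True)  # the closure calls backward(), so it needs gradients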

SarodYatawatta commented 3 months ago

Can you give an update on whether cost_use_gradient=True fixed the issue?

MicheleBellomo commented 3 months ago

Using cost_use_gradient=True I now obtain the same gradient-related error (RuntimeError: Function 'PowBackward1' returned nan values in its 0th output) in both batch_mode=True and batch_mode=False. So yes, it seems to fix "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn", but the original problem of the NaN gradient is still there.

SarodYatawatta commented 3 months ago

OK, good to know. If you can provide me with an example to reproduce the error, that would be great.

MicheleBellomo commented 3 months ago

This week and next I'm very busy with some conferences. I will put together an ad hoc example in two weeks.
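
A skeleton for such a reproducer might look like the sketch below. Everything in it is illustrative: the import path, the bounds, and the stand-in negative log-likelihood (which just contains a log() term) are assumptions; only the constructor argument names and the closure pattern are taken from this thread.

    import torch
    from lbfgsb import LBFGSB   # placeholder import; use the LBFGSB class from this repo

    torch.autograd.set_detect_anomaly(True)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # five parameters, as in the logs above
    θ = torch.tensor([1.0, 1.0, 1.0, 2.0, 0.4], dtype=torch.float64,
                     device=device, requires_grad=True)
    lb = torch.full_like(θ, 1e-5)   # illustrative lower bound
    ub = torch.full_like(θ, 1e4)    # illustrative upper bound

    data = torch.rand(1000, dtype=torch.float64, device=device) + 0.1

    def negative_log_likelihood(θ):
        # stand-in objective with a log() term; not the real model
        rate = θ[0] + θ[1] * data
        return -(torch.log(rate) - rate).sum().reshape(1)

    optimizer = LBFGSB([θ], lower_bound=lb, upper_bound=ub,
                       batch_mode=False, cost_use_gradient=True)

    def closure():
        optimizer.zero_grad()
        loss = negative_log_likelihood(θ)
        loss.backward()
        return loss

    for iteration in range(20):
        loss = optimizer.step(closure)
        print(iteration, float(loss))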