Open MicheleBellomo opened 4 months ago
Is this with batch_mode=True or False? Can you try both?
The problem described happens with batch_mode=False.
With batch_mode=True the optimization doesn't start at all; it fails with the following error:
/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
    282
    283         for param_group in param_groups:
--> 284             self.add_param_group(cast(dict, param_group))
    285
    286         # Allows _cuda_graph_capture_health_check to rig a poor man's TORCH_WARN_ONCE in python,

/usr/local/lib/python3.10/dist-packages/torch/_compile.py in inner(*args, **kwargs)
     20     @functools.wraps(fn)
     21     def inner(*args, **kwargs):
---> 22         import torch._dynamo
     23
     24         return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/__init__.py in <module>
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py in <module>
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/trace_rules.py in <module>
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/__init__.py in <module>
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/torch.py in <module>
AttributeError: partially initialized module 'torch._dynamo' has no attribute 'external_utils' (most likely due to a circular import)
Thanks, is it possible to have an example to reproduce this error? Also, note the documented restriction: "This optimizer doesn't support per-parameter options and parameter groups (there can be only one)". Could this be the issue?
I cannot provide code to reproduce the error because it is part of a large library with many modules for training a statistical model. In any case, the second error seems to be triggered simply by passing batch_mode=True. As for the parameters, I haven't set any specific options, but there are multiple parameters to be optimized. Does your implementation only allow optimizing a single parameter?
Yes, create an empty list of parameters and add what you need to solve for to this list (a concrete sketch follows after this comment):

params = list()
params.extend(list(net.parameters()))
....
optimizer = LBFGSB(params, ....)
Also, I removed an obsolete file in this directory which might have caused the circular import.
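To make this concrete for your case (a single flat tensor holding all parameters), here is a minimal sketch; the import path and the omitted constructor options are assumptions, not the exact API:

import torch
# from lbfgsb import LBFGSB   # assumed import; use the module name of the solver file in this repo

# one flat tensor holding every value to optimize, as in your closure
theta = torch.tensor([1.0, 1.0, 1.0, 2.0, 0.4], dtype=torch.float64, requires_grad=True)

# one parameter list -> one parameter group, the only group this optimizer supports
params = [theta]
optimizer = LBFGSB(params)   # remaining constructor options omitted here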
Perhaps I wasn't clear. I have multiple parameters, but they are all contained within the same container (a PyTorch tensor, to be precise). This doesn't seem to cause any issues. The problems arise during the execution of the program after some closure evaluations.
OK, how many iterations are used before values turn to NaN, and what is the history_size?
Here is the log of parameters, gradients and losses obtained with batch_mode=False. Note that the gradients become NaN well before the parameters do. This looks like an error arising from your optimizer: with other libraries such as scipy I didn't have this issue. I also report the closure function:
def closure():
optimizer.zero_grad()
loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
loss.backward()
print("Parametri: ", θ)
print("Gradiente: ", θ.grad)
print("Loss: ", loss)
return loss
CUDA is available. Running on GPU.
Starting iteration number 1
Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0',
requires_grad=True)
Gradiente: tensor([ 2954.1091, 917.6224, -725.3356, -1112.6702, 542.8693],
device='cuda:0')
Loss: tensor([4152.4517], device='cuda:0', grad_fn=
It seems you are taking a log() somewhere; if the input is ~0, the gradient can be NaN. So this is outside the optimizer, something within your negative_log_likelihood(). Try setting torch.autograd.set_detect_anomaly(True) to see where the invalid calculation happens. Also try gradient clipping, or adding a small value to the input of log() to make it > 0 (you can also try softplus()).
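For reference, a minimal sketch of these suggestions (the names below are illustrative, not taken from your code):

import torch
import torch.nn.functional as F

torch.autograd.set_detect_anomaly(True)   # reports which backward op first produces NaN

def safe_log(x, eps=1e-12):
    # keep the argument of log() strictly positive
    return torch.log(x.clamp_min(eps))

# alternative: map an unconstrained value to (0, inf) before the log
# y = torch.log(F.softplus(x))

# gradient clipping, applied after loss.backward() and before the optimizer uses the gradient
# torch.nn.utils.clip_grad_norm_([theta], max_norm=1e3)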
Yes, I use a logarithm, but with the constraints imposed by L-BFGS-B there should be no problems. As previously mentioned, I have made several implementations of this training and have never had any issues. For example, I have one that leverages the scipy implementation of L-BFGS-B, to which I pass the exact gradient calculated through automatic differentiation in PyTorch. Obviously, this solution is suboptimal, as I cannot take advantage of parallelization and I have to keep converting between PyTorch tensors and numpy arrays. This is why I need L-BFGS-B natively in the PyTorch environment. At the end of this comment, I report the logs of the first iterations with the scipy solution. Before going on, I need to know whether you have thoroughly tested your algorithm and are reasonably sure of its implementation.
Loss: 2060.643481812607
Parameters: tensor([0.1800, 1.0000, 1.0000, 2.0000, 0.4000], dtype=torch.float64, requires_grad=True)
Gradient: tensor([1554.5775, 639.9076, -455.3975, -774.7195, 377.7018], dtype=torch.float64)
Loss: 1877.9156827566933
Parameters: tensor([0.1798, 0.9989, 1.5068, 2.8621, 0.3996], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 289.8926, -190.1614, 372.5729, 68.1309, -189.4678], dtype=torch.float64)
Loss: 2069.0895568747283
Parameters: tensor([0.0186, 1.0024, 1.2659, 2.5794, 0.4189], dtype=torch.float64, requires_grad=True)
Gradient: tensor([-9632.7549, -265.7956, 383.9284, 283.5783, -194.9197], dtype=torch.float64)
Loss: 1841.669200119487
Parameters: tensor([0.1203, 1.0002, 1.4178, 2.7577, 0.4067], dtype=torch.float64, requires_grad=True)
Gradient: tensor([-605.2114, -195.3442, 386.1941, 70.2817, -185.4791], dtype=torch.float64)
Loss: 1793.1294510600121
Parameters: tensor([0.1091, 1.0019, 1.2501, 2.5340, 0.4153], dtype=torch.float64, requires_grad=True)
Gradient: tensor([-370.7308, -43.2679, 203.7400, -64.2272, -74.5376], dtype=torch.float64)
Loss: 1772.939724053375
Parameters: tensor([0.1229, 1.0060, 1.1699, 2.5699, 0.4397], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 6.4114, 30.0546, 119.5214, -123.5402, -17.4146], dtype=torch.float64)
Loss: 1743.154892900387
Parameters: tensor([0.1501, 1.0175, 1.0074, 2.8174, 0.5124], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 476.3682, 172.3165, -137.5379, -180.5257, 117.7273], dtype=torch.float64)
Loss: 1717.680596618049
Parameters: tensor([0.1731, 1.0219, 1.0154, 3.0818, 0.5458], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 661.7246, 114.0387, -55.1897, -138.1382, 83.7804], dtype=torch.float64)
Loss: 1669.908914956494
Parameters: tensor([0.1822, 1.0322, 0.9550, 3.6707, 0.6240], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 647.9612, 126.3050, -181.1554, -102.1170, 134.4563], dtype=torch.float64)
Loss: 1616.7065713515212
Parameters: tensor([0.1641, 1.0895, 1.0206, 4.6823, 0.6711], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 46.6102, -70.7506, 447.9215, -22.0836, -29.7677], dtype=torch.float64)
Loss: 1594.680298256697
Parameters: tensor([0.1540, 1.1410, 0.8907, 5.9716, 0.7815], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ -126.3959, 196.6331, -1121.6874, -38.7147, 330.9636], dtype=torch.float64)
Loss: 1584.43241800054
Parameters: tensor([0.1601, 1.1099, 0.9692, 5.1924, 0.7148], dtype=torch.float64, requires_grad=True)
Gradient: tensor([-32.1064, -8.6910, 202.2057, -32.2393, 54.9032], dtype=torch.float64)
Loss: 1695.4965779039776
Parameters: tensor([0.1632, 1.1953, 0.8525, 7.2408, 0.8511], dtype=torch.float64, requires_grad=True)
Gradient: tensor([ 25.5735, 437.2981, -3735.0686, -9.3982, 708.6063], dtype=torch.float64)
Loss: 1566.696388480875
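For reference, the scipy-based workflow described above looks roughly like this (a sketch; negative_log_likelihood and the bounds are placeholders for my actual model, not code from this repo):

import numpy as np
import torch
from scipy.optimize import minimize

def objective(x_np):
    # rebuild a torch tensor from the scipy iterate so autograd can supply the exact gradient
    theta = torch.tensor(x_np, dtype=torch.float64, requires_grad=True)
    loss = negative_log_likelihood(theta)   # placeholder for the real model code
    loss.backward()
    return loss.item(), theta.grad.numpy()

x0 = np.array([0.18, 1.0, 1.0, 2.0, 0.4])   # starting point used in the log above
bounds = [(1e-6, None)] * len(x0)           # illustrative box constraints
res = minimize(objective, x0, jac=True, method="L-BFGS-B", bounds=bounds)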
Is the parameter dtype in your LBFGS run also torch.float64? It seems they are float32 (CUDA).
I switched to float64 and I obtain the exact same problem.
CUDA is available. Running on GPU.
Starting iteration number 1
torch.float64
Parametri: tensor([1.0000, 1.0000, 1.0000, 2.0000, 0.4000], device='cuda:0',
dtype=torch.float64, requires_grad=True)
Gradiente: tensor([ 2954.2419, 917.4897, -725.0009, -1112.7319, 542.9047],
device='cuda:0', dtype=torch.float64)
Loss: tensor([4152.0887], device='cuda:0', dtype=torch.float64,
grad_fn=
OK. Can you re-run this with torch.autograd.set_detect_anomaly(True) enabled?
[<ipython-input-33-c97a27ea6b0a>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
109 for iteration in range(max_iter): #tqdm(range(max_iter)):
110 print (f'Starting iteration number {iteration+1}')
--> 111 loss=optimizer.step(closure)
112 print(loss)
113
[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
389 )
390
--> 391 out = func(*args, **kwargs)
392 self._optimizer_step_code()
393
[<ipython-input-2-b8d8d7448b6c>](https://localhost:8080/#) in step(self, closure)
544 if (line_search_flag):
545 if not batch_mode:
--> 546 alpha=self._strong_wolfe(closure,f,g,p)
547 else:
548 if not cost_use_gradient:
[<ipython-input-2-b8d8d7448b6c>](https://localhost:8080/#) in _strong_wolfe(self, closure, f0, g0, p)
417 self._copy_params_in(x0)
418 self._add_grad(alpha_i,p)
--> 419 f_i=float(closure())
420 g_i=self._gather_flat_grad()
421 if (f_i>f0+c1*dphi0) or ((i>0) and (f_i>f_im1)):
[<ipython-input-33-c97a27ea6b0a>](https://localhost:8080/#) in closure()
96 optimizer.zero_grad()
97 loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98 loss.backward()
99 print(θ.dtype)
100 print("Parametri: ", θ)
[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
523 inputs=inputs,
524 )
--> 525 torch.autograd.backward(
526 self, gradient, retain_graph, create_graph, inputs=inputs
527 )
[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
265 # some Python versions print out the first line of a multi-line function
266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
268 tensors,
269 grad_tensors_,
[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
742 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
743 try:
--> 744 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
745 t_outputs, *args, **kwargs
746 ) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.
It seems to be related to the update of parameters, I will look into this and get back to you
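For context, 'PowBackward1' is the backward of pow() with a tensor exponent; its gradient with respect to the base is p * x**(p - 1), so it returns NaN as soon as the base goes negative with a fractional exponent, e.g. when a line-search trial point pushes a quantity outside its valid range. A minimal repro of this exact message, unrelated to your model:

import torch

torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)   # base outside the valid domain
p = torch.tensor([0.5])                        # tensor exponent -> PowBackward1 node
y = (x ** p).sum()                             # forward value is already NaN here
y.backward()   # RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.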
Do you have any updates? In general, do you have serious intentions to develop this feature so that it can be introduced in PyTorch? It would be very important for me because I am developing an entire library to fit statistical models based on this feature. If necessary, I am willing to collaborate to help, even though optimization is not my main field of research.
I have made some progress. It is not related to the L-BFGS-B algorithm itself, but to the way the parameters are updated and the gradient is calculated, which I am not doing in the optimal way. I have not come across your problem in any of the tests I have run, so I am working on a major overhaul of this part of the code; it will appear on a branch later this week. If you can set up a smaller test case, that would be great.
Hi, I have added a branch 'linesearch_upgrade'; can you test your problem with the new version of the solver?
Running with batch_mode=True I obtain this error:
7 frames
[<ipython-input-17-a19d033719e2>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
109 for iteration in range(max_iter): #tqdm(range(max_iter)):
110 print (f'Starting iteration number {iteration+1}')
--> 111 loss=optimizer.step(closure)
112 print(loss)
113
[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
389 )
390
--> 391 out = func(*args, **kwargs)
392 self._optimizer_step_code()
393
[<ipython-input-3-137833124ec4>](https://localhost:8080/#) in step(self, closure)
546 if not cost_use_gradient:
547 torch.set_grad_enabled(False)
--> 548 alpha=self._linesearch_backtrack(closure,f,g,p,self.alphabar)
549 if not cost_use_gradient:
550 torch.set_grad_enabled(True)
[<ipython-input-3-137833124ec4>](https://localhost:8080/#) in _linesearch_backtrack(self, closure, f_old, gk, pk, alphabar)
370 xk=[x.clone() for x in x0list]
371 self._add_grad(alphak,pk)
--> 372 f_new=float(closure())
373 s=gk
374 prodterm=c1*s.dot(pk)
[<ipython-input-17-a19d033719e2>](https://localhost:8080/#) in closure()
96 optimizer.zero_grad()
97 loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98 loss.backward()
99 print(θ.dtype)
100 print("Parametri: ", θ)
[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
523 inputs=inputs,
524 )
--> 525 torch.autograd.backward(
526 self, gradient, retain_graph, create_graph, inputs=inputs
527 )
[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
265 # some Python versions print out the first line of a multi-line function
266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
268 tensors,
269 grad_tensors_,
[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
742 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
743 try:
--> 744 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
745 t_outputs, *args, **kwargs
746 ) # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
while running with batch_mode=False I still obtain:
7 frames
[<ipython-input-3-c97a27ea6b0a>](https://localhost:8080/#) in train(self, T, F_T, max_iter, tol)
109 for iteration in range(max_iter): #tqdm(range(max_iter)):
110 print (f'Starting iteration number {iteration+1}')
--> 111 loss=optimizer.step(closure)
112 print(loss)
113
[/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
389 )
390
--> 391 out = func(*args, **kwargs)
392 self._optimizer_step_code()
393
[<ipython-input-2-137833124ec4>](https://localhost:8080/#) in step(self, closure)
542 if (line_search_flag):
543 if not batch_mode:
--> 544 alpha=self._strong_wolfe(closure,f,g,p)
545 else:
546 if not cost_use_gradient:
[<ipython-input-2-137833124ec4>](https://localhost:8080/#) in _strong_wolfe(self, closure, f0, g0, p)
415 self._copy_params_in(x0)
416 self._add_grad(alpha_i,p)
--> 417 f_i=float(closure())
418 if (f_i>f0+c1*dphi0) or ((i>1) and (f_i>f_im1)):
419 alpha=self._alpha_zoom(closure,x0,f0,g0,p,alpha_im1,alpha_i)
[<ipython-input-3-c97a27ea6b0a>](https://localhost:8080/#) in closure()
96 optimizer.zero_grad()
97 loss = self.negative_log_likelihood(T, F_T, θ, len_θ_mu)
---> 98 loss.backward()
99 print(θ.dtype)
100 print("Parametri: ", θ)
[/usr/local/lib/python3.10/dist-packages/torch/_tensor.py](https://localhost:8080/#) in backward(self, gradient, retain_graph, create_graph, inputs)
523 inputs=inputs,
524 )
--> 525 torch.autograd.backward(
526 self, gradient, retain_graph, create_graph, inputs=inputs
527 )
[/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py](https://localhost:8080/#) in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
265 # some Python versions print out the first line of a multi-line function
266 # calls in the traceback and some print out the last line
--> 267 _engine_run_backward(
268 tensors,
269 grad_tensors_,
[/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py](https://localhost:8080/#) in _engine_run_backward(t_outputs, *args, **kwargs)
742 unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
743 try:
--> 744 return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
745 t_outputs, *args, **kwargs
746 ) # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'PowBackward1' returned nan values in its 0th output.
Did you pass cost_use_gradient=True to LBFGSB creation?
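For context, that "element 0 of tensors does not require grad" error is what happens when the closure calls backward() while grad mode is globally off, which is what the batch-mode path does around the backtracking line search when cost_use_gradient is False (see the torch.set_grad_enabled(False) frame in the batch-mode traceback above). A minimal repro:

import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)

with torch.set_grad_enabled(False):   # grad mode off, as around the batch-mode line search
    loss = (theta ** 2).sum()         # no autograd graph is recorded here
    loss.backward()                   # RuntimeError: element 0 of tensors does not require grad ...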
Can you update if cost_use_gradient=True fixed the issue?
Using cost_use_gradient=True I now obtain the same gradient-related error (RuntimeError: Function 'PowBackward1' returned nan values in its 0th output) both with batch_mode=True and with batch_mode=False. So yes, it seems to fix "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn", but the original problem of the NaN gradient remains.
OK, good to know. If you can provide me with an example to reproduce the error, that would be great.
This week and next I'm very busy with some conferences. I will put together an ad-hoc test case in two weeks.
I'm trying to use your implementation to more quickly optimize a problem that I've already treated using different optimizers and libraries. During the first iteration of LBFGS-B, the losses in the first steps are calculated correctly (and they are correctly decreasing), but then suddenly they become NaN, and the same happens to the parameters being optimized. What can cause this behavior?