Closed alphaville closed 5 years ago
Another related question: I see that in the following function:
fn new_s_and_y_valid(&self, gradient: &[f64]) -> bool {
// TODO: Check if EPSILON should be changed
const EPSILON: f64 = 1e-12;
let sy = vec_ops::inner_product(&self.s.last().unwrap(), &self.y.last().unwrap())
/ vec_ops::inner_product(&self.y.last().unwrap(), &self.y.last().unwrap());
// TODO: Check the alpha power to be used
let ep = EPSILON * vec_ops::norm2(gradient);
// Condition: (y^T * s) / ||s||^2 > epsilon * ||grad(x)||^alpha
sy > ep && sy.is_finite() && ep.is_finite()
}
I see that the inner products are calculated on self.s.last()
and self.y.last()
, that is, on s
and y
variables which are already in the buffer. However, if the condition of Li and Fukushima is not satisfied, then these s
and y
should not be included in the buffer -- the buffer (that is, self.s
and self.y
) should, instead, not be updated.
Thank you, the first one is fixed with 57a0238 !
On the second one, s.last()
and y.last()
are only temporary storage to calculate s
and y
from the inputs (this is why there is a +1 in the size of s
and y
), hence if the update is not valid they are not included and overwritten with the next input.
Only when they are accepted, then they are rotated to the start of the s
and y
vectors to be used for estimating the hessian, see lines 146 - 152 in lib.rs
:
// Check that the s and y are valid to use
if self.new_s_and_y_valid(g) {
self.old_state.copy_from_slice(state);
self.old_g.copy_from_slice(g);
// Move the new s_0 and y_0 to the front
self.s.rotate_right(1);
self.y.rotate_right(1);
I wrote the following test (I've changed the API a little, but you'll figure out what this does):
#[test]
fn lbfgs_s_and_y() {
let n = 2;
let mem = 5;
let mut lbfgs = Estimator::new(n, mem);
lbfgs.with_alpha(0.0).with_epsilon(0.0);
let mut g0 = [-0.28, 0.60];
let x0 = [-0.65, 1.18];
lbfgs.update_hessian(&g0, &x0);
lbfgs.apply_hessian(&mut g0);
let mut g1 = [1.7800, 1.7700];
let x1 = [-0.7600, -1.1100];
lbfgs.update_hessian(&g1, &x1);
lbfgs.apply_hessian(&mut g1);
assert_array_ae(&lbfgs.s[0], &[-0.1100, -2.2900], 1e-10);
assert_array_ae(&lbfgs.y[0], &[2.0600, 1.1700], 1e-10);
let mut g2 = [-1.8700, -1.0600];
let x2 = [-0.8500, -0.5800];
lbfgs.update_hessian(&g2, &x2);
lbfgs.apply_hessian(&mut g2);
println!("lbfgs = {:#?}", lbfgs);
assert_array_ae(&lbfgs.s[0], &[-0.0900, 0.5300], 1e-10);
assert_array_ae(&lbfgs.y[0], &[-3.6500, -2.8300], 1e-10);
assert_array_ae(&lbfgs.s[1], &[-0.1100, -2.2900], 1e-10);
assert_array_ae(&lbfgs.y[1], &[2.0600, 1.1700], 1e-10);
}
Note: assert_array_ae
is the following function used for testing:
fn assert_array_ae(x: &[f64], y: &[f64], tol: f64) {
x.iter()
.zip(y.iter())
.for_each(|(&xi, &yi)| assert!((xi - yi).abs() < tol));
}
This prints the following:
lbfgs = Estimator {
active_size: 3,
gamma: -0.05491435161311496,
s: [
[
-0.08999999999999997,
0.5300000000000001
],
[
-0.10999999999999999,
-2.29
],
[
-0.65,
1.18
],
[
1.0,
1.0
],
[
1.0,
1.0
],
[
1.0,
1.0
]
],
y: [
[
-3.6500000000000004,
-2.83
],
[
2.06,
1.17
],
[
-0.28,
0.6
],
[
0.0,
0.0
],
[
0.0,
0.0
],
[
0.0,
0.0
]
],
alpha: [
0.3359228273860338,
-0.11053784279012983,
0.3304518607411522,
0.0,
0.0
],
rho: [
-0.8536793580331222,
-0.34412746481296674,
1.1235955056179776,
0.0,
0.0
],
old_state: [
-0.85,
-0.58
],
old_g: [
-1.87,
-1.06
],
cbfgs_alpha: 0.0,
cbfgs_epsilon: 0.0
}
Shouldn't active_size
be equal to 2
? Moreover, s
and y
seem to store a third value - in fact y[2]
seems to be equal to g0
and s[2]
is equal to x0
. Is that necessary?
Thanks for the comments, here are the answers:
active_size
should be 3, when there are 3 active buffers (which there are), so this is fine.
If you look at line 72-73
, active_size
is used non-inclusive.
s
and y
are storing the s
and y
under test by the C-BFGS criteria in the last element, and must be there if it is rejected, else we would overwrite old good gradients with the temporary, sadly it is necessary when having the C-BFGS criteria.
I think we resolved all related issues, so I'll close it.
Regarding function
update_hessian
, the LBFGS buffer should be updated only if(s,y)
satisfies the condition in the paper of Li and Fukushima, otherwise, the pair should be discarded. This means thatold_state
andold_gradient
shouldn't be updated.That said, I think that the following should be moved below inside the if block
Additionally, since LBFGS can be used in applications other than gradient-based optimisation, we could rename
gradient
to something else. We could adopt the notation:s_k = x_{k+1} - x_k
y_k = g_{k+1} - g_k
, whereg_k = g(x_k)
andg
is some functionOften,
g
is the gradient of a cost function.