Closed jolars closed 2 years ago
@jolars I see almost no difference with and without line search.
I would expect LS to make a big difference for PGD. Is that normal?
Got it. In the experiment the true L is 7.82, and with our LS strategy the first L giving a valid step is 8, which we never decrease afterwards. Resetting L to 1 at each iteration is much better:
But 1 could still be larger than the true L, so this is not a robust choice. Let me try a Barzilai-Borwein (BB) step size rule.
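For reference, here is a minimal sketch of a BB-type Lipschitz estimate (the function name and the choice of BB variant are my assumptions, not the project's actual code). With s = x_k − x_{k−1} and y = ∇f(x_k) − ∇f(x_{k−1}), the BB2-type rule estimates 1/L as ⟨s, y⟩/⟨y, y⟩, so L ≈ ⟨y, y⟩/⟨s, y⟩:

```python
import numpy as np

def bb_lipschitz(x, x_prev, grad, grad_prev):
    """Estimate the Lipschitz constant L from consecutive iterates
    using a BB2-type rule: L ~= <y, y> / <s, y>.

    Returns None when the curvature estimate <s, y> is not positive,
    in which case the caller should fall back to a safe default.
    """
    s = x - x_prev          # iterate difference
    y = grad - grad_prev    # gradient difference
    sy = s @ y
    if sy <= 0:
        return None
    return (y @ y) / sy
```

On a quadratic with Hessian L·I this recovers L exactly; on non-quadratic problems the estimate can jump around a lot between iterations.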
Here's the result with
I have added some prints to check what LSBB outputs as a candidate for L. It is weird to me that it varies that much from one iteration to the next. Any idea @jolars?
```
iter 857
BB guess: 1.3306334061460234
L taken: 1.3306334061460234
iter 858
BB guess: 0.01572444409933038
L taken: 0.01572444409933038
iter 859
BB guess: 0.025714101634463915
L taken: 0.10285640653785566
iter 860
BB guess: 3.766654014877878
L taken: 3.766654014877878
Epoch: 861, loss: 19.784095140953134, gap: 1.70e-07
iter 861
BB guess: 3.8571876660201605
L taken: 3.8571876660201605
iter 862
BB guess: 2.5395177773296265
L taken: 2.5395177773296265
```
Right. Also, is it possible that `make_simulated_data()` and `real-sim` just work better because of how the data is simulated?
Anyway, how about something like this:

```python
if it % 10 == 0:
    L *= 0.9
```
I don't know anything about Barzilai-Borwein. Is it robust? I skimmed https://arxiv.org/abs/1910.07056 and it seems like it may need some babysitting?
Intuitively, what I would try to do is:

- start at the previous L (first L: 1);
- if the condition is satisfied, multiply by 0.9 for as long as it stays satisfied;
- otherwise, multiply by 2 until it is satisfied.

What do you think?
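For concreteness, that scheme could be sketched as follows (the `condition` callback is a hypothetical stand-in for whatever sufficient-decrease test we use; this is an illustration, not the PR's actual code — in particular I assume the returned L should still satisfy the condition after the shrinking phase):

```python
def adapt_L(L, condition):
    """Adapt the Lipschitz estimate L, starting from the previous value:
    shrink by 0.9 while the descent condition keeps holding,
    otherwise double until it does hold.

    `condition(L)` is a hypothetical callback returning True when
    the step with constant L passes the sufficient-decrease check.
    """
    if condition(L):
        # Shrink greedily, but stop before the condition would break.
        while condition(0.9 * L):
            L *= 0.9
    else:
        # Grow until the step becomes valid.
        while not condition(L):
            L *= 2.0
    return L
```

For a quadratic, where the condition is simply "L is at least the true Lipschitz constant", this converges to a value within a factor 1/0.9 of the true L.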
I don't know much about BB, but I have heard it is very good in practice. For quadratics I think there are convergence guarantees.
Sure, I like the general idea, but what exactly do you mean by `condition` here?
The one we have now:

```python
if q >= f * (1 - 1e-12):
    break
```
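For context, here is a sketch of a full backtracking PGD step built around that test; `f`, `prox`, and the variable names are assumptions for illustration, with `q` the quadratic model at the candidate point and `f(z)` the actual smooth loss:

```python
import numpy as np

def backtracking_pgd_step(x, f_val, grad, f, prox, L, growth=2.0):
    """One proximal gradient step with backtracking line search.

    Increase L until the quadratic model q upper-bounds the true
    smooth loss f at the candidate point z (up to a tiny relative
    tolerance). `f` and `prox` are stand-ins for the problem's
    smooth-loss and proximal-operator oracles.
    """
    while True:
        z = prox(x - grad / L, 1.0 / L)        # candidate point
        d = z - x
        q = f_val + grad @ d + 0.5 * L * (d @ d)  # quadratic model at z
        if q >= f(z) * (1 - 1e-12):            # the condition above
            return z, L
        L *= growth
```

On a quadratic with true Lipschitz constant 7.82, starting from L = 1 and doubling, this accepts at L = 8, matching what we observed in the experiment above.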
Hm, seems like a clever approach, but I wonder if it's worth computing all these inner products. Let me benchmark a few alternatives.
Hm, I tested some strategies; see the code for details. The results look a little weird, maybe I messed something up.
I've tried to clean this pull request up a bit and make it ready to be merged. I've reduced the vanilla line search strategies to a single one and simplified the interface a bit. I've also added a safeguarding scheme from https://arxiv.org/abs/1910.07056 for the BB step size rule. Here are some results on real data sets:
My opinion is that we only need to include a maximum of three of these for the paper, and I would probably choose FISTA+line search, PGD+LS+BB, and maybe Anderson?
Do we want to bring this in or are we happy as it is? I think maybe we should just leave it be and avoid the additional code bloat.
I agree. Let's just leave it.
This pull request adds a backtracking line search procedure for proximal gradient descent.