scikit-learn-contrib / polylearn

A library for factorization machines and polynomial networks for classification and regression in Python.
http://contrib.scikit-learn.org/polylearn/
BSD 2-Clause "Simplified" License

[WIP] Adagrad solver for arbitrary orders. #8

Open vene opened 8 years ago

vene commented 8 years ago

Implements the SG algorithm from Higher-order Factorization Machines. Mathieu Blondel, Akinori Fujino, Naonori Ueda, Masakazu Ishihata. In Proceedings of Neural Information Processing Systems (NIPS), December 2016.
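
For readers unfamiliar with the optimizer, here is a generic AdaGrad update in plain NumPy. It is only an illustration of the update rule, not this PR's Cython solver, and the names are made up:

```python
import numpy as np

def adagrad_update(param, grad, accum, learning_rate=0.1, eps=1e-6):
    # Accumulate squared gradients, then scale each coordinate's step
    # by the inverse square root of its accumulated magnitude.
    accum += grad ** 2
    param -= learning_rate * grad / (np.sqrt(accum) + eps)
    return param, accum
```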

Todos

EthanRosenthal commented 8 years ago

Is there anything I can do to help complete this PR?

vene commented 8 years ago

Hi @EthanRosenthal

I'm a bit busy these days, but I plan to focus on this project and get a release out by the end of the year.

The current problem with this PR is that the adagrad solver seems to take quite a bit more time per epoch than CD. If you manage to pinpoint the issue, that would be great.

~~(Let me try to push the work-in-progress benchmark code that I have lying around.)~~ edit: it was already up

EthanRosenthal commented 8 years ago

@vene Sounds good - I'm happy to take a look and see if I can improve the performance.

vene commented 7 years ago

Making P and the P-like arrays (especially the gradient output) Fortran-ordered makes this ~1.5x faster, but the coordinate descent solvers still seem faster per iteration. It's also weird that degree 3 is faster than degree 2; I set a very low tolerance to prevent early convergence, but I should check why this happens.

I just realized it might be because the alpha and beta regularization parameters have different meanings for the two solvers unless this is accounted for.
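
For reference, a minimal NumPy sketch of what "Fortran-ordered" means here; the solver itself is Cython, so this is only illustrative:

```python
import numpy as np

n_components, n_features = 10, 1000

# Column-major ('F') layout stores each column contiguously, which helps when
# the inner loop of a solver walks down a column of P or of the gradient.
P_c = np.random.randn(n_components, n_features)         # default C (row-major) order
P_f = np.asfortranarray(P_c)                            # copy in Fortran order
grad = np.zeros((n_components, n_features), order='F')  # allocate directly in F order

print(P_c.flags['C_CONTIGUOUS'], P_f.flags['F_CONTIGUOUS'], grad.flags['F_CONTIGUOUS'])
```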

(with P in C order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3-ada          129.4632s    0.3683s     0.0000     0.9576
fm-2-ada          199.0683s    0.1267s     0.0000     0.9576
fm-2               15.9484s    0.1455s     0.1437     0.7594
polynet-3          15.5964s    0.0836s     0.1624     0.8425
fm-3               21.2889s    0.5154s     0.3695     0.9429
polynet-2          12.9678s    0.0915s     0.4800     0.9620

(with P (and everything else) in F order all the time)
Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-2-ada          138.9727s    0.1163s     0.0000     0.9576
fm-3-ada           72.2692s    0.3488s     0.0000     0.9576
fm-2               16.2693s    0.1453s     0.1437     0.7594
polynet-3          15.6552s    0.0843s     0.1624     0.8425
fm-3               21.4048s    0.5239s     0.3695     0.9429
polynet-2          13.0174s    0.0924s     0.4800     0.9620

mblondel commented 7 years ago

Indeed, the regularization parameters are not the same because of the 1/n factor in front of the loss when using stochastic gradient algorithms.
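
A quick sketch of the rescaling this implies, under the (assumed) convention that the CD solver minimizes an unnormalized sum of losses while the stochastic solver minimizes the mean loss; the exact convention in the code may differ:

```python
# If
#   CD objective:      sum_i loss_i + alpha * ||w||^2 + beta * ||P||^2
#   AdaGrad objective: (1/n) * sum_i loss_i + alpha * ||w||^2 + beta * ||P||^2
# then dividing alpha and beta by n makes the two objectives proportional,
# so the solvers target the same minimizer.
n_samples = 1079            # e.g. the 20 newsgroups subset in the log below
alpha_cd, beta_cd = 1e-3, 1e-3
alpha_ada = alpha_cd / n_samples
beta_ada = beta_cd / n_samples
```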

vene commented 7 years ago

Here's the performance after a bunch of tweaking and making the problem easier.

I'm printing out the norm of P to emphasize the big inconsistency between the solutions, even after setting the regularization terms correctly. This is weird. When P is initialized with standard deviation 0.01, adagrad drives it to zero very quickly (especially with lower learning rates).
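
For context, the printed value is just the overall 2-norm of the learned factor tensor. A sketch of how such a printout could be produced, assuming the fitted estimator exposes the factors as `P_` (toy data instead of the 20 newsgroups subset used below):

```python
import numpy as np
from sklearn.datasets import make_classification
from polylearn import FactorizationMachineClassifier

# Small synthetic stand-in for the benchmark data.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

clf = FactorizationMachineClassifier(degree=2, n_components=5, random_state=0)
clf.fit(X, y)

# np.linalg.norm with no axis argument flattens the factor tensor and returns
# the 2-norm of all entries, i.e. what the log reports as ||P||.
print("||P|| =", np.linalg.norm(clf.P_))
```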


20 newsgroups
=============
X_train.shape = (1079, 130107)
X_train.dtype = float64
X_train density = 0.0013896454072061155
y_train (1079,)
Training class ratio: 0.4448563484708063
X_test (717, 130107)
X_test.dtype = float64
y_test (717,)

Classifier Training
===================
Training fm-2 ... done
||P|| = 20542.6551563
Training fm-2-ada ... done
||P|| = 49.2109170758
Training fm-3 ... done
||P|| = 71852.8549548
Training fm-3-ada ... done
||P|| = 37.0119191191
Training polynet-2 ... done
Training polynet-3 ... done
Classification performance:
===========================

Classifier            train       test         f1   accuracy
------------------------------------------------------------
fm-3                6.5862s    0.0839s     0.4524     0.5105
fm-2                4.5888s    0.0178s     0.4645     0.5690
fm-3-ada           20.1919s    0.0684s     0.4993     0.4965
polynet-3           5.4179s    0.0099s     0.5114     0.5523
polynet-2           4.1679s    0.0099s     0.5449     0.5621
fm-2-ada           18.6872s    0.0126s     0.5698     0.5788

vene commented 7 years ago

The AppVeyor crash is not a test failure; for some reason I get `Command exited with code -1073740940`. No idea why right now...

vene commented 7 years ago

Now supports explicit fitting of lower orders. There is no performance degradation, but the code is a bit unrolled and could be written more clearly.
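
As a usage sketch of what this enables (the `solver='adagrad'` argument is assumed to be what this branch adds; `fit_lower='explicit'` is the existing estimator option):

```python
from sklearn.datasets import make_classification
from polylearn import FactorizationMachineClassifier

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# A degree-3 machine that also fits the lower (degree-2) interactions with
# their own explicit factors; solver='adagrad' is hypothetical here.
clf = FactorizationMachineClassifier(degree=3, fit_lower='explicit',
                                     solver='adagrad', max_iter=20,
                                     random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```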

I'm a bit confused by the way adagrad reacts to the learning rate, especially in the example, and by why P seems to shrink to 0 faster with lower learning rates. But the tests, at least, suggest things are sensible.

vene commented 7 years ago

On second thought, the Windows crash is not a fluke.

The exit code -1073740940 is 0xC0000374 in hex, which apparently means heap corruption (STATUS_HEAP_CORRUPTION).
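
(Quick sanity check of that conversion in a Python shell, reading the exit code as an unsigned 32-bit value:)

```python
>>> hex(-1073740940 & 0xFFFFFFFF)
'0xc0000374'
```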

It seems that my last commit fixed it.