glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

loss grows to infinity with cdfast #289

Closed cxrodgers closed 4 years ago

cxrodgers commented 5 years ago

I have a dataset for which the loss grows without bound with the cdfast solver but converges normally with the batch-gradient solver. Below I've attached the data files needed to reproduce the error, in case someone can figure out the problem. Otherwise, I would really appreciate any suggestions on how to go about debugging this. Thanks!

Here's the problem happening:

In [81]: glm = pyglmnet.GLM(verbose=True, solver='cdfast', max_iter=50, alpha=0)

In [82]: glm.fit(bad_exog, bad_endog)
Lambda: 0.1000
Iteration: 0. Loss: 2.510834
Iteration: 1. Loss: 3.258520
Iteration: 2. Loss: 6.065792
Iteration: 3. Loss: 33.602209
Iteration: 4. Loss: 526.502393
Iteration: 5. Loss: 520065.195320
Iteration: 6. Loss: inf
Iteration: 7. Loss: nan
Iteration: 8. Loss: nan
Iteration: 9. Loss: nan
Iteration: 10. Loss: nan
Iteration: 11. Loss: nan
... [Continues until max iter reached]

Here's how it works with batch-gradient:

In [83]: glm = pyglmnet.GLM(verbose=True, solver='batch-gradient', max_iter=50, alpha=0)

In [84]: glm.fit(bad_exog, bad_endog)
Lambda: 0.1000
Iteration: 0. Loss: 2.527621
Iteration: 1. Loss: 1.852870
Iteration: 2. Loss: 1.111185
Iteration: 3. Loss: 0.518860
Iteration: 4. Loss: 0.196907
Iteration: 5. Loss: 0.071565
Iteration: 6. Loss: 0.033076
Iteration: 7. Loss: 0.022561
Iteration: 8. Loss: 0.018103
Iteration: 9. Loss: 0.015213
Iteration: 10. Loss: 0.012827
Iteration: 11. Loss: 0.010914
Iteration: 12. Loss: 0.009259
Iteration: 13. Loss: 0.007895
Iteration: 14. Loss: 0.006712
Iteration: 15. Loss: 0.005725
Iteration: 16. Loss: 0.004872
Iteration: 17. Loss: 0.004156
Iteration: 18. Loss: 0.003538
Iteration: 19. Loss: 0.003017
Iteration: 20. Loss: 0.002570
Iteration: 21. Loss: 0.002191
Iteration: 22. Loss: 0.001867
Iteration: 23. Loss: 0.001591
Iteration: 24. Loss: 0.001356
Iteration: 25. Loss: 0.001156
    Converged in 26 iterations

Here are the data files bad_endog and bad_exog for reproducing the problem:
https://www.dropbox.com/s/v3llq1umhrfxu49/bad_endog?dl=1
https://www.dropbox.com/s/vf0bhko9auguf7p/bad_exog?dl=1
They are plain text and can be loaded as follows:

bad_endog = np.loadtxt('bad_endog', dtype=int)
bad_exog = np.loadtxt('bad_exog', dtype=float)
jasmainak commented 5 years ago

Did you check if #278 fixes it for you?

cxrodgers commented 5 years ago

Sorry for the delay. No, the problem is the same with #278.

I dug around a little bit more. The problem surfaces in the _cdfast method. The variable z grows larger and larger in magnitude. Eventually the Hessian term hk becomes zero and the update becomes infinite; after that point the weights become infinite or nan. https://github.com/glm-tools/pyglmnet/blob/7f1fbb7feae4cd6f18d3d830a90e4b28f9fbfdaf/pyglmnet/pyglmnet.py#L678
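
To make the failure mode concrete, here is a rough sketch of the per-coordinate Newton step that line implements. This is not pyglmnet's actual code: the gradient and Hessian expressions below assume a plain exponential inverse link, and the function and variable names are only illustrative.

import numpy as np

def newton_coordinate_step(X, y, beta, k, z):
    # Sketch of a coordinate-wise Newton update for a Poisson-type GLM.
    # z is the cached linear predictor X @ beta and k is the coordinate
    # being updated; pyglmnet's own expressions differ, this only shows
    # how the update 1. / hk * gk can blow up.
    xk = X[:, k]
    mu = np.exp(z)                 # inverse link; underflows to 0 for very negative z
    gk = np.sum((mu - y) * xk)     # first derivative of the loss w.r.t. beta[k]
    hk = np.sum(mu * xk ** 2)      # second derivative (the "Hessian" hk above)
    # Once z explodes toward large negative values, mu underflows to 0,
    # so hk becomes 0 and 1. / hk * gk is inf; from then on beta and z
    # stay inf/nan on every sweep, exactly as in the loss trace above.
    update = 1. / hk * gk
    beta[k] = beta[k] - update
    z = z - update * xk            # keep the cached linear predictor in sync
    return beta, z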

I added a debug statement that prints out the first three values in z on each call to cdfast.

Lambda: 0.1000
z: array([0., 0., 0.])
z: array([2.47422425, 2.47422425, 2.47422425])
z: array([-0.75850352, -0.75850352, -0.75850352])
z: array([5.15118176, 5.15118176, 5.15118176])
z: array([-28.1899862, -28.1899862, -28.1899862])
z: array([497.96764796, 497.96764796, 497.96764796])
z: array([-519567.67103074, -519567.67103074, -519567.67103074])
/home/jack/dev/pyglmnet/pyglmnet/pyglmnet.py:679: RuntimeWarning: divide by zero encountered in divide
  update = 1. / hk * gk

It is strange, but I get exactly the same results with #278, even though the calculation of z is different. Any tips on how I might debug this further?

This is a small zip file containing a script to reproduce the problem and the test data. https://www.dropbox.com/s/nvrg59t8acevaza/glm_example.zip?dl=1

jasmainak commented 5 years ago

After 2 hours of debugging, I think I found the problem. This is the script I was using to debug. I re-read the nice tutorial that @pavanramkumar wrote. From that reading, it seems to me that the caching scheme inevitably requires careful handling of the first coordinate. Following this hunch, I modified this line:

https://github.com/glm-tools/pyglmnet/blob/7f1fbb7feae4cd6f18d3d830a90e4b28f9fbfdaf/pyglmnet/pyglmnet.py#L655

to be:

for k in range(1, n_features + 1):

which seemed to (partially) fix the issue. Note that the exploding gradients are a separate issue, related to the eta parameter, which can be tweaked.
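
As a rough illustration of that second point, you could also try a smaller eta when constructing the GLM; if I remember correctly, eta is the threshold above which the exponential inverse link is linearized for the Poisson distribution, so lowering it should damp the growth of z. The value below is only a guess, not a recommended setting.

import numpy as np
import pyglmnet

bad_endog = np.loadtxt('bad_endog', dtype=int)
bad_exog = np.loadtxt('bad_exog', dtype=float)

# eta=1.0 is illustrative only; the default is larger and the right
# value for this dataset would need to be found by experiment.
glm = pyglmnet.GLM(verbose=True, solver='cdfast', max_iter=50,
                   alpha=0, eta=1.0)
glm.fit(bad_exog, bad_endog)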

Based on this information, do you want to propose a proper fix in the form of a pull request? It will require some careful digging around, and unfortunately I don't have that kind of time. However, I would be happy to review the pull request. cc @peterfoley605, you may also be interested in this.

cxrodgers commented 5 years ago

Thanks so much for digging into this, that's awesome!! I will take a look and think about this some more. With the resources you provided I might be able to figure this out.

jasmainak commented 5 years ago

okay great. Do not hesitate to make pull requests so that others in the community may benefit from your fixes instead of having to relearn the same lessons!

pavanramkumar commented 4 years ago

@jasmainak @cxrodgers please see PR #348