Did you check if #278 fixes it for you?
Sorry for the delay. No, the problem is the same with #278.
I dug around a little bit more. The problem surfaces in the `_cdfast` method. The variable `z` grows larger and larger; eventually the Hessian `hk` becomes zero and `update` becomes infinite. After this point the weights become infinite or NaN.
https://github.com/glm-tools/pyglmnet/blob/7f1fbb7feae4cd6f18d3d830a90e4b28f9fbfdaf/pyglmnet/pyglmnet.py#L678
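For context, here is a minimal standalone sketch of the failure mode (this is not pyglmnet's actual code; `gk` and `hk` stand in for the per-coordinate gradient and Hessian). Running it emits the same "divide by zero" RuntimeWarning shown in the debug log below:

```python
import numpy as np

# Toy reproduction: when the Hessian term hk hits zero, the Newton
# coordinate update 1. / hk * gk becomes inf, and every weight it
# touches afterwards turns inf/nan.
gk = np.float64(1.0)                    # per-coordinate gradient
for hk in np.array([1.0, 1e-8, 0.0]):   # Hessian shrinking to zero
    update = 1. / hk * gk               # the update from pyglmnet.py
    print(f"hk={hk:g} -> update={update:g}")
```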
I added a debug statement that prints the first three values of `z` on each call to `_cdfast`:
Lambda: 0.1000
z: array([0., 0., 0.])
z: array([2.47422425, 2.47422425, 2.47422425])
z: array([-0.75850352, -0.75850352, -0.75850352])
z: array([5.15118176, 5.15118176, 5.15118176])
z: array([-28.1899862, -28.1899862, -28.1899862])
z: array([497.96764796, 497.96764796, 497.96764796])
z: array([-519567.67103074, -519567.67103074, -519567.67103074])
/home/jack/dev/pyglmnet/pyglmnet/pyglmnet.py:679: RuntimeWarning: divide by zero encountered in divide
  update = 1. / hk * gk
It is strange, but I get exactly the same results with #278, even though the calculation of `z` is different there. Any tips on how I might debug this further?
This is a small zip file containing a script to reproduce the problem and the test data. https://www.dropbox.com/s/nvrg59t8acevaza/glm_example.zip?dl=1
After 2 hours of debugging, I think I found the problem. This is the script I was using to debug. I re-read the nice tutorial that @pavanramkumar wrote. From the reading, it seems to me that the caching inevitably requires careful handling of the first coordinate. Following this hunch, I modified this line:
to be:
for k in range(1, n_features + 1):
which seemed to (partially) fix the issue. Note that the exploding gradient is a separate issue, related to the `eta` parameter, which can be tweaked.
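To make the hunch concrete, here is a toy, self-contained coordinate descent on a least-squares problem (a sketch, not pyglmnet's `_cdfast`; the loss and variable names are stand-ins). Coordinate 0 is the intercept, and starting the loop at `k = 1` means it is handled separately instead of through the cached update:

```python
import numpy as np

# Toy coordinate descent on least squares. beta[0] is the intercept;
# the Newton loop starts at k = 1 so the cached linear predictor z is
# never corrupted by treating the intercept like a regular coordinate.
rng = np.random.default_rng(0)
n_features = 3
X = rng.normal(size=(50, n_features))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # true intercept is 0.3

beta = np.zeros(n_features + 1)                # [intercept, b1, b2, b3]
for _ in range(100):
    beta[0] = np.mean(y - X @ beta[1:])        # intercept handled separately
    z = beta[0] + X @ beta[1:]                 # cached linear predictor
    for k in range(1, n_features + 1):         # the modified loop bound
        xk = X[:, k - 1]
        gk = xk @ (z - y)                      # gradient for coordinate k
        hk = xk @ xk                           # Hessian for coordinate k
        beta[k] -= gk / hk                     # Newton coordinate update
        z = beta[0] + X @ beta[1:]             # refresh the cache

print(beta)                                    # ~ [0.3, 1.0, -2.0, 0.5]
```

In the real `_cdfast` the cache is updated incrementally rather than recomputed, which is exactly why the first coordinate needs careful handling.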
Based on this information, do you want to propose a proper fix in the form of a pull request? It will require some careful digging around, and unfortunately I don't have that kind of time. However, I would be happy to review the pull request. cc @peterfoley605, you may also be interested in this.
Thanks so much for digging into this, that's awesome!! I will take a look and think about this some more. With the resources you provided I might be able to figure this out.
okay great. Do not hesitate to make pull requests so that others in the community may benefit from your fixes instead of having to relearn the same lessons!
@jasmainak @cxrodgers please see PR #348
I have a dataset for which the loss grows without bound using the `cdfast` solver, but converges normally with `batch-gradient`. Below I've attached the data files to reproduce the error in case someone can figure out the problem. Otherwise, I would really appreciate any suggestions on how I can go about debugging this. Thanks!
Here's the problem happening:
Here's how it works with batch-gradient:
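The logs/plots from these two runs aren't preserved in this thread, but a minimal sketch of the comparison would look like this (assuming pyglmnet's `GLM` with its `solver` switch; `distr='poisson'` is a placeholder, and `endog`/`exog` are the arrays loaded below):

```python
from pyglmnet import GLM

# Same data, two solvers: with this dataset the loss diverges under
# 'cdfast' but converges under 'batch-gradient'. reg_lambda=0.1 matches
# the Lambda shown in the debug log above.
for solver in ['cdfast', 'batch-gradient']:
    glm = GLM(distr='poisson', reg_lambda=0.1, solver=solver, verbose=True)
    glm.fit(exog, endog)
    print(solver, glm.beta0_, glm.beta_[:5])
```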
Here are the data files `bad_endog` and `bad_exog` for reproducing the problem: https://www.dropbox.com/s/v3llq1umhrfxu49/bad_endog?dl=1 https://www.dropbox.com/s/vf0bhko9auguf7p/bad_exog?dl=1 They are plaintext and can be loaded as follows:
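(The original loading snippet wasn't preserved; a reasonable guess, since the files are plaintext, is `np.loadtxt`:)

```python
import numpy as np

# Assumed loading step for the plaintext files linked above:
endog = np.loadtxt('bad_endog')   # response vector
exog = np.loadtxt('bad_exog')     # design matrix
```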