h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.87k stars 2k forks

(due) GLM: Investigate long running glm multinomial with msgs like Got NonSPD matrix #12433

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

In a long-running multinomial GLM job, I'm seeing a lot of messages like this in the log:

```
INFO: Got NonSPD matrix with original rho, re-computing with rho = 1.0E-5
```

I'm running multinomial GLM with 16 categories, and I notice it's dramatically slower than binomial for the same number of rows. Is this a known issue?

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #91543 (https://support.h2o.ai/helpdesk/tickets/91543) - Slow multinomial GLM

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented:

- Default GLM: 1 minute
- Lambda search enabled: 11 minutes
- Lambda search + alpha = 1: 15 minutes
- Lambda search + alpha + L_BFGS: 4 hours (estimated)

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Found a bug with multinomial COD: the COD loop is run multiple times, and the approximations used inside the loop introduce errors. Filed JIRA: PUBDEV-5856. The workaround right now is to specify a solver other than COD. However, Michal Kurka has done that and still logged a long run time; I need to reproduce what he did and verify it.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: I have rerun the following on my machine; here is the timing:

- default GLM multinomial: 323 seconds
- GLM multinomial + lambda_search: 3725.88 seconds

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Ran again with lambda_search + IRLSM: 3387 seconds. I am also concerned about the following:

```
08-23 18:23:41.191 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] Class 0 got 4 active columns out of 88 total
08-23 18:23:41.191 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] Class 1 got 4 active columns out of 88 total
08-23 18:23:41.191 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] Class 2 got 4 active columns out of 88 total
08-23 18:23:41.191 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] Class 3 got 4 active columns out of 88 total
08-23 18:23:43.960 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] computed in 904+180+0+838=1922ms, step = 1.0, l1solver iter = 21, gerr = 7.385214423259756E-6
08-23 18:23:46.666 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] computed in 871+182+0+812=1865ms, step = 1.0, l1solver iter = 61, gerr = 2.6826751840195238E-5
08-23 18:23:54.411 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] Ls failed
08-23 18:23:57.773 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=231 lmb=.46E-4 obj=7.974E-4 imp=.89E-4 bdf=.16E-1] computed in 866+180+1+1486=2533ms, step = 1.0, l1solver iter = 45, gerr = 6.122621875838115E-6
08-23 18:24:01.679 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=232 lmb=.46E-4 obj=7.965E-4 imp=.12E-2 bdf=.17E0] computed in 884+196+0+850=1930ms, step = 1.0, l1solver iter = 22, gerr = 6.865258972511574E-6
08-23 18:24:09.522 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=232 lmb=.46E-4 obj=7.965E-4 imp=.12E-2 bdf=.17E0] Ls failed
08-23 18:24:17.225 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=232 lmb=.46E-4 obj=7.965E-4 imp=.12E-2 bdf=.17E0] Ls failed
08-23 18:24:19.905 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=232 lmb=.46E-4 obj=7.965E-4 imp=.12E-2 bdf=.17E0] computed in 869+176+0+816=1861ms, step = 1.0, l1solver iter = 47, gerr = 6.158095406164511E-6
08-23 18:24:25.210 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=233 lmb=.46E-4 obj=7.964E-4 imp=.12E-3 bdf=.17E-1] Ls failed
08-23 18:24:32.889 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=233 lmb=.46E-4 obj=7.964E-4 imp=.12E-3 bdf=.17E-1] Ls failed
08-23 18:24:40.786 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=233 lmb=.46E-4 obj=7.964E-4 imp=.12E-3 bdf=.17E-1] Ls failed
08-23 18:24:44.431 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=233 lmb=.46E-4 obj=7.964E-4 imp=.12E-3 bdf=.17E-1] Ls failed
08-23 18:24:45.290 192.168.86.20:54321 17869 FJ-1-7 INFO: GLM[dest=GLM_model_python_1535070483332_1, iter=234 lmb=.46E-4 obj=7.964E-4 imp=.0E0 bdf=.0E0] betaDiff < eps; betaDiff = 0.0, eps = 1.0E-4
```

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Verified and checked that the COD implementation is correct after fixing minor issues. I derived the correct coefficient updates by hand and compared them with our current implementation; here are the results:

| Beta from GLM code | Beta from Wendy's derivation | Difference |
| --- | --- | --- |
| 0.056785411 | 0.056785411 | 0 |
| 0.015581227 | 0.015581653 | -4.25485E-07 |
| -0.008747349 | -0.008765387 | 1.80378E-05 |
| 0.006234659 | 0.006237953 | -3.29382E-06 |
| 0.067213492 | 0.067276653 | -6.3161E-05 |
| -1.002393431 | -1.00344703 | 0.001053599 |
| -0.040749651 | -0.040750657 | 1.0058E-06 |
| -0.018741232 | -0.01875739 | 1.61579E-05 |
| 0.072655224 | 0.072719499 | -6.42748E-05 |
| -0.023517224 | -0.023606253 | 8.90298E-05 |
| 0.00277984 | 0.002775264 | 4.57691E-06 |
| -1.139473617 | -1.14143751 | 0.001963893 |
| 0.020136464 | 0.020153179 | -1.67143E-05 |
| 0.011974921 | 0.011969228 | 5.69307E-06 |
| -0.04059671 | -0.040551102 | -4.56083E-05 |
| 0.011425446 | 0.011371866 | 5.35804E-05 |
| -0.001950643 | -0.001892778 | -5.78653E-05 |
| -1.159709265 | -1.161613003 | 0.001903738 |
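As a quick sanity check on the agreement above, the table's columns can be loaded and the largest discrepancy computed. This is only a sketch over the values copied from the table:

```python
import numpy as np

# Columns copied from the comparison table above:
# GLM code coefficients vs. hand-derived coefficients.
beta_glm = np.array([0.056785411, 0.015581227, -0.008747349, 0.006234659,
                     0.067213492, -1.002393431, -0.040749651, -0.018741232,
                     0.072655224, -0.023517224, 0.00277984, -1.139473617,
                     0.020136464, 0.011974921, -0.04059671, 0.011425446,
                     -0.001950643, -1.159709265])
beta_derived = np.array([0.056785411, 0.015581653, -0.008765387, 0.006237953,
                         0.067276653, -1.00344703, -0.040750657, -0.01875739,
                         0.072719499, -0.023606253, 0.002775264, -1.14143751,
                         0.020153179, 0.011969228, -0.040551102, 0.011371866,
                         -0.001892778, -1.161613003])

# Difference column reproduced; the worst-case disagreement is about 2e-3.
diff = beta_glm - beta_derived
print(np.max(np.abs(diff)))
```

The maximum absolute difference matches the largest entry in the table's Difference column (0.001963893), consistent with the approximation error Wendy describes.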

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: As long as we do not run the COD loop more than once and keep the number of predictors small, it should be okay.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Okay, I am resolving the issue with COD for multinomial; the work is captured in PUBDEV-5856. Please go there for further progress on COD.

Regarding the warning message with IRLSM, will investigate it in PUBDEV-5891.

The workaround for now is to use COD for multinomial and enable lambda search, but reduce the number of lambdas to search through; the default is 100, which can be prohibitive.
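A sketch of that workaround configuration using the H2O Python API. The parameter names (`solver`, `lambda_search`, `nlambdas`) follow `H2OGeneralizedLinearEstimator`; the training calls are shown commented out because they assume a running H2O cluster and a hypothetical data file:

```python
# Workaround per the comment above: multinomial GLM with COD,
# lambda search enabled, but far fewer lambdas than the default of 100.
glm_params = {
    "family": "multinomial",
    "solver": "COORDINATE_DESCENT",  # COD
    "lambda_search": True,
    "nlambdas": 30,                  # reduced from the default of 100
}

# With a cluster available, training would look like:
# import h2o
# from h2o.estimators import H2OGeneralizedLinearEstimator
# h2o.init()
# frame = h2o.import_file("training_data.csv")   # hypothetical path
# model = H2OGeneralizedLinearEstimator(**glm_params)
# model.train(y="response", training_frame=frame)  # "response" is a placeholder column name
print(glm_params)
```

Lowering `nlambdas` trades off the resolution of the regularization path against run time, which is the point of the suggestion.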

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: Run time experiments after PUBDEV-5856:

| New COD runtime (s) | Old COD runtime (s) |
| --- | --- |
| 175.910 | 220.808 |
| 157.648 | 183.608 |
| 150.475 | 221.062 |
| 179.849 | 238.393 |
| 158.197 | 202.698 |
| **Average: 164.416** | **Average: 213.314** |

Runtime improvement: 22.923%

There is a modest saving of 22.9% in run time after the changes in PUBDEV-5856. However, the bigger gain will come later, after I decouple the calculation of the Hessian matrix from COD, which only needs its diagonal elements.

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: The message `INFO: Got NonSPD matrix with original rho, re-computing with rho = 1.0E-5` comes from the ADMM part of the algorithm. Refer to page 43 of *Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers* by Stephen Boyd et al. We are searching for a rho.
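For context, the standard lasso-style ADMM x-update from Boyd et al. in which this rho appears (a textbook form, not copied from the H2O source) is:

```latex
x^{k+1} := \left(A^{T}A + \rho I\right)^{-1}\left(A^{T}b + \rho\,(z^{k} - u^{k})\right)
```

Solving this update requires factorizing $A^{T}A + \rho I$, which is where a poorly chosen $\rho$ can leave the matrix non-SPD numerically.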

exalate-issue-sync[bot] commented 1 year ago

Wendy commented: From page 43 of the Boyd et al. document, we have the following:

![screenshot-1.png](https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5571/screenshot-1.png)

The warning message appears when we choose a rho that leaves the (A^T A + rho*I) part not symmetric positive definite (non-SPD). As we search for the best rho, there will be cases where the warning comes up; this is normal algorithm behavior. An error is thrown only when we fail to find any rho that makes (A^T A + rho*I) SPD.
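To illustrate the behavior behind the log message, here is a minimal sketch of such a rho search in NumPy. This is not H2O's actual code: it simulates a Gram matrix with a tiny negative eigenvalue (as round-off can produce) and grows rho until a Cholesky factorization, the standard SPD test, succeeds:

```python
import numpy as np

def find_spd_rho(gram, rho=1e-5, max_tries=20):
    """Grow rho until gram + rho*I admits a Cholesky factorization (i.e. is SPD)."""
    n = gram.shape[0]
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(gram + rho * np.eye(n))
            return rho
        except np.linalg.LinAlgError:
            # Mirrors the spirit of: "Got NonSPD matrix ... re-computing with rho = ..."
            print(f"Got NonSPD matrix, re-computing with rho = {rho * 10:.1E}")
            rho *= 10
    raise np.linalg.LinAlgError("failed to find a rho that makes the matrix SPD")

# Build a symmetric matrix with eigenvalues {1.0, 0.5, 0.1, -5e-4}:
# the small negative eigenvalue mimics numerical round-off in a Gram matrix.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(4, 4)))
gram = Q @ np.diag([1.0, 0.5, 0.1, -5e-4]) @ Q.T

rho = find_spd_rho(gram)  # rho = 1e-5 and 1e-4 fail; 1e-3 succeeds
```

With a minimum eigenvalue of -5e-4, the first two rho values still leave the matrix non-SPD, so the "NonSPD" message fires twice before rho = 1e-3 works, matching Wendy's point that the warning is an expected part of the search rather than a failure.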

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5571
Assignee: Wendy
Reporter: Nidhi Mehta
State: Closed
Fix Version: 3.22.0.2
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: screenshot-1.png
Attached By: Wendy
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5571/screenshot-1.png