h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

h2o flow unresponsive? #16331

Closed blgodwin closed 1 month ago

blgodwin commented 1 month ago

Connecting to h2o through R v. 4.3.3 H2O cluster version: 3.46.0.4

I connect to h2o through R and then run a PCA model on a large dataset (30k+ data points) through the GUI webpage interface. I import the data, parse it, view it, and then run the job. The job runs through all the iterations of alternating minimizations until it says "100% progress," which takes about 1 minute 20 seconds. After that the progress bar does not change, but the job status still says "RUNNING." If I click the "View" action, a grey line appears that says "getModel 'modelname'" and nothing more, and a yellow bar appears at the bottom saying "Requesting http://localhost.54321/...."

Then it just stays this way. It has not given me a termination error, but it also remains unresponsive. I am unsure if it is just taking a long time to run or if it is indeed broken and not working. At the time of this writing I have let it sit for about 2 days. If I try to investigate the status using R it is also unresponsive.

Is this working as intended or have I encountered an error?

Thanks!

image

image

image

tomasfryda commented 1 month ago

Thank you for reporting it. It doesn't seem to be working as expected. Is it possible it ran out of memory?

Could you provide us with the backend logs (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/logs.html)? They will likely suffice to find out what is wrong, but any additional information you can provide will make it easier for us.

Would you be able to check whether it is a browser/serialization issue? I would start by right-clicking on the page -> "Inspect" and checking for errors in the "Console" tab. If there are none, could you rerun it with the "Network" tab open to see whether there is a long reply that could be blocking the backend?

blgodwin commented 1 month ago

h2o_127.0.0.1_54321-1-trace.log h2o_127.0.0.1_54321-2-debug.log h2o_127.0.0.1_54321-3-info.log h2o_127.0.0.1_54321-4-warn.log

Thanks so much for your response! I uploaded some of the log files I found. It does seem like it was an OOM error. I admit to being out of my depth here - h2o was suggested to me because the PCA was too big for my computer but I suppose I don't know how to set up h2o properly. Is it possible to get enough memory to run this analysis? If so, can you please explain how? You can see in my screenshot of RStudio above I requested 20GB but it said 9.98 cluster memory. I had requested 9 days prior, so perhaps I need to clear something? I'm assuming I will actually need to have much more than even 20GB?

Thanks again!

tomasfryda commented 1 month ago

In the RStudio screenshot, h2o is already running, so h2o.init() just connects to the running instance and can't change the memory size. You can shut down h2o and then start it again with h2o.init(), but please make sure to also set the maximum memory size (just to rule out the possibility of being limited by a maximum size that is smaller than the minimum).
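For illustration, here is a minimal Python sketch of that restart sequence. The helper names (`mem_to_bytes`, `check_mem_sizes`, `restart_h2o`), the "4G"/"20G" values, and the five-second pause are assumptions made for this example, not something prescribed in this thread. In R the equivalent calls would be h2o.shutdown(prompt = FALSE) followed by h2o.init(min_mem_size = ..., max_mem_size = ...).

```python
import re
import time

# Multipliers for the size suffixes used in strings such as "20G" or "512M".
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def mem_to_bytes(size):
    """Convert a size string such as '20G' or '512m' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([KMGT])B?", size.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized memory size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_mem_sizes(min_mem, max_mem):
    """Raise if the requested minimum is larger than the requested maximum."""
    if mem_to_bytes(min_mem) > mem_to_bytes(max_mem):
        raise ValueError(f"min_mem_size {min_mem} exceeds max_mem_size {max_mem}")
    return min_mem, max_mem


def restart_h2o(min_mem="4G", max_mem="20G"):
    """Shut down the running local cluster and start a new one with explicit sizes.

    The default sizes are hypothetical; choose values that fit your machine.
    """
    import h2o  # imported lazily so the helpers above work without h2o installed

    check_mem_sizes(min_mem, max_mem)
    h2o.init()                            # attaches to the instance that is already running
    h2o.cluster().shutdown(prompt=False)  # stop it so the new sizes can take effect
    time.sleep(5)                         # assumed pause to let the old JVM exit
    h2o.init(min_mem_size=min_mem, max_mem_size=max_mem)
    return h2o.cluster()
```

Calling `restart_h2o("4G", "20G")` and then `h2o.cluster().show_status()` should make it possible to confirm whether the new instance reports the requested memory.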

I would also try specifying different pca_method values (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/pca.html).
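One way to compare the methods is a small loop that trains once per pca_method value and records which ones succeed. The sketch below is only an illustration: `try_each_method` and `train_h2o_pca` are hypothetical helpers, and k=4 and impute_missing=True are assumed settings rather than recommendations. The method strings are the lowercase spellings used by the Python client.

```python
def try_each_method(train_fn, methods=("gram_s_v_d", "power", "randomized", "glrm")):
    """Call train_fn(method) for each method and record whether it succeeded."""
    results = {}
    for method in methods:
        try:
            train_fn(method)
            results[method] = "ok"
        except Exception as exc:  # keep going so one failing method does not hide the others
            results[method] = f"failed: {type(exc).__name__}"
    return results


def train_h2o_pca(frame, method, k=4):
    """Train one PCA model on an H2OFrame; k and impute_missing are assumed settings."""
    from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator

    model = H2OPrincipalComponentAnalysisEstimator(
        k=k, pca_method=method, impute_missing=True
    )
    model.train(x=frame.names, training_frame=frame)
    return model
```

With a frame already loaded as `data`, `try_each_method(lambda m: train_h2o_pca(data, m))` returns a dictionary that maps each method name to "ok" or to the type of exception it raised.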

@wendycwong do you have any other ideas?

blgodwin commented 1 month ago

Hello,

I tried specifying min_mem_size and max_mem_size as seen below. Knowing that 10G was probably not enough, I just went up to 100G to see what happened. I uploaded my data through R and then went to the GUI interface to build the model, hoping that any errors would be clearly displayed there. Now when I go to build the PCA model, before even running it, it says H2O is no longer connected. I got the same error in both Chrome and Firefox.

What am I doing wrong?

Thanks again!

image image image

wendycwong commented 1 month ago

This is baffling. Your dataset size is not that big and your memory allocation is fine. There really is no reason to see the failure. Is it possible to share your dataset so we can reproduce the error here locally and fix it?

blgodwin commented 1 month ago

Of course! I'm sure there is some mistake on my end that I'm not expert enough to know to even mention here. It does have missing data if that helps troubleshoot, though I thought I picked the correct parameters to deal with that. I was going to test several combinations of the PCA methods (e.g., standardized or not) after I confirmed one of them worked.

Additionally, I was and still am able to get a PCA working with a smaller dataset.

Zipped data file is attached: pntest_mean_SNP_PCA_noheader.zip

Thanks so much for your help! It's very much appreciated.

wendycwong commented 1 month ago

@blgodwin

Thank you so much for providing me with the information. Will try it out and let you know.

wendycwong commented 1 month ago

@blgodwin

I played with your dataset in Python and executed the following:

import h2o
from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator
h2o.init(strict_version_check=False)
data = h2o.import_file("pntest_mean_SNP_PCA_noheader.txt")
fitModel = H2OPrincipalComponentAnalysisEstimator(k=4, impute_missing=True)  # you have many NA's in your dataset.
fitModel.train(data.names, training_frame=data)

I got the following result:

image

wendycwong commented 1 month ago

However, I do run into one problem. When I set pca_method="glrm" like here:

fitModel = H2OPrincipalComponentAnalysisEstimator(k=4, pca_method="glrm", use_all_factor_levels=True)
fitModel.train(x=data.names, training_frame=data)

I run into a NullPointerException (NPE). I have opened an issue to resolve this: https://github.com/h2oai/h2o-3/issues/16335

This is embarrassing.

wendycwong commented 1 month ago

Okay, I loaded your dataset into Flow using Chrome and chose the following parameters in my model building:

image

I was able to get a model out:

image

blgodwin commented 1 month ago

That's great! Thank you for taking the time to try that out!

I'm still having the same issue. Unless I'm missing something, I set up the PCA just like your screenshot above.

image

And I get an unresponsive error almost immediately. The only difference this time is that the progress bar doesn't go up to 100% before it quits.

image

I see this error in both Chrome and Firefox.

Is it something about how I'm connecting to h2o?

tomasfryda commented 1 month ago

@blgodwin I would recommend trying it in R or Python. Flow gets much less attention than the clients for those languages, so it may have more bugs than the R or Python client.

One thing that is probably different is that we use macOS and Linux for development and testing, so it's possible the bug is related to the OS you use, or it might be due to a newer version of Java. IIRC from the logs, you use a Java version that's not yet officially supported by us.

If you can run h2o on a different OS, it might help. If that would be too complicated, you might try a different (older) Java version; AFAIK we support Java 8 to 17. Or you might try running h2o in Windows Subsystem for Linux.

@wendycwong knows more about our PCA implementation, so she might have some more ideas about what to try if you feel uncomfortable with installing a different Java version etc.
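As a quick way to see which Java version is on the PATH before changing anything, a small script can parse the output of `java -version`. This is a sketch only: the helper names are hypothetical, and the 8 to 17 range is taken from the "AFAIK" remark above, not from an authoritative support matrix.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_281" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        proc = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # The version banner is normally written to stderr.
    return parse_java_major(proc.stderr + proc.stdout)


def in_supported_range(major, lowest=8, highest=17):
    """Check against the range mentioned in this thread (assumed, not authoritative)."""
    return major is not None and lowest <= major <= highest
```

For example, `in_supported_range(installed_java_major())` returns False when no `java` is found or when the detected major version falls outside the assumed range.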

blgodwin commented 1 month ago

I got it to work if I did not specify any min_mem_size, max_mem_size, or threads!

PCA ran with a simple "localh2o = h2o.init()" to connect

tomasfryda commented 1 month ago

That's great! Thank you for mentioning how you resolved that!