h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Java client JVM blocks on 2nd invocation - latch/lock leak on server #11743

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I'm testing this scenario:

1. Server node with default parameters: OK
2. Client node from R: OK
3. Client node from Java (exiting JVM): OK
4. The same client node from Java, executed again: BLOCKS with the message "INFO: Locking cloud to new members, because water.fvec.NFSFileVec" in the log

The server log:
{quote}
...
08-31 14:20:01.241 10.0.1.6:54321 51786 #ogThread WARN: Client 10.0.1.6/10.0.1.6:54323 disconnected!
08-31 14:20:12.865 10.0.1.6:54321 51786 FJ-126-15 INFO: New client discovered at 10.0.1.6/10.0.1.6:54323
{quote}

Everything looks good on the server: the R client continues to operate as expected and Flow is working, but the Java client is just stuck ...

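The code the client runs:
{code}
// Start this JVM as an H2O client node and wait up to 10 s for a cloud of size 1.
// The arguments are passed as an explicit array, which compiles for a String[] parameter.
H2O.main(new String[] {"-name", "pranas", "-md5skip", "-log_level", "DEBUG", "-client"});
H2O.waitForCloudSize(1, TimeUnit.SECONDS.toMillis(10));

// Parse ts.csv into a frame stored under the key "fromJava".
final Key key = Key.make("fromJava");
final NFSFileVec lazy = NFSFileVec.make("ts.csv");
final Frame fr = ParseDataset.parse(key, lazy._key);

// Print every value of the "demand" column.
final Vec column = fr.vec("demand");
for (int i = 0; i < column.length(); i++) {
  System.out.println(String.format("Value@%s is %s", i, column.at(i)));
}
System.out.println("Key of the frame " + fr._key);

// Remove the frame from the DKV and check that it is gone.
DKV.remove(fr._key);
final Value removed = DKV.get(fr._key);
System.out.println("Retrieve after removal " + removed);
{code}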

The code is based on https://github.com/h2oai/h2o-droplets/tree/master/h2o-java-droplet. The only difference is that the -client option is used in the unit test.

P

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f] this might be related to the issue with SW (Sparkling Water)

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f] do you think this could be related to the hangs you found?

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: It seems they are using 1 client and an h2o node. This looks like a slightly different issue: maybe the node lost the information about the client because of the network and the client then reconnected, which may have caused some issues.

I'll be finalising the change for the hang issue on Monday, and I will test this scenario as well so we can tell for sure whether it's related.

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: Thinking more about this, these look like symptoms of the issue fixed by [PUBDEV-5206]: the same issue with client propagation to the rest of the nodes.

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] I looked into this briefly today as well, and I can reproduce it.

Basically, client reconnection is broken. I will have a look at this next week, as it would be a good thing to fix.

The goal is to ensure that if the client disconnects it can connect again later. The client is not part of the cluster so it shouldn't affect the cluster
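The stated goal can be pictured with a small sketch. It is not H2O's clouding code: the class `MembershipSketch` and all of its methods are hypothetical names invented for illustration, and the only idea taken from the comment above is that clients are tracked separately from cluster members, so locking the cluster to new members does not stop a client from reconnecting.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical membership table for illustration only; not H2O's implementation. */
class MembershipSketch {
    private final Set<String> members = new HashSet<>(); // worker nodes forming the cluster
    private final Set<String> clients = new HashSet<>(); // clients, tracked separately
    private boolean locked = false;

    /** Worker nodes may join only while the cluster is not locked. */
    synchronized boolean joinAsMember(String address) {
        if (locked) {
            return false;
        }
        return members.add(address);
    }

    /** After locking, the set of members is fixed. */
    synchronized void lock() {
        locked = true;
    }

    /** Clients are not members, so they may connect whether or not the cluster is locked. */
    synchronized void connectClient(String address) {
        clients.add(address);
    }

    synchronized void disconnectClient(String address) {
        clients.remove(address);
    }

    synchronized boolean isClientConnected(String address) {
        return clients.contains(address);
    }

    synchronized int memberCount() {
        return members.size();
    }

    public static void main(String[] args) {
        MembershipSketch cloud = new MembershipSketch();
        cloud.joinAsMember("10.0.1.6:54321");
        cloud.lock();
        cloud.connectClient("10.0.1.6:54323");
        cloud.disconnectClient("10.0.1.6:54323");
        cloud.connectClient("10.0.1.6:54323"); // second invocation of the same client
        System.out.println("client connected: " + cloud.isClientConnected("10.0.1.6:54323"));
        System.out.println("members: " + cloud.memberCount());
    }
}
```

In this sketch a client disconnect and reconnect never touches the member set or the lock flag, which is the property described above.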

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: This is just speculation, and I couldn't see anything related to it in the logs, but this could also affect the external Sparkling Water cluster. Even though we have a consensus task checking that the client has been disconnected from all the nodes, it might be possible that when the client disconnects from just a single node and then reconnects, it leads to a deadlock as well.
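To make the distinction in this comment concrete, here is a minimal sketch of a "disconnected from all nodes" check next to the partial case where only some nodes have lost the client. It is an assumption-laden illustration, not the consensus task from H2O: `ClientConsensusSketch` and its methods are hypothetical, and the per-node views are plain in-memory sets.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-node view of connected clients; not H2O's consensus task. */
class ClientConsensusSketch {
    // node address -> clients that this node currently sees as connected
    private final Map<String, Set<String>> viewByNode = new HashMap<>();

    void addNode(String node) {
        viewByNode.putIfAbsent(node, new HashSet<>());
    }

    void setConnected(String node, String client, boolean connected) {
        Set<String> view = viewByNode.get(node);
        if (view == null) {
            throw new IllegalArgumentException("unknown node " + node);
        }
        if (connected) {
            view.add(client);
        } else {
            view.remove(client);
        }
    }

    /** True only when no node still sees the client. */
    boolean disconnectedEverywhere(String client) {
        for (Set<String> view : viewByNode.values()) {
            if (view.contains(client)) {
                return false;
            }
        }
        return true;
    }

    /** True when some nodes see the client and others do not. */
    boolean partiallyConnected(String client) {
        int seenBy = 0;
        for (Set<String> view : viewByNode.values()) {
            if (view.contains(client)) {
                seenBy++;
            }
        }
        return seenBy > 0 && seenBy < viewByNode.size();
    }

    public static void main(String[] args) {
        ClientConsensusSketch cluster = new ClientConsensusSketch();
        cluster.addNode("node-a");
        cluster.addNode("node-b");
        cluster.setConnected("node-a", "client-1", true);
        cluster.setConnected("node-b", "client-1", true);

        // The client drops off a single node only.
        cluster.setConnected("node-a", "client-1", false);
        System.out.println("disconnected everywhere: " + cluster.disconnectedEverywhere("client-1"));
        System.out.println("partially connected: " + cluster.partiallyConnected("client-1"));
    }
}
```

The partial state is the one the comment speculates about: the all-nodes check does not fire, yet one node has already dropped the client when it reconnects.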

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: Please see https://github.com/h2oai/h2o-3/pull/2054 with the fix

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] just curious, why the postponing?

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f], I am trying to put only the essential fixes in the current fix release.

Let's keep it in mind and see if we can get it in. Right now I am too busy to review the PR, and I want to be super careful about fixes in clouding.

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: No worries, that's a valid point. I was just wondering :) Thanks for the explanation!

exalate-issue-sync[bot] commented 1 year ago

Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] I'm aware of your point above, but I'm just checking whether it would be possible to put this change in after all. From my point of view, this is a critical fix ensuring that client disconnection and reconnection work correctly.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4865
Assignee: Jakub Hava
Reporter: Pranas Baliuka
State: Resolved
Fix Version: 3.20.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/2054