Closed exalate-issue-sync[bot] closed 1 year ago
Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f] might be related to the issue with SW
Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f] do you think this could be related to the hangs you found?
Jakub Hava commented: It seems like they are using 1 client and an h2o node. This seems like a slightly different issue: maybe the node lost the information about the client because of the network and the client then reconnected, which may have caused some issues.
I'll be finalising the change for the hang issue on Monday and will test this scenario as well, so we can tell for sure whether it's related.
Jakub Hava commented: Thinking more about this, these look like symptoms of the issue fixed by [PUBDEV-5206]: the same issue with client propagation to the rest of the nodes.
Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] I also looked into this briefly today and I can reproduce it.
Basically, client reconnection is broken. I will have a look at this next week, as this would be a good thing to fix.
The goal is to ensure that if the client disconnects, it can connect again later. The client is not part of the cluster, so it shouldn't affect the cluster.
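To make that goal concrete, here is a minimal toy model. It is not H2O's clouding code: `ToyNode` and all of its methods are hypothetical names used only for illustration. It assumes a node tracks cluster members and clients in separate structures, so dropping a silent client and rediscovering it later never changes the cloud size.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical, not H2O code): members and clients are tracked separately. */
final class ToyNode {
    private final Set<String> members = new HashSet<>();
    private final Map<String, Long> clientLastSeen = new HashMap<>();
    private final long clientTimeoutMillis;

    ToyNode(long clientTimeoutMillis) {
        this.clientTimeoutMillis = clientTimeoutMillis;
    }

    void addMember(String address) {
        members.add(address);
    }

    /** Records a heartbeat; returns true if the client is new or returning after a disconnect. */
    boolean onClientHeartbeat(String address, long nowMillis) {
        return clientLastSeen.put(address, nowMillis) == null;
    }

    /** Forgets clients that have been silent longer than the timeout; returns their addresses. */
    Set<String> expireClients(long nowMillis) {
        Set<String> dropped = new HashSet<>();
        clientLastSeen.entrySet().removeIf(entry -> {
            boolean stale = nowMillis - entry.getValue() > clientTimeoutMillis;
            if (stale) {
                dropped.add(entry.getKey());
            }
            return stale;
        });
        return dropped;
    }

    /** Cloud size counts members only; clients never contribute to it. */
    int cloudSize() {
        return members.size();
    }

    boolean knowsClient(String address) {
        return clientLastSeen.containsKey(address);
    }
}
```

In this model a reconnect is just another heartbeat from an address the node has already forgotten, which is the behaviour the fix should guarantee on every node.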
Jakub Hava commented: This is just speculation, and I couldn't see anything related to this in the logs, but it could also affect the external Sparkling Water cluster. Even though we have a consensus task checking that the client has been disconnected from all the nodes, it might be possible that when the client disconnects from just a single node and then reconnects, it leads to a deadlock as well.
Jakub Hava commented: Please see https://github.com/h2oai/h2o-3/pull/2054 with the fix
Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] just curious, why the postponing?
Michal Kurka commented: [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f], I am trying to put only the essential fixes in the current fix release.
Let's keep it in mind and, if we can, get it in. Right now I am too busy to review the PR, and I want to be super careful about fixes in clouding.
Jakub Hava commented: No worries, that's a valid point. I was just wondering :) Thanks for the explanation!
Jakub Hava commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] I'm aware of your point above, but I'm just checking whether it would be possible to put this change in at the end. From my point of view, this is a critical fix ensuring that client disconnection and reconnection work correctly.
JIRA Issue Migration Info
Jira Issue: PUBDEV-4865 Assignee: Jakub Hava Reporter: Pranas Baliuka State: Resolved Fix Version: 3.20.0.1 Attachments: N/A Development PRs: Available
I'm testing the following scenario:
1. Server node with default parameters: OK
2. Client node from R: OK
3. Client node from Java (exiting JVM): OK
The server log:
{quote}
...
08-31 14:20:01.241 10.0.1.6:54321 51786 #ogThread WARN: Client 10.0.1.6/10.0.1.6:54323 disconnected!
08-31 14:20:12.865 10.0.1.6:54321 51786 FJ-126-15 INFO: New client discovered at 10.0.1.6/10.0.1.6:54323
{quote}
Everything looks good on the server: the R client continues to operate as expected and Flow is working, but the Java client is just stuck ...
The code the client runs:
{code}
H2O.main("-name", "pranas", "-md5skip", "-log_level", "DEBUG", "-client");
H2O.waitForCloudSize(1, TimeUnit.SECONDS.toMillis(10));
final Key key = Key.make("fromJava");
final NFSFileVec lazy = NFSFileVec.make("ts.csv");
final Frame fr = ParseDataset.parse(key, lazy._key);
final Vec column = fr.vec("demand");
for (int i = 0; i < column.length(); i++) {
    System.out.println(String.format("Value@%s is %s", i, column.at(i)));
}
System.out.println("Key of the frame" + fr._key);
{code}
The code is based on https://github.com/h2oai/h2o-droplets/tree/master/h2o-java-droplet; the only difference is that the -client option is used in the unit test.
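While reproducing this, it can help to make the stuck call fail after a deadline instead of blocking the test forever. Below is a minimal sketch using only `java.util.concurrent`. `HangGuard` and `callWithTimeout` are hypothetical helper names, not part of H2O, and the timeout only surfaces the hang; it does not fix it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of H2O): runs a blocking call with a deadline. */
final class HangGuard {
    private HangGuard() {}

    /**
     * Runs the call on a daemon thread and waits at most timeoutMillis.
     * Returns the result, or throws IllegalStateException on timeout or failure.
     */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "hang-guard");
            thread.setDaemon(true); // a stuck call must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the worker thread
            throw new IllegalStateException(
                "call did not finish within " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the reproduction above, the parse step could be wrapped as `HangGuard.callWithTimeout(() -> ParseDataset.parse(key, lazy._key), 30_000)`, assuming that is the call that blocks; a thread dump taken when the timeout fires would then show where the client is waiting.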
P