sonyxperiadev / gerrit-events

MIT License
47 stars 62 forks source link

Fix [JENKINS-44568]: restore old ways of handling ssh disconnection #66

Closed Jimilian closed 7 years ago

Jimilian commented 7 years ago

If I understood the source code correctly, since https://github.com/sonyxperiadev/gerrit-events/commit/57163bb97b35aeb4ddf4c035692c424db71a86b8 re-connect is broken. Because we do not "close" the sshConnection, as it was done before. So sshConnection object assumes that connection still exists.

As I mentioned in https://issues.jenkins-ci.org/browse/JENKINS-44568, I'm not sure that this PR really fixes the stuff, because I didn't have time to test it, so, let's wait for some feedback.

ssbarnea commented 7 years ago

I am a little bit confused because this gives me the impression that it disconnects right after receiving the data which would make the idea of polling quite useless.

I doubt we can afford to be disconnected too often or for long periods of time because we will lose events, especially because not all servers do have the ability to play missed ones.

If this is fixing the current problem it means that somehow we can endup with stalled SSH sessions, ones that do report as connected but which in fact are not receiving any data.

rinrinne commented 7 years ago

Because we do not "close" the sshConnection, as it was done before.

Reconnection is performed if calling reconnect(). In this method, sshConnection.disconnect() is called. By this, sshConnection.isConnected() returns false. Also sshConnection.isClosed() returns true, Finally null is set to sshConnction inside of finally block in run().

sshConnection has possibility to be reused for executing other SSH commands on a generated SSH Channel. So it should not be updated in this location. Gerrit stream is output of a SSH command performed on a SSH channel. So please handle SSH Channel rather than SSH Connection.

Updating line 439 to "if (channel != null) {" would help you.

Jimilian commented 7 years ago

@ssbarnea, this code that I copy-pasted from old version must be called only in case of broken ssh connection or -1 from read (means end - channel is closed). Maybe we should distinguish this two, but in my PR I just tried to return something that was in place before to give it a try in a real life.


@rinrinne, maybe I understood the bug in wrong way, but here that we have: 1) read at some point starts to return 0, it does not return -1. 2) it's not clear how code moved to this state at first point.

So, I can only assume that: 1) connection was broken or '-1` was received 2) after that we started to re-use "broken" connection.

I can replace line 439, but I don't see how it could help. But I agree that there is some misalignment between channel.isClosed()/channel.close() and channel.isConnected()/channel.connect().

Regarding the rest:

Reconnection is performed if calling reconnect()

Correct me if I'm wrong, but it must be called explicitly. Before it was done implicitly by calling connect each time inside the loop. I can put reconnect instead of re-introduced code if you think that it improves readability. (small addition: nullifyWatchdog shutdowns StreamWatchdog on the line 438 - so, reconnect can not be called from watchdog anymore).

In this method, sshConnection.disconnect() is called

Do you mean current PR or upstream code? In current PR I call it, just to restore old behaviour (and trigger reconnection implicitly). If you think that disconnect is called in current upstream version as well, please, point me to this call, I failed to find it in new version.

rinrinne commented 7 years ago

Sorry for late reply. Maybe #67 is root cause.

  1. read at some point starts to return 0, it does not return -1.

-1 indicates connection is broken. 0 means buffer.read() reads nothing from InputStream. It is caused by:

  1. No incoming data is arrived.
  2. Buffer for stored data is already full.

In this case, I guess that '2' caused this issue.

I can replace line 439, but I don't see how it could help.

Replacing line 439 makes channel disconnecting without checking channel status if channel has been created.

Before it was done implicitly by calling connect each time inside the loop.

Before introduced 57163bb, SshChannel was managed within SshConnection. So needed SshConnection.disconnect(). But after introducing non-blocking operation, SshChannel can be accessed within GerritConnection. So no need to treat SshConnection.disconnect() for each Gerrit stream-events.

Regarding JSch source code, channel.connect() has actual connection to remote service.

(small addition: nullifyWatchdog shutdowns StreamWatchdog on the line 438 - so, reconnect can not be called from watchdog anymore).

nullfyWatchdog() is called within finally block. It means that while loop for consuming stream is already exited. If GerritConnection.shutdown() is not called explicitly, consuming loop will be re-entered. This is the same as reconnection process.

Do you mean current PR or upstream code?

Means upstream code. So I submit my comment to this PR instead of each lines in your commit.

rsandell commented 7 years ago

So I now have three different approaches claiming to fix JENKINS-44568 and JENKINS-44414; #66, #68 and #69 That is great! but also confusing of which one is more credible. From what I can tell #68 has been soaking in the "real world" and claims to be working there and that is always a good sign and gets more credibility because of that. But #69 actually has unit tests that verifies something along the lines of the problem so we don't get regressions in the future which makes me more inclined to merge that.

66 has a healthy code review discussion that also makes it a bit safer to merge since it has had more eyes on it.

So which one should I go for?

mawinter69 commented 7 years ago

I'm sure that this #66 will not fix the problem, so does #67 not solve the problem though due to the high buffer of 256kb it becomes very unlikely to happen.

rinrinne commented 7 years ago

+1 to @mawinter69.

66 handles connection but it is not essentially needed. As I mentioned, this may affect other channels created from the same connection.

Regarding #67, I had similar concern with @mawinter69. So I started to create a patch to remove buffer size issue.

68 was submitted during my implementation. It solves such concern by another approach. Actually I just hesitated to upload my patch. However, I decided to do it because I thought that it would be helpful for something.

So I think it is better to adopt either #68 or #69.

Jimilian commented 7 years ago

I think somebody should give a try to #69 in real world or we can ask @mawinter69 to add a more unit tests (basically, it's the same test cases that were introduced in #69).

I agree that my PR doesn't solve the problem and I will close it after the real fix will be merged. But also I introduced JVM parameter to change the default size of SSH_RX_BUFFER and it can be used in case of different "reality".

Jimilian commented 7 years ago

https://github.com/sonyxperiadev/gerrit-events/pull/68 was merged. Close this one.