theophilusx / ssh2-sftp-client

a client for SSH2 SFTP
Apache License 2.0
802 stars 196 forks source link

uncaughtException Error: connect: read ECONNRESET #545

Closed ertl closed 1 month ago

ertl commented 1 month ago

My server crashes when the sftp server is not available, while creating a connection.

Im using:

NodeJS: v20.13.1 OS: Windows 10 ssh2: 1.15.0 ssh2-sftp-client: 10.0.3

  protected async doPoll(): Promise<{ url: string; success: boolean; error: string | null }> {
    try {
      await using sftp = await this.createConnection(); // connect & disconnect
      return { url: this.config.url, success: true, error: null };
    } catch (error: unknown) {
      this.logger.error('Polling-Request failed: ', error);
      return { url: this.config.url, success: false, error: error.toString() };
    }
  }

  protected async createConnection(options?: Client.ConnectOptions) {
    const sftpClient = new Client(this.config.name);
    const connectionInfo = options || (await this.getSFTPConnectionInfo(this.config, this.logger));
    const sftpWrapper = await sftpClient.connect(connectionInfo);
    if (sftpWrapper === undefined) {
      throw new Error('SFTP connection could not be established!');
    }
    return {
      client: sftpClient,
      [Symbol.asyncDispose]: async () => {
        try {
          await sftpClient?.end();
        } catch (error: unknown) {
          // if there is no SFTP connection available, we have no connection to close anyway, so we only log the other errors
          if (hasMessage(error) && error.message !== 'end: No SFTP connection available') {
            this.logger.error(error.message, error);
          }
        }
      }
    };
  }

ERROR:

  uncaught Error: connect: read ECONNRESET at uncaughtException Error: connect: read ECONNRESET
    at Client.fn (C:\prj\mylib\dist\apps\server\node_modules\ssh2-sftp-client\src\utils.js:20:22)
    at Client.emit (node:events:531:35)
    at Socket.<anonymous> (C:\prj\mylib\dist\apps\server\node_modules\ssh2\lib\client.js:807:12)
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at processTicksAndRejections (node:internal/process/task_queues:82:21)

C:\prj\mylib\dist\apps\server\node_modules\ssh2-sftp-client\src\utils.js:20
    const newError = new Error(`${name}: ${err.message}`);
                     ^
Error: connect: read ECONNRESET

image

theophilusx commented 1 month ago

I cannot reproduce this error using just plain ssh2-sftp-client (no typescript and no higher level abstraction/class or based on the code snippet supplied). This strongly suggests the problem is being introduced by your code or by incorrect use of the module. However, I cannot verify this based on supplied information.

In order for me to investigate further, I would need a complete minimal working code example that will reproduce the error.

I do note that the error is beding generated in utils.js line 20, whicdh is the general error handling function i.e. where the module raises all errors it encounters. This would indicate the problem is in your code and is due to a missing error handler in your code i.e. no try/catch in place to catch the error, so it bubbles up and causes your code to exit wsith the exception.

ertl commented 1 month ago

I don't think it's realted to missing try/catch blocks.

But I agree with you, the error is very hard to reproduce.

I've created following repository: https://github.com/ertl/ssh-error

Before you start, adopt the option constant with your ConnectionOptions.

To start the program run pnpm install && pnpm start

I can reproduce the error by doing this:

Pull the network cable and wait for an error. If the Programm didn't stop plug it in and repeat.

theophilusx commented 1 month ago

I have looked at your repo, but I'm not sure it is actually testing what you think it is testing. At the very least, it has a timing issue and you cannot know/guarantee where in the script the error is being raised To understand, a bit of background might help.

The ssh2 module which ssh2-sftp-client is based on is an event based module. It does not use promises at all. The ssh2-sftp-client module wraps that event based API in a promised based API. However, a promised based approach does not map onto an event based model terribly cleanly. The main problem is that the actual sftp connection to the remote serber is outside the promise - it spans across multiple interavctions which are promised based. The problem is there are periods where no promise is in play, but an event can occur. For example, you do the following ...

try { const sftp = new Client(); await sftp.connect(); const listing = await sftp.list(); await sftp.end(); } catch (err) { console.error(err); }

As soon as the call to connect has completed, a connection to the remote sftp server has been created. From this point, an event might be raised at any instant. You then make a call to get a directory listing from the remote server. While that call is running, a promise is being fulfilled. That promise sets up listeners for error, end and close events. If any of these events occur during the call to get the directory listing, the promise is rejected (the promise may also be rejected for other errors, such as the remote directory not existing). If there are no events and no other errors, the promise is accepted and the directory data is returned, a successful fulfillment. All good.

However, there is a problem. What happens if an event is raised BETWEEN the completion of the connect() call and the call to list()? There is no promise in play at that point. For example, what happens if you just happened to pull the network cable at the precise moment between the end of the fist call and the start of the second or between the end of the second and just before the call to end()?

The ssh2-sftp-client module does have some global listeners to catch such events, but what is it supposed to do with that information? There are no promises in play to communicate the event back to the client via a reject. The only thing the module can do to communicate this error back to client code is raise it and rely on the client code to catch it and handle it appropriately.

So basically, you have to write your client code with the assumption that at any time, it is possible the module will throw an exception which you need to manage and that such exceptions could happen at any point between the creation of the sftp connection and when it is closed, including outside any calls to the module i.e. you might create a connection and then go off to do other work, such as build up a buffer of data you want to upload and while you are doing that other work, the connection is lost or closed by remote server and an exception is raised. If nothing in your code catches and handles that exception, it will bubble up to the top level and kill your process.

This isn't the only problem with trying to wrap an event based API with a promise based on. Although very unlikely, it is possible for some errors to be 'lost'. Consider the situation where an error event occurs just after a promise has been accpeted but before there has been time to remove the listeners associated with the promise. This will result in the error listener tied to that promise being called, which will call the promises reject() method. However, promises only allow you to call one resolution method, either accept or reject and once you have, any subsequent calls are ignored. This means the error is lost. It will typically result in the connection being closed and a new error will be generated when the client code attempts to use the connection again. Downside here is client code will only know the connection was closed with no details on why as this info was in the lost error. The good news is this is rare because the execution window where it could occur is small.

ertl commented 1 month ago

Thank you very much for your work and your detailed answer. Unfortunately, this is not very satisfying, because we all know Murphy's law: ‘Anything that can go wrong, will go wrong.’

Is there anything we can do to prevent the application from crashing, e.g.: add an additional error listener.

Is it generally advisable to use the Promise based API?

theophilusx commented 1 month ago

I'm a little concerned by your questions you may still not be fully understanding the issues here. There is no point adding another listener in the module, there is already one in place. The problem is the module listener cannot know/decide what to do with the event when it is raised. All it can do is re-raise it as an exception which your code then needs to handle, bringing us full circle back to where we are now. The only solution is that your code needs to catch this exception and do something appropriate when it happens. What is appropriate depends on your application - it could mean calling code to re-open the connection, it could be sending an error message to the user or it could even be to just ignore the error. It is completely application dependent Note that this is not something unique to this module. Any application which has to work with a stateful on-going network connection has the same issue.

Regarding your question regarding whether it is best to use the lower level event API or ssh2-sftp-clients API, sadly the answer is it depends on the application your developing and what you are more comfortable working with. For me, if it is an application which just needs to upload/download some data to/from an sftp server, possibly as a regular discrete task, I would probably use ssh2-sftp-client, but if I was writing something like a dedicated sftp client like lftp or putty, where I want more control, I would use the lower level event API.

ertl commented 1 month ago

The thing is that the exception ignores every try/catch block I added. As you can see in my repository, there are try/catch blocks here, but they don't catch the error and the application crashes. So I don't really know how to react to the error.

I agree with you that it is probably an asynchronous timing problem. I just don't know what to do about it.

theophilusx commented 1 month ago

The only way that an exception can bubble up and kill your process is if the exception occurs outside a try/catch block. Remember, you cannot predict exactly when an event emitter raises the exception event. This is a common error people make when working wsith an event emitter. In fact, it is so common, there is even a warning about it in the node documentation (I think it is in the events section of the docs)

Try moving the try/catch in your main() function to outside the while loop. Do something like

function main() { try { ... } catch (err) { console.error(err); } }

Also, you may want to try running against the current git repo version. I plan to release a new version soon and the new version has some unrelated bug fixes (dealing with append() and consistent cleanup of file descriptors after a put() error). I have also made some minor changes to the global event listeners which won't change the behaviour your seeing, but still good to be working against what is going to be the next version.

ertl commented 1 month ago

I have tested it again with the code change you suggested with version 10.0.3. However, the problem that the programm is crashing, still exists.

theophilusx commented 1 month ago

The language standard is pretty clear about this. If an exception is thrown and there is an enclosing try/catch block, it will catch the exception and handle it however defined by the catch handler, there is no other alternative. An exception can only bubble up to the top level and kill the process if it is not caught by a catch clause. This means that if your script is still terminating due to the exception it is because your code is not catching it.

Make sure your test code is actually terminating and not just printing the error and exiting, which can look very much like it did when it terminates due t the exception.

The only other suggestion I have is to add a try/catch block at the top level i.e. around the call to main and add a small sleep after the call to main to extend time before the try block terminates. It is possible that your code is still exiting before your script receives the error event The sleep after your call to main will reduce the likelihood of this happening.

ertl commented 1 month ago

The point is that even in this simple example, it is not so easy to catch all exceptions, which causes the application to crash. So if I understand it correctly, the error only occurs if there is no ErrorListener. This can happen at any time BEFORE the connect() or AFTER the end() call. In the case of a network error, does this information really matter after the connection has been closed or before it is established? If not, wouldn't it be better do ignore the error instead of throwing it?

theophilusx commented 1 month ago

The problem with that suggestion is how do you know if the error has occurred during a point where it doesn't matter.

The error can still bubble up when there is an error listener as the error listener might re-throw the error. In fact, this is what the default error handler does. When the default error handler listener gets an error event, it logs the event and then throws a normal error, allowing the client code to then catch that error and do do something.

The fact you are having trouble catching the error is another issue. I don't know why you are unable to catch the error. This is partly because I don't know Typescript and I think possibly due to you using the 'using' and Symbol.asyncDispose stuff, which is not part of Jfs yet and has to be transpiled into the code by the ts compiler. It may have nothing to do with things, just that as I'm not familiar with ts or this newish feature, I don't know the full implications. I do know that because of the way JS and node in particular handle async and promise code in different execution stacks, things can be 'surprising', especially with respecvt to try/catch.

FYI I have spent a lot of time on this today and have run your test code hundreds of times. While I have been able to generate errorrs, every one of them has been caught by the try/catch block in the doPoll() function. I even added a sleep statement to increase the likelihood that an error would occur outside the connect stage but before the call to end. Still no luck, all errors were caught in that function and none bubbled up and killed the process.

The omly issue I did see was that on about 3 or 4 occasions the script hung and did not exit and did not return. This should not happen, but I'm still investigating. I suspect it is related to the other changes I mentioned and a separate issue.

I'm currently re-writing your tests without TS and the 'using' syntax and seeing if my recent changes have any impact on the hanging issue or if I can reproduce your issue. (I'm also trying to find a way to reproduce the issue faster as testing is taking far too long).

One last thing I would like to point out is that despite weekly downloads of over 400k, you are the only person experiencing this issue. This would indicate either it is an issue localised to your environment or one which rarely occurs in practice. I will spend a bit more time on it, but if I cannot reproduce the issue, I will have no choice other than to mark it as unreproducible and move on.

theophilusx commented 1 month ago

Good news and bad news.

I have been able to reproduce the issue (I think). It appears these errors are timeout errors being thrown from within node timers. I think I may have a solution, but it isn't great and it will require users of the module to take additional steps.

For bacvkground, the issue is basically as outlined previously. The error event is occurring outside any of the promises which are actibve when there is interaction with the sftp connection. However, because this is occurring in asynchronous timer code, the error isn't propagated back into the main execution context. When these errors occur during interaction with the sftp connection, a promise exists and there is a listener which has reference to the reject function for that promise, so can propagate the error back to the main execution context. When outside active interaction with the sftp connection, there are no promises and therefore no link back to the main execution context. I incorrectly assumed listeners executed within the main execution contwext (don't ask me why, it is obvious they don't) and therefore raised errors would be 'catchable'.

My proposed solution is to modify the SftpClient constructor to also accept an error function which will be executed if the error has not been handled by one of the method listeners, allowing client code to react to the error however it see fit and remove throwing of excedptions by the global error listener. If no error handler is supplied, the global error listener will just log the error via the debugMsg function and ignore it. The default end and close listeners will ensure the internal referenvces to the srftp connection/socket within the SftpClient instance are cleaned up and made inactive so that further attempts to use the connection will result in normal promise reject behaviour.

ertl commented 1 month ago

Thanks for the update.

It's good to hear you've been able to reproduce the issue. The timeout errors from within node timers and their impact on asynchronous code execution make sense.

Your proposed solution to modify the SftpClient constructor to accept an error function seems like a reasonable approach. This would indeed allow more flexibility for client code to handle errors appropriately and ensure that uncaught errors are logged without causing unhandled exceptions.

theophilusx commented 1 month ago

As this module now does not throw any errors inside global event listeners and only does reject inside promise related listeners, any remaining problems with catching errors are outside control of this module and therefore not relevant so closing this issue as being resolved in the new version.

ertl commented 1 month ago

Great, thank you very much for your efforts.