pkg / sftp

SFTP support for the go.crypto/ssh package
BSD 2-Clause "Simplified" License

Upload file hangs #502


TimYi commented 2 years ago

package version: github.com/pkg/sftp v1.13.4

Error: after manually cutting off the network, the program hangs while uploading a file.

Expected: the network disconnection is detected and an error is returned in a timely manner.

Code:

targetFile, err := client.Create(targetPath)
if err != nil {
    log.Printf("can't open target file, the error is:\n%v", err)
    return
}
defer targetFile.Close()
sourceFile, err := os.Open(filePath)
if err != nil {
    log.Printf("can't open source file, the error is:\n%v", err)
    return
}
defer sourceFile.Close()
_, err = io.Copy(targetFile, sourceFile)  // hangs here
if err != nil {
    log.Printf("can't copy file to the remote server, the error is:\n%v", err)
    return
}

This matters for a program that continuously monitors a directory and uploads new files: otherwise a single network interruption may cause it to hang forever.

drakkan commented 2 years ago

Hi,

I did a quick test; the network disconnection is detected, but only after a very long time:

start 2022-03-16 16:27:50.780043475 +0100 CET m=+1.755726143
end, n 3440640, err connection lost, elapsed 17m16.004653336s

about 17 minutes in my test.

I think this error is detected in crypto/ssh. @puellanivis, do you have any idea?

puellanivis commented 2 years ago

🤔 I’m not sure how much we can really do here. Like you say, the network disconnect logic is not something we can control. We could perhaps build a watchdog, but the user could just as well build a watchdog themselves, and then they would have much better control over it.

If this is going into some sort of service, or long-lived program, you’ll probably want some sort of disconnect/reconnect logic anyways?

PS: If we do a v2 of the client API, then using contexts would make these timeout things much nicer.
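
A minimal sketch of that disconnect/reconnect idea; dial and uploadAll here are hypothetical helpers, not part of this package. On its own this does not unblock a hung Write; it needs to be paired with a watchdog or keepalive like the ones discussed below.

import (
	"log"
	"time"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

// uploadForever sketches the reconnect logic a long-lived uploader would
// need: on any error, tear the connection down, back off, and dial again.
func uploadForever(dial func() (*ssh.Client, *sftp.Client, error), uploadAll func(*sftp.Client) error) {
	for {
		sshConn, client, err := dial()
		if err != nil {
			log.Printf("dial failed, retrying: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}
		err = uploadAll(client)
		// Always tear down both clients before the next attempt.
		client.Close()
		sshConn.Close()
		if err != nil {
			log.Printf("upload failed, reconnecting: %v", err)
		}
		time.Sleep(5 * time.Second)
	}
}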

drakkan commented 2 years ago

Hi,

I don't think it will be that easy to interrupt a hanging read/write, but I could be wrong.

puellanivis commented 1 year ago

If I were running into this issue myself, I would probably work around it with a piecemeal Copy function that performs each write in a goroutine, with the primary goroutine waiting on a select { case <-time.After(…): … ; case <-errCh: … } to get a poor man's write deadline.
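
A minimal sketch of that idea (copyWithTimeout is a hypothetical name; the chunk size and error text are assumptions):

import (
	"errors"
	"io"
	"time"
)

// copyWithTimeout copies src to dst in chunks, performing each Write in a
// goroutine and giving up if a single Write takes longer than timeout.
func copyWithTimeout(dst io.Writer, src io.Reader, timeout time.Duration) (int64, error) {
	buf := make([]byte, 32*1024)
	var written int64
	for {
		nr, rerr := src.Read(buf)
		if nr > 0 {
			errCh := make(chan error, 1)
			go func(p []byte) {
				_, werr := dst.Write(p)
				errCh <- werr
			}(buf[:nr])
			select {
			case werr := <-errCh:
				if werr != nil {
					return written, werr
				}
				written += int64(nr)
			case <-time.After(timeout):
				// The hung Write goroutine is leaked here; only closing
				// the underlying connection will unblock it.
				return written, errors.New("write timed out")
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}

Note that on timeout the blocked Write goroutine stays alive, so the caller still needs to close the connection to clean up.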

I’m not sure that we could fix anything on our end… if the Write hangs, that’s kind of all we’ve got. I suppose it might be possible to plumb a WriteDeadline through… but actually, no. There are too many abstraction layers between our Write and the network connection itself.

drakkan commented 1 year ago

I agree, this can easily be handled on the application side. I periodically issue a Getwd command, and if I get no response within 30 seconds I close the connection.
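
A sketch of that keepalive, assuming the program keeps a handle to both the *sftp.Client and the underlying *ssh.Client (the function name and intervals are illustrative):

import (
	"time"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

// keepAlive probes the server with Getwd every interval; if a probe gets no
// response within timeout, it closes the SSH connection, which makes any
// hung reads/writes on the sftp client fail instead of blocking forever.
func keepAlive(client *sftp.Client, conn *ssh.Client, interval, timeout time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			resCh := make(chan error, 1)
			go func() { _, err := client.Getwd(); resCh <- err }()
			select {
			case err := <-resCh:
				if err != nil {
					conn.Close()
					return
				}
			case <-time.After(timeout):
				// No response in time: tear the connection down.
				conn.Close()
				return
			}
		}
	}
}

Run it in its own goroutine alongside the uploads, e.g. go keepAlive(client, sshConn, 10*time.Second, 30*time.Second, done).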