pkg / sftp

SFTP support for the go.crypto/ssh package
BSD 2-Clause "Simplified" License
1.52k stars 380 forks source link

Regression causing deadlock between File.Close and File.WriteTo #603

Open ncw opened 2 weeks ago

ncw commented 2 weeks ago

I noticed when I upgraded the dependencies for rclone that the sftp server tests have started deadlocking.

I used git bisect to narrow this down to this commit d1903fbd460e9a8105bae72fcdf492a4999b4cee

This made an enormous backtrace which I removed all traces without pkg/sftp in and deduplicated the traces which left us with three.

I think backtrace 1 is irrelevant. What appears to have happened is that a goroutine started (backtrace 3) which is holding the File lock

The code from rclone which causes this is this (backtrace 3)

func (f *Fs) newObjectReader(sftpFile *sftp.File) *objectReader {
    pipeReader, pipeWriter := io.Pipe()
    file := &objectReader{
        f:          f,
        sftpFile:   sftpFile,
        pipeReader: pipeReader,
        done:       make(chan struct{}),
    }
    // Show connection in use
    f.addSession()

    go func() {
        // Use sftpFile.WriteTo to pump data so that it gets a
        // chance to build the window up.
        _, err := sftpFile.WriteTo(pipeWriter)
        // Close the pipeWriter so the pipeReader fails with
        // the same error or EOF if err == nil
        _ = pipeWriter.CloseWithError(err)
        // signal that we've finished
        close(file.done)
    }()

    return file
}

And here (backtrace 2)

// Close a reader of a remote sftp file
func (file *objectReader) Close() (err error) {
    // Close the sftpFile - this will likely cause the WriteTo to error
    err = file.sftpFile.Close()
    // Close the pipeReader so writes to the pipeWriter fail
    _ = file.pipeReader.Close()
    // Wait for the background process to finish
    <-file.done
    // Show connection no longer in use
    file.f.removeSession()
    return err
}

So the problem is that rclone expects the line err = file.sftpFile.Close() to close the sftp File and return an error, while now what happens is that because of the locking added in d1903fbd460e9a8105bae72fcdf492a4999b4cee the Close waits indefinitely for the File.WriteTo() to finish which it won't since rclone has stopped putting data into it.

So I think this is a regression - this has worked perfectly for many years!

I haven't made a small reproduction - I probably can now I understand what is going on - but maybe you've got an immediate idea on how to solve this problem?

Backtrace 1

Backtrace 2

Backtrace 3

puellanivis commented 2 weeks ago

🤔 Right… so to address this, we would need to trigger a signal to close any outstanding WriteTo or ReadFrom (or heck, any outstanding Read or Write)…

Actually, let’s look at the behavior of os.File:

package main

import (
    "io"
    "fmt"
    "os"
    "sync"
)

func main() {
    f, err := os.Open("/dev/zero")
    if err != nil {
        panic(err)
    }

    rd, wr := io.Pipe()

    var wg sync.WaitGroup

    wg.Add(1)
    go func() {
        defer wg.Done()

        _, err := f.WriteTo(wr)
        _ = wr.CloseWithError(err)
    }()

    lr := &io.LimitedReader{ R: rd, N: 42 }

    b, err := io.ReadAll(lr)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(b))

    err = f.Close()
    if err != nil {
        panic(err)
    }

    wg.Wait()
}

output:

close-on-open-fd$ go run main.go | allcat -o clip:
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
        /usr/lib64/go/1.23/src/runtime/sema.go:71 +0x25
sync.(*WaitGroup).Wait(0xc0000b0180?)
        /usr/lib64/go/1.23/src/sync/waitgroup.go:118 +0x48
main.main()
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:42 +0x232

goroutine 6 [select]:
io.(*pipe).write(0xc0000b01e0, {0xc0000d0000, 0x8000, 0x8000})
        /usr/lib64/go/1.23/src/io/pipe.go:86 +0x1e5
io.(*PipeWriter).Write(0x55da90?, {0xc0000d0000?, 0x46567f?, 0xc00008e048?})
        /usr/lib64/go/1.23/src/io/pipe.go:161 +0x1a
io.copyBuffer({0x4deb38, 0xc0000b01e0}, {0x4dec38, 0xc00008e048}, {0x0, 0x0, 0x0})
        /usr/lib64/go/1.23/src/io/io.go:431 +0x1de
io.Copy(...)
        /usr/lib64/go/1.23/src/io/io.go:388
os.genericWriteTo(0xc00008e040?, {0x4deb38, 0xc0000b01e0})
        /usr/lib64/go/1.23/src/os/file.go:275 +0x4f
os.(*File).WriteTo(0xc00008e040, {0x4deb38, 0xc0000b01e0})
        /usr/lib64/go/1.23/src/os/file.go:253 +0x9c
main.main.func1()
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:24 +0x57
created by main.main in goroutine 1
        /home/snowgirl/Work/tmp/go/close-on-open-fd/main.go:21 +0x150
exit status 2

It seems here, while the Close() succeeds the WriteTo() doesn’t crash out or return an error.

🤔

P.S.: Adding a rd.Close() before the Wait allows it to complete and WriteTo returns io.ErrClosedPipe

puellanivis commented 2 weeks ago

Min-repro:

package main

import (
    "io"
    "fmt"
    "net"
    "os"
    "sync"

    "golang.org/x/crypto/ssh"
    "golang.org/x/crypto/ssh/agent"
    "github.com/pkg/sftp"
)

func main() {
    var auths []ssh.AuthMethod
    if aconn, err := net.Dial("unix", os.Getenv("SSH_AUTH_SOCK")); err == nil {
        auths = append(auths, ssh.PublicKeysCallback(agent.NewClient(aconn).Signers))

    }
    auths = append(auths, ssh.Password("notasecret"))

    config := ssh.ClientConfig{
        User:            "testuser",
        Auth:            auths,
        HostKeyCallback: ssh.InsecureIgnoreHostKey(),
    }

    addr := fmt.Sprintf("127.0.0.1:2022")

    conn, err := ssh.Dial("tcp", addr, &config)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    cl, err := sftp.NewClient(conn)
    if err != nil {
        panic(err)
    }
    defer cl.Close()

    f, err := cl.Open("/dev/zero")
    if err != nil {
        panic(err)
    }

    rd, wr := io.Pipe()

    var wg sync.WaitGroup

    wg.Add(1)
    go func() {
        defer wg.Done()

        _, err := f.WriteTo(wr)
        _ = wr.CloseWithError(err)
    }()

    lr := &io.LimitedReader{ R: rd, N: 42 }

    b, err := io.ReadAll(lr)
    if err != nil {
        panic(err)
    }

    fmt.Printf("%v\n", b)

    err = f.Close()
    if err != nil {
        panic(err)
    }

    wg.Wait()
}

This locks up:

$ go run main.go
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Adding an rd.Close() at the bottom causes the process to succeed, with the same io.ErrClosedPipe coming from the WriteTo.

It seems that things were only ever working because the io.Pipe is triggering the end. Moving the rd.Close() to before the f.Close() works even on v1.13.7.

ncw commented 2 weeks ago

Thank you so much for making a repro @puellanivis

You are right, closing things in a different order makes the deadlock go away

--- a/backend/sftp/sftp.go
+++ b/backend/sftp/sftp.go
@@ -2101,10 +2101,10 @@ func (file *objectReader) Read(p []byte) (n int, err error) {

 // Close a reader of a remote sftp file
 func (file *objectReader) Close() (err error) {
-       // Close the sftpFile - this will likely cause the WriteTo to error
-       err = file.sftpFile.Close()
        // Close the pipeReader so writes to the pipeWriter fail
        _ = file.pipeReader.Close()
+       // Close the sftpFile - this will likely cause the WriteTo to error
+       err = file.sftpFile.Close()
        // Wait for the background process to finish
        <-file.done
        // Show connection no longer in use

I think the new way is probably the best way and the fact that it worked before was due to an implementation detail of the sftp library.

I'm happy with that fix and so we can close this if you are happy?

puellanivis commented 2 weeks ago

Yeah, I’m still thinking I’d rather repeat the os.File.Close behavior, since we’re generally trying to mirror that usage. Though, it would probably be quite a bit of work to ensure we don’t also introduce a race condition, where the WriteTo could theoretically end up making one or more reads from a different file due to a recycled file handle on the server side.

I’ll timebox this to a week. I would like to at least get dev-v2 working the same as os.File.Close (that is, it deadlocks without the rd.Close() but at least doesn’t deadlock on file.Close()). Then depending on how much work that takes, I’ll have a perspective on how to bring the fix to v1.

ncw commented 1 week ago

Sounds like a plan! For my use, I'll put the above tweak in and update rclone to v1.13.7 for its next release.

I haven't looked at the v2 branch yet - should I?

puellanivis commented 1 week ago

Yeah, I think the tweak of closing the read pipe first is definitely the better solution in general. 👍

Yes, you can look at the dev-v2 branch already. 😁

puellanivis commented 1 week ago

604 works! 😁 Just like the os.File package in a couple of different configurations.