pkg / sftp

SFTP support for the go.crypto/ssh package

Incomplete downloads #424

Open mrwonko opened 3 years ago

mrwonko commented 3 years ago

Since upgrading to 1.13, the library often reports EOF before a file opened with Client.Open() has been downloaded completely. For example, a 634618-byte file stopped downloading after just 65536 bytes, and a 2367599-byte file after 2031616.

As a workaround I've added a call to File.Stat().Size(), which returns the correct file size, and I compare that with the downloaded size to check for success.
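
Roughly, the workaround looks like this (dst stands in for wherever the data is written):

    f, err := sftpClient.Open("remote/path")
    if err != nil {
        return err
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        return err
    }

    n, err := io.Copy(dst, f)
    if err != nil {
        return err
    }
    if n != fi.Size() {
        return fmt.Errorf("incomplete download: got %d of %d bytes", n, fi.Size())
    }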

puellanivis commented 3 years ago

Queries: Is it returning an io.EOF error, or a nil error? How are you transferring the file? Client.Open() and then an io.Copy?

mrwonko commented 3 years ago

Yes, Client.Open() followed by io.Copy(), which returns a nil error, implying that the reader itself returned io.EOF.

puellanivis commented 3 years ago

Hm… it’s proving pretty hard to pin down what might be going wrong in WriteTo. I see one path that could result in err == nil without an io.EOF, but nothing closes any of those channels anyway, so it seems unlikely to be that, and I don’t really see how we could be getting an io.EOF from anything other than an SSH_FXP_STATUS.

Clearly some number of trailing bytes are not being committed: the two mistaken file sizes are at 32768-byte boundaries…

mrwonko commented 3 years ago

the two mistaken file size examples are at 32768 byte boundaries

Sounds right; other examples like 1343488, 1638400 and 786432 also fit that pattern.
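
For reference, each of those is a whole multiple of 32768:

    for _, n := range []int64{65536, 2031616, 1343488, 1638400, 786432} {
        fmt.Println(n, n%32768 == 0, n/32768) // all true: 2, 62, 41, 50, 24 chunks
    }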

puellanivis commented 3 years ago

I’m thinking we might want to try a cherry-pick that covers the unexpected-channel-close case. I’m entirely unsure how that could ever trigger, but it might be the problem, as it is the only other way we can get a nil error without an EOF, and it very clearly shouldn’t be a nil error… it was just left that way as a hold-over from an earlier draft.

This could also explain the observed behavior: some number of 32 KiB writes work, then we receive on a closed channel and stop all transfers.

drakkan commented 3 years ago

@puellanivis thank you. So if I want to try to replicate this issue, it should happen only for unexpected transfer errors, right?

puellanivis commented 3 years ago

@drakkan That’s the thing, this really shouldn’t happen for unexpected transfer errors either. Those should all percolate some non-EOF error up the line as well? 🤔

drakkan commented 3 years ago

I tried to replicate this issue, with no luck, using something like this:

    // sftpClient is an already-connected *sftp.Client; file.zip on the server is 61449068 bytes.
    for i := 0; i < 1000; i++ {
        start := time.Now()
        f, err := sftpClient.Open("file.zip")
        if err != nil {
            panic(err)
        }
        dst, err := os.Create("/tmp/file.zip")
        if err != nil {
            panic(err)
        }
        n, err := io.Copy(dst, f)
        fmt.Printf("it %v written %v, err %v elapsed %v\n", i, n, err, time.Since(start))
        if n != 61449068 {
            fmt.Printf("errrrrrrrrrrrrrrrrrrrrrr\n\n\n\n")
        }
        f.Close()
        dst.Close()
    }

@mrwonko, can you please confirm that #425 fixes this issue in your case?

@puellanivis I tried this over a VPN connection. To trigger the error I disconnected the VPN; in that case more than 15 minutes are needed before the connection-lost error is triggered:

it 0 written 2654208, err connection lost elapsed 17m10.208717111s

What do you think? Should we add a context or something similar? Thank you.
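
As a sketch of what I mean (the client API itself doesn't take a context, so this just closes the client when a deadline expires to make a stalled copy fail quickly instead of waiting for the TCP timeout):

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute) // deadline chosen arbitrarily
    defer cancel()

    done := make(chan struct{})
    go func() {
        select {
        case <-ctx.Done():
            sftpClient.Close() // tears down the connection, so the blocked io.Copy returns an error
        case <-done:
        }
    }()

    n, err := io.Copy(dst, f)
    close(done)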

mrwonko commented 3 years ago

I will give #425 a try next week and let you know if it helps.

mrwonko commented 3 years ago

I deployed #425 on our preproduction system a couple of hours ago and there has already been a first partial download: 98304 bytes instead of 240430. I'll keep it there until tomorrow to see if it happens again, but then I'll probably roll back; it doesn't seem to work.

mrwonko commented 3 years ago

There have been about a dozen failed downloads at this point; #425 doesn't seem to fix our issue.

drakkan commented 3 years ago

@mrwonko can you please provide a reproducer? I tried the above code with no luck.

I'm not very familiar with client-side code, but if I have an easy way to reproduce the issue I'll try to fix it. Thank you!

puellanivis commented 3 years ago

The only thing I can think of now is the possibility that there are writes and reads happening to the remote file at the same time. So, when the read tries to read a value at offset XYZ, it’s met with an io.EOF and assumes the read is complete. Since the client can really only receive either data or an io.EOF, that means this setup will only ever write out some number of whole 32k chunks. 🤔 This still seems so weird though.

puellanivis commented 3 years ago

Going through it with a fine-toothed comb again, I’m seeing numerous small issues here and there. I’ll be making a PR to address them, ensure that sendPacket never returns an io.EOF error, and make readChunkAt refuse to recognize err == io.EOF from sendPacket as well.

Proper EOFs in the SFTP library should only ever come via an SSH_FX_EOF status packet, never raw from sendPacket.

drakkan commented 3 years ago

@puellanivis thanks for your patch.

@mrwonko can you please give #429 a try?

mrwonko commented 3 years ago

I'll give that new patch a try.

I've not yet been able to reproduce this locally due to unrelated issues, so instead I'm testing using our normal downloaders on our preproduction cluster, which reliably reproduces it. That said, the snippet you posted should theoretically be sufficient for a reproduction.

mrwonko commented 3 years ago

I still get partial downloads with #429.

puellanivis commented 3 years ago

I still get partial downloads with #429.

🙃 wth?

puellanivis commented 3 years ago

What are the architectures and programs running on both sides of this? Like, clearly the client in this situation is from our library, but what kind of server is it connecting to? Is that server running off of our package as well?

puellanivis commented 3 years ago

I know this might be a long shot, but do you still get incomplete downloads with #435 ?

mrwonko commented 3 years ago

Our client runs in an x86_64 docker container. All I know about the server is that it runs on Windows, and I'm pretty sure it doesn't use this library.

Since we don't run in 32 bit, I see no point in trying #435?

puellanivis commented 3 years ago

So, the problem was manifest on 32-bit, but there were still other possible edge conditions that could have manifested the same behavior. I’m honestly not sure, and I’m grasping at straws here.

carsten-luckmann commented 3 years ago

As far as I understand, the server is running JSCAPE and accesses the files it serves via some network file system.

drakkan commented 3 years ago

@carsten-luckmann @mrwonko we tried to reproduce the issue with no luck, so you have two choices:

1) Use v1.12 forever and/or hope that a refactoring of the library will also fix this regression.
2) Provide more information. A small reproducer would be ideal, but adding some debug logs in pkg/sftp (printing, for example, the concurrency value for the non-working case) could also help us.

Thank you!

j0hnsmith commented 3 years ago

I might be seeing this problem too. Using v1.13.0 we're doing

    fp, err := sftpClient.Open("somepath")
    ...
    data, err := ioutil.ReadAll(fp)
    ...

There are no errors (so I assume io.EOF was seen), but the data is short: in our case, for 74 kB and 69 kB files, the exact size written is 65024 bytes. Unfortunately the server deletes the files as soon as they're read to completion (the server did delete the files, which suggests the data was read but not returned fully), so we can't try again with different branches/PRs. Given that the size is different from those reported above and io.Copy() wasn't used, I'm not sure if it's the same problem or not.

puellanivis commented 3 years ago

:scream: I think I might see what’s going on in your situation, @j0hnsmith. While using concurrent reads, if we request the end-of-file read, the end server might trigger its “download is complete” functionality, delete the file from its side, and maybe even truncate it to 0 length.

This would mean any later read at any point in the file would end up with an EOF, since it is now beyond the length of the file, causing a premature and silent EOF at some 32k boundary short of the length of the whole file.

I had considered needing to have the writes well-ordered, but I had not anticipated a need to ensure monotonically increasing read offsets in WriteTo.

puellanivis commented 3 years ago

:thinking: I’m thinking about whether I should work on a patch to get well-ordered, monotonically increasing reads into WriteTo first, before the other PR with the big rewrite to use the internal filexfer package.

That other PR doesn’t really touch any of the WriteTo or ReadFrom code, so I think it should be pretty safe, even if it means the other PR will need some adjustment after the fix is merged. I’ll try to get something out soon. I think there is good cause to prioritize this over the refactor right now, as it is causing serious issues.

puellanivis commented 3 years ago

https://github.com/pkg/sftp/pull/436: can people try this PR and see if it fixes the issue?

drakkan commented 3 years ago

Hopefully fixed in v1.13.1.

Coffeeri commented 9 months ago

#436: can people try this PR and see if it fixes the issue?

This still seems to be the case in v1.13.6. Some files are unexpectedly and silently cut off when using io.ReadAll. However, as noted earlier, using io.Copy solves this.

    fp, err := sftpClient.Open("somepath")
    ...
    bs, err := io.ReadAll(fp)
    // vs
    var buf bytes.Buffer
    _, err = io.Copy(&buf, fp)

puellanivis commented 9 months ago

Hm… trippy. So, io.ReadAll only uses Read from the io.Reader. And looking at the code, it seems like we’ve made the assumption that a short read means err = io.EOF, but this might coincide with a short packet that failed to read the whole way.

We’ll probably want to do some sort of check like if n != l { err = io.ErrUnexpectedEOF } else { err = io.EOF } 🤔
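
Spelled out as a sketch (the variable names are placeholders, not the library’s actual ones; the standard-library sentinel is io.ErrUnexpectedEOF):

    // n is the number of bytes the chunk actually returned, l the number requested.
    // A short chunk that did not come with an SSH_FX_EOF status should not look
    // like a clean end of file to callers such as io.ReadAll.
    if n != l {
        err = io.ErrUnexpectedEOF
    } else {
        err = io.EOF
    }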

puellanivis commented 9 months ago

Actually reopening this issue.

Coffeeri commented 9 months ago

And looking at the code, it seems like we’ve made the assumption that a short read means err = io.EOF, but this might coincide with a short packet that failed to read the whole way.

I tried to break on err = io.EOF, but it never reaches that point.

The EOF "error" is generated by err = normaliseError(unmarshalStatus(packet.id, s.data)) . I reach this point four times, does this even make sense? Should not only one request reach the EOF?

puellanivis commented 9 months ago

If the io.EOF error is coming from line 1155, then that is a proper EOF, according to the SFTP server.

DrHayt commented 4 months ago

OK, we just got hit by this issue when transitioning from a stock OpenSSH SFTP server to one from GoAnywhere.

We have no issues with the stock, but we do have issues that manifest exactly as described here when using the GoAnywhere server.

We are going to try to set up something to trace the SFTP communication so that we can compare and contrast, and maybe help solve the problem.

varunbpatil commented 1 month ago

I also have the same issue that @Coffeeri was describing. Specifically, io.Copy has no issues, but io.Read* reads incomplete data.

My use case includes reading the file twice:

  1. The first read is to calculate md5sum of the file contents.

    srcFile, err := sftpClient.Open("somepath")
    hash := md5.New()
    _, err = io.Copy(hash, srcFile)
    md5sum := hash.Sum(nil)
    srcFile.Close()

    This works correctly every time. It returns the correct md5sum.

  2. The second read is to actually upload the file to S3.

    srcFile, err := sftpClient.Open("somepath")
    _, err = uploader.Upload(ctx, &s3.PutObjectInput{
        Bucket: aws.String(s3Bucket),
        Key:    aws.String(s3Path),
        Body:   srcFile,
        ContentMD5: aws.String(<md5sum calculated in the previous step>),
    })

    This step fails many times before succeeding. Specifically, AWS returns BadDigest, which means the ContentMD5 we specified does not match the data we uploaded. Even more interestingly, the md5sum that AWS returns is equal to the md5sum of the file cut off at some 32K byte boundary (sometimes even 0 bytes).

    The reason I cannot use io.Copy here is that the files I'm uploading are large and do not fit into memory all at once (see the streaming sketch below).
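
One way to keep the upload streaming while still going through the io.Copy path that has been reliable here is an io.Pipe. A sketch, reusing the placeholder names above (contentMD5 stands in for the base64-encoded digest from step 1):

    pr, pw := io.Pipe()

    go func() {
        // io.Copy uses srcFile.WriteTo, the same path as the md5 pass above;
        // hand its error (or nil, i.e. a clean EOF) to the reading side of the pipe.
        _, err := io.Copy(pw, srcFile)
        pw.CloseWithError(err)
    }()

    _, err = uploader.Upload(ctx, &s3.PutObjectInput{
        Bucket:     aws.String(s3Bucket),
        Key:        aws.String(s3Path),
        Body:       pr,
        ContentMD5: aws.String(contentMD5),
    })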

puellanivis commented 1 month ago

io.Copy does not read the whole file into memory; instead it makes 32 KiB reads into a temporary internal buffer, which it then writes out. But I guess you mean copying the whole file into a bytes.Buffer before uploading.

What version are you using? Your use case should work otherwise. Also, you can use srcFile.Seek(0, 0) to reset the offset to the start of the file.
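
For the two-pass flow above, that looks roughly like this (a sketch reusing the earlier placeholder names, with error handling elided as in the earlier snippets):

    srcFile, err := sftpClient.Open("somepath")
    hash := md5.New()
    _, err = io.Copy(hash, srcFile)
    md5sum := hash.Sum(nil)

    // Rewind instead of re-opening the file for the upload pass.
    _, err = srcFile.Seek(0, io.SeekStart)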

varunbpatil commented 1 month ago

Hi @puellanivis ,

But I guess you mean, copying the whole file into a bytes.Buffer before uploading.

Yes, sorry, I meant that I was doing two separate reads because the file doesn't fit into memory, and that the first read (the md5sum calculation) works perfectly every time, while the second read (the S3 upload) returns different contents each time until it eventually succeeds after several retries.

I guess srcFile.Seek(0, 0) would also work instead of a separate open file for the second read, but I was more intrigued by io.Copy() being very consistent in returning the full contents of the file, compared to Read(), which is what I assume the AWS S3 upload is using, because I've also tried wrapping sftp.File in a custom struct and exposing only the Read() interface.

I'm currently on version 1.3.1. Haven't tried the latest version. Do you see something that might explain it for the older version? Thanks.

puellanivis commented 1 month ago

Hi @varunbpatil,

We’ve fixed a few things since then, so this could possibly be fixed in one of the later versions. I would recommend trying the newest version (there isn’t any reason not to update; it’s safe) and seeing if the issue persists.