rayantony / s3fs

Automatically exported from code.google.com/p/s3fs
GNU General Public License v2.0

Transport endpoint is not connected #148

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
No consistent reproduction yet. But I have seen this multiple times when at 
least two processes are uploading many files.

What is the expected output? What do you see instead?
At some point all requests to the mounted bucket start failing. When trying to 
change directory into the bucket, you get "Transport endpoint is not connected".

What version of the product are you using? On what operating system?
v1.35 on Ubuntu 10.10

Please provide any additional information below.
I was running the following processes in parallel (same mounted bucket, 
different subfolders):
wget -r -l1 -nd -Nc -A.png http://media.xiph.org/sintel/sintel-2k-png/
wget -r -l1 -nd -Nc -A.png http://media.xiph.org/BBB/BBB-1080-png/

The machine is running on EC2, so I was getting speeds of about 1-2 MB/s for 
each wget.

In the meantime, 4 other processes were occasionally writing to another bucket. 
That bucket has no problems. The machine has 3 mounts in total (each to a 
different bucket); the third was not in use.

Original issue reported on code.google.com by ptrm...@gmail.com on 28 Jan 2011 at 10:07

GoogleCodeExporter commented 9 years ago
If you need any additional information, or any debugging steps I could 
undertake: please let me know.

Original comment by ptrm...@gmail.com on 28 Jan 2011 at 10:07

GoogleCodeExporter commented 9 years ago
Hmm, just noticed that the wgets actually seem to have succeeded. All the 
files seem to be there. The processes were running overnight, and this morning 
I noticed the "Transport endpoint is not connected" error.

Original comment by ptrm...@gmail.com on 28 Jan 2011 at 10:11

GoogleCodeExporter commented 9 years ago
Ok, able to reproduce now!
1. Do one of the wgets (alternatively, this probably works with any s3fs-mounted 
directory that contains enough files)
2. cd to the directory
3. ls (waits ~2 seconds, then says "ls: reading directory .: Software caused 
connection abort")
4. ls (says "ls: cannot open directory .: Transport endpoint is not connected")

So the problem seems to be listing the contents of a directory with a large 
number of files (> 10,000 in this case). Could it be that s3fs does not deal 
well with directories containing more than 1,000 files? (That is the default 
for max-keys in a GET Bucket request, according to 
http://docs.amazonwebservices.com/AmazonS3/latest/API/.)
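
For reference, the GET Bucket listing is paged: each request returns at most 
max-keys entries, and a client has to keep re-requesting with a marker until 
IsTruncated comes back false. A rough sketch of that loop with libcurl (not 
the actual s3fs code; the bucket URL is a placeholder, and this assumes a 
publicly listable bucket so request signing can be skipped):

#include <curl/curl.h>
#include <string>

// Append the response body to a std::string.
static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    // Placeholder bucket; a private bucket would also need signed requests.
    const std::string base = "http://example-bucket.s3.amazonaws.com/";
    std::string marker;

    curl_global_init(CURL_GLOBAL_ALL);
    while (true) {
        std::string url = base + "?max-keys=1000";
        if (!marker.empty())
            url += "&marker=" + marker;   // real code would URL-encode the marker

        std::string body;
        CURL *curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);

        // The XML response holds up to 1,000 <Key> entries plus <IsTruncated>.
        if (body.find("<IsTruncated>true</IsTruncated>") == std::string::npos)
            break;

        // Use the last key of this page as the marker for the next page
        // (crude string scan; a real client parses the XML properly).
        size_t k = body.rfind("<Key>");
        size_t e = body.find("</Key>", k);
        if (k == std::string::npos || e == std::string::npos)
            break;
        marker = body.substr(k + 5, e - k - 5);
    }
    curl_global_cleanup();
    return 0;
}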

Original comment by ptrm...@gmail.com on 28 Jan 2011 at 10:19

GoogleCodeExporter commented 9 years ago
I have seen something similar too.

Initially, changing the timeouts fixed this, but when more files were synced 
to S3 the same thing started to happen again, consistently after 25-30s.

As a way of testing, try creating 3 or 4 thousand directories in an S3 bucket 
and then mounting the filesystem; do an ls on the mounted dir, or try to rsync 
it back, and it will error.
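
A minimal sketch of such a test, assuming the bucket is already mounted (the 
/mnt/s3 path and the directory count are just placeholders):

#include <sys/stat.h>
#include <cstdio>

int main() {
    char path[64];
    // Create a few thousand directories under the mount point, then run
    // "ls" or "rsync" against /mnt/s3 and watch the listing fail.
    for (int i = 0; i < 4000; ++i) {
        std::snprintf(path, sizeof(path), "/mnt/s3/testdir-%04d", i);
        if (mkdir(path, 0755) != 0)
            std::perror(path);
    }
    return 0;
}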

It seems that there's a timeout somewhere when listing large numbers of files 
or folders from S3 which isn't overridable by an option; as such, ls- and 
rsync-type operations fail.

Original comment by chrisjoh...@gmail.com on 31 Jan 2011 at 11:42

GoogleCodeExporter commented 9 years ago
Also seeing this behavior under Ubuntu 10.04 with fuse 2.8.4 and s3fs 1.35 built 
from source.

This problem seemed to start around s3fs 1.25 for us.

If there's anything I can do to help further diagnose the problem please let me 
know.

Original comment by Sean.B.O...@gmail.com on 2 Feb 2011 at 7:37

GoogleCodeExporter commented 9 years ago
For your information, I had the same issue and rolling back to s3fs 1.19 
(following Sean's comment) fixes the issue.

Original comment by yeoha...@gmail.com on 2 Feb 2011 at 10:19

GoogleCodeExporter commented 9 years ago
Same behavior on Debian Lenny with s3fs 1.35 / fuse 2.8.5 built from source.

Original comment by ben.lema...@gmail.com on 3 Feb 2011 at 4:06

GoogleCodeExporter commented 9 years ago
syslog debug output:

Original comment by ben.lema...@gmail.com on 3 Feb 2011 at 6:25

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for the debug_output -- I have a good guess at a change that will mitigate 
the issue. Since I haven't tried to reproduce the issue yet (I don't have a 
bucket with thousands of files), I'm looking for a volunteer to test the patch.

...any takers?

But I don't think the patch addresses the underlying issue, which is how 
directory listings are done. s3fs_readdir is probably the most complex piece 
of this code and probably needs some tuning.

Original comment by dmoore4...@gmail.com on 4 Feb 2011 at 12:52

GoogleCodeExporter commented 9 years ago
I'll definitely test, I've got a few buckets with ~30K files in them.

Original comment by ben.lema...@gmail.com on 4 Feb 2011 at 1:16

GoogleCodeExporter commented 9 years ago
Give this patch a try.

Original comment by dmoore4...@gmail.com on 4 Feb 2011 at 3:39

Attachments:

GoogleCodeExporter commented 9 years ago
I think that the patch resolves the "transport endpoint not connected" issue, 
but you'll still get input/output errors on listing a directory with A LOT of 
files.

...can someone confirm?

Original comment by dmoore4...@gmail.com on 4 Feb 2011 at 4:55

GoogleCodeExporter commented 9 years ago
I now get:
ls: reading directory .: Input/output error

Subsequently, ls'ing or cd'ing to another directory times out after about 20 
seconds:
cd: <directory name>: Input/output error

Original comment by ptrm...@gmail.com on 4 Feb 2011 at 9:34

GoogleCodeExporter commented 9 years ago
It appears that the main contributing factor to this issue is the number of 
files in a directory.  Having a large number of files in a single directory (I 
can't quantify "large" just yet, but it seems to be >= 1000) isn't illegal, but 
the HTTP traffic that it presents when doing a directory listing appears to 
cripple the file system.

I personally have never seen this issue with my buckets, but apparently the 
practices that I use are not everyone's practices.

I created a bogus (to me) test case to try and duplicate the issue.  The patch 
above resolves one of the initial failure points, but just pushes the issue 
back further.

An understanding of how directory listings are done in s3fs is necessary to 
implement a fix.  Briefly, a query is made to S3 asking for a listing of 
objects that match a pattern (reminder: there is no native concept of 
directories in S3; that's not how things are stored).  Then, for each of the 
matching objects, another query is made to retrieve its attributes. So a simple 
"ls" of an s3fs directory can generate A LOT of HTTP traffic.

It appears that Randy (the original author) attempted to address performance 
issues with this by using advanced methods in the CURL API.  Things are 
pointing to that area.

Fixing this may be an easy fix or a major rewrite, I do not know. As I 
mentioned, this section of code is one of the more complex sections in s3fs.cpp. 
One thought that I have is to scrap the multicurl stuff and replace it with a 
simpler brute-force algorithm.  It may fix the issue, but the trade-off will 
probably be performance. 

Original comment by dmoore4...@gmail.com on 4 Feb 2011 at 5:09

GoogleCodeExporter commented 9 years ago
Is it possible we're running into HTTP KeepAlive issues?

After enabling CURLOPT_VERBOSE (s3fs.cpp+586), output at failure includes a 
line for each HTTP request, "Connection #0 to host example.s3.amazonaws.com 
left intact".

It seems to make sense: modifying the 'max-keys' query parameter from 50 to 
1000 does allow more objects to be returned; however, the amount of time before 
a failure remains the same: ~25s

$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real    0m25.357s
user    0m0.000s
sys     0m0.000s

$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real    0m26.869s
user    0m0.000s
sys     0m0.000s

$# cd /mnt/cloudfront/images
$# time ls
ls: reading directory .: Input/output error
real    0m26.274s
user    0m0.000s
sys     0m0.000s

Original comment by ben.lema...@gmail.com on 4 Feb 2011 at 6:31

Attachments:

GoogleCodeExporter commented 9 years ago
It looks like KeepAlive may actually be the issue; forcing the connection to 
close after each request does fix the problem. However, I'm not convinced it's 
a solid solution, as it's quite slow :\

Attached is a patch for testing.
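
The gist of the change is a single libcurl option per handle; a rough sketch of 
the idea (not the attached patch verbatim, and the actual call sites in s3fs.cpp 
will differ):

#include <curl/curl.h>

// Sketch only: wherever a handle is created for an S3 request, tell libcurl
// to close the connection when the request finishes instead of keeping it
// alive for reuse.
CURL *create_request_handle(const char *url) {
    CURL *curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FORBID_REUSE, 1L);  // no keep-alive reuse after this request
    return curl;
}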

Original comment by ben.lema...@gmail.com on 4 Feb 2011 at 7:01

Attachments:

GoogleCodeExporter commented 9 years ago
Ben,

Great find.  I tested this on an EC2 instance (well connected to S3) by doing a 
directory listing of a bucket
that contains 10500 files.  No I/O error -- it was dog slow, but it worked:

% time ls -l

...

real    1m42.467s
user    0m0.160s
sys     0m0.670s

I was able to switch the issue back on by removing the CURLOPT_FORBID_REUSE option:

$ time ll
ls: reading directory .: Input/output error
total 0

Looks like a good fix to me.

...more testing:

On my home machine (not so well connected to the internet) I tried the same fix 
and did a directory listing of
the same 10500 file bucket.  Again, no I/O error and the listing completed, but 
it took nearly half an hour:

% date ; /bin/ls -l /mnt/s3/misc.suncup.org/ | wc -l ; date 
Fri Feb  4 18:06:04 MST 2011
10503
Fri Feb  4 18:31:14 MST 2011

I'll do a subversion commit shortly.  Thanks so much for your contribution.

Original comment by moore...@gmail.com on 5 Feb 2011 at 1:33

GoogleCodeExporter commented 9 years ago
Pieter, please test r308 and report back. Thanks.

Original comment by dmoore4...@gmail.com on 5 Feb 2011 at 2:05

GoogleCodeExporter commented 9 years ago
I'm currently out of the country and have limited internet access. I will be 
able to test next week!

Original comment by ptrm...@gmail.com on 9 Feb 2011 at 12:45

GoogleCodeExporter commented 9 years ago
Ok, r308 seems to solve the problem.

Original comment by ptrm...@gmail.com on 14 Feb 2011 at 12:23

GoogleCodeExporter commented 9 years ago

Original comment by dmoore4...@gmail.com on 14 Feb 2011 at 5:20

GoogleCodeExporter commented 9 years ago
I still have this problem on r368, running on Amazon EC2 (Amazon Linux AMI).

Original comment by K.Quiatk...@mytaxi.net on 9 Jul 2012 at 6:59

GoogleCodeExporter commented 9 years ago
I'm having the same issue with version 1.6.1 (r368) on CentOS 5.

Original comment by johnog...@gmail.com on 12 Jul 2012 at 10:24

GoogleCodeExporter commented 9 years ago
We are having the same issue with version 1.6.1 (r368) on Debian Squeeze. 
Reverting back to 1.6.0 and trying again; will post back with results.

Original comment by chris_r...@someones.com on 17 Oct 2012 at 4:36

GoogleCodeExporter commented 9 years ago
Am having the same issue with CentOS 5.8 32-bit. Have tried s3fs 1.61 and 1.35 
with the same outcome.

Original comment by gmason.x...@gmail.com on 30 Oct 2012 at 12:12

GoogleCodeExporter commented 9 years ago
We are also having this issue with 1.6.1 on an Amazon EC2 AMI.
Our directory doesn't have too many files, though (200 or so).

Original comment by ferran.m...@mmip.es on 16 Jan 2013 at 3:07

GoogleCodeExporter commented 9 years ago
I'm having the same issue on the latest s3fs when trying to upload to an S3 
bucket; it appears to be happening only with folders that contain a lot of files 
and folders.

Original comment by nickz...@gmail.com on 18 Jun 2013 at 4:46

GoogleCodeExporter commented 9 years ago
Hi, nickzoid

I could not reproduce this problem on s3fs (r449).
(I tested copying many files (over 5,000) to S3.)

So, if you can, please try r449 and test it with the "multireq_max" and 
"nodnscache" options.
** And if you still have the same problem, please post a NEW ISSUE with more 
information.

Thanks in advance for your assistance.

Original comment by ggta...@gmail.com on 20 Jun 2013 at 1:27