
s3fs re-downloading data rather than checking the cache #705

Open gkiar opened 6 years ago

gkiar commented 6 years ago

Additional Information

| Field | Value |
| --- | --- |
| `s3fs --version` | 1.83 |
| `pkg-config --modversion fuse` | 2.9.7 |
| `uname -r` | 16.7.0 |
| Distribution | Mac OSX Sierra + Ubuntu 16.04 (in Docker, for data access) |

s3fs command line used

s3fs mybucket /data/mymount/ -o passwd_file=/etc/awspasswd,umask=0007,use_cache=/data/cache -d -d -f

Details about issue

I am running the above command in a terminal session, and then in another I launch a tool that processes some of the data on my S3 bucket. The first time I run my tool, the accessed data is downloaded to the cache and processing occurs as expected. However, subsequent attempts to run processes on the same data do not use the cached version but re-download it, effectively ignoring the cache. Thanks for your help!
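A minimal sketch of the reproduction, assuming a hypothetical file name and the usual `use_cache` layout (s3fs keeps cached objects under `<cache dir>/<bucket>/<object path>`):

```bash
# First access: the data is downloaded and a copy lands in the local cache.
md5sum /data/mymount/sub-01/anat.nii.gz

# The cached copy is now visible on disk (hypothetical path).
ls -l /data/cache/mybucket/sub-01/anat.nii.gz

# Second access: expected to be served from the cache,
# but observed to trigger a fresh download instead.
md5sum /data/mymount/sub-01/anat.nii.gz
```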

gkiar commented 6 years ago

Update: when I access files from the same system from which I've launched the connection, the cache is correctly used (as shown here). However, when I access the data from within a Docker container with the mount and cache attached at the same locations, the cache seems to be skipped and the output log is much more verbose (as shown here).
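Concretely, the container is launched along these lines (a sketch only: the image and tool names are placeholders, and the `:rslave` bind-propagation flag is one common way to make a host FUSE mount visible inside a container):

```bash
# Host already has: s3fs mount at /data/mymount, cache at /data/cache.
docker run --rm \
  -v /data/mymount:/data/mymount:rslave \
  -v /data/cache:/data/cache \
  my-processing-image:latest \
  my-tool /data/mymount/sub-01/anat.nii.gz
```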

Do you have any idea why the cache is being ignored in the Docker container? Thanks!

sqlbot commented 6 years ago

With a shared cache directory, for each process to ignore anything written by any of the others seems like it would be the expected behavior.

gkiar commented 6 years ago

@sqlbot based on your response I'm not sure you understand the issue (or I need you to elaborate in order to understand your response).... Regardless, allow me to clarify:

  1. I have mounted the bucket via s3fs on my host system
  2. I run a task in a Docker container, ensuring to share both the cache and mount directory
  3. Each subsequent attempted access of a file (whether in the same Docker container or in containers launched after the files have been downloaded) ignores the cached copy and re-downloads the requested files, despite them appearing in the cache.

I do not think this would be expected behaviour? This has also been observed when the S3 mount was shared with the Docker container but the cache was not.
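One quick way to see the re-download, assuming a reasonably large test file (name hypothetical): time repeated reads, since a genuine cache hit should run at near-local-disk speed.

```bash
time cat /data/mymount/sub-01/anat.nii.gz > /dev/null  # first read: network download
time cat /data/mymount/sub-01/anat.nii.gz > /dev/null  # a cache hit should be much faster
```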

sqlbot commented 6 years ago

Yes, I think I did misinterpret this part:

when I access the data from within a Docker container with the mount and cache attached at the same location

I interpreted "with the mount" to mean each container had "the mount" because each container was individually running s3fs and mounting the bucket itself, and also using the same shared cache location.

gkiar commented 6 years ago

Ah I understand your confusion - thanks. Now that we're on the same page, do you have any ideas as to why this may be happening/how to correct it? 😄

ggtakec commented 6 years ago

@gkiar @sqlbot I'm sorry for my late reply.

s3fs can cache the contents of objects as local files. Whether to use this local cache for a given object is decided by the stat cache held in the internal memory of the s3fs process. The stat cache holds each object's stat information, as retrieved by a HEAD request.

In other words, even if s3fs processes in different containers use a common cache directory, I think s3fs is issuing a GET request instead of using the local cache because the stat information is not present in that process's memory.

Since the stat cache lives in process memory, each process decides on its own whether to refresh an object's cached data. If multiple s3fs processes hold stat cache entries with the same expiration, a cache update will not necessarily happen on every access. Either way, the current version cannot be expected to behave optimally when the cache directory is shared with another process.
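As an aside, the in-memory stat cache that drives this decision can be tuned at mount time. `stat_cache_expire` and `max_stat_cache_size` are existing s3fs options; the values below are arbitrary examples, and per the discussion above, tuning them may not help when the cache directory is shared between processes:

```bash
# stat_cache_expire: seconds before an in-memory stat entry expires.
# max_stat_cache_size: maximum number of stat entries kept in memory.
s3fs mybucket /data/mymount/ \
  -o passwd_file=/etc/awspasswd \
  -o use_cache=/data/cache \
  -o stat_cache_expire=600 \
  -o max_stat_cache_size=100000
```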

sqlbot commented 6 years ago

@ggtakec I think you are making the same assumption that I was about this issue. @gkiar seems to be reporting that only a single s3fs process exists, and the single mounted filesystem is accessible from multiple containers (which would mean that whether the cache directory is shared doesn't actually have any significance).

ggtakec commented 6 years ago

@sqlbot and @gkiar I'm sorry, I made the same misunderstanding. I will look into the problem in more detail. Regards,

Oldsouldier commented 6 years ago

Also noticing this issue on v1.84:

/usr/bin/s3fs mybucket -o use_cache=/tmp -o allow_other -o iam_role=myrole -o mp_umask=022 -o multireq_max=5 -o multipart_size=20 -o ensure_diskfree=10000 /mybucket

I have an inotify watch on the directory that processes each file immediately after it has finished being created, and I see that it is being read from S3 over the network rather than from the cache.

Running natively on Ubuntu 16.04 (+required dependencies). No containers, single process.
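For context, the watch described above would look roughly like this (a sketch only; `process-file` stands in for the actual tool):

```bash
# close_write fires once a file opened for writing is closed, i.e. fully written.
inotifywait -m -e close_write --format '%w%f' /mybucket |
while read -r path; do
  process-file "$path"   # placeholder; this read triggers the S3 GET in question
done
```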