Open · gkiar opened this issue 6 years ago
Update: when I access files from the same system from which I launched the connection, the cache is correctly used (as shown here). However, when I access the data from within a Docker container with the mount and cache attached at the same location, the cache seems to be skipped and the output log is much more verbose (as shown here).
Do you have any idea why the cache is being ignored in the Docker container? Thanks!
With a shared cache directory, each process ignoring anything written by any of the others seems like it would be the expected behavior.
@sqlbot based on your response I'm not sure you understand the issue (or I need you to elaborate so that I can understand your response)... Regardless, allow me to clarify:
I do not think this is expected behaviour. The same thing has also been observed when the S3 mount was shared with the Docker container but the cache was not.
Yes, I think I did misinterpret this part:
when I access the data from within a Docker container with the mount and cache attached at the same location
I interpreted "with the mount" to mean each container had "the mount" because each container was individually running s3fs and mounting the bucket itself, and also using the same shared cache location.
Ah I understand your confusion - thanks. Now that we're on the same page, do you have any ideas as to why this may be happening/how to correct it? 😄
@gkiar @sqlbot I'm sorry for my late reply.
s3fs can cache the contents of objects as local files. Whether or not to use this cache is decided by the stats cache held in the internal memory of the s3fs process. The stats cache holds the stat information for each object (file), which is retrieved by a HEAD request on that object.
In other words, even if s3fs processes in different containers use a common cache directory, I think s3fs issues a GET request instead of using the cache because the stats information is missing.
Since the stats cache lives in process memory, each process decides independently whether to update an object's cache. If multiple s3fs processes hold stats caches with the same expiration time, cache data updates are not expected to occur on every access. However, the current version cannot be expected to behave optimally when the cache directory is shared with another process.
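To make the explanation above concrete, here is a minimal Python sketch (a hypothetical model, not s3fs's actual C++ implementation; all names are invented for illustration) of the two-level behavior described: a per-process in-memory stats cache gates whether the shared on-disk file cache is trusted, so a second process with an empty stats cache re-downloads the object even though the cached file is already on disk.

```python
import os
import tempfile

class S3fsProcessModel:
    """Toy model of one s3fs process: a shared disk cache plus a
    private in-memory stats cache (metadata from HEAD requests)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.stats_cache = {}  # per-process, lost when the process exits

    def read(self, key, etag):
        path = os.path.join(self.cache_dir, key)
        # The disk cache is trusted only if THIS process's stats cache
        # says the cached copy is still current.
        if self.stats_cache.get(key) == etag and os.path.exists(path):
            return "disk cache"
        # Otherwise fall back to a GET and (re)populate both caches.
        with open(path, "w") as f:
            f.write("object body")
        self.stats_cache[key] = etag
        return "GET"

cache_dir = tempfile.mkdtemp()
p1 = S3fsProcessModel(cache_dir)  # e.g. s3fs in container A
p2 = S3fsProcessModel(cache_dir)  # e.g. s3fs in container B, same cache dir

r1 = p1.read("obj", "v1")  # "GET": nothing cached yet
r2 = p1.read("obj", "v1")  # "disk cache": stats cache confirms the file
r3 = p2.read("obj", "v1")  # "GET": file is on disk, but p2 has no stats
```

The third read shows the reported symptom: sharing only the cache directory is not enough, because the decision to use it depends on state that is not shared.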
@ggtakec I think you are making the same assumption that I was about this issue. @gkiar seems to be reporting that only a single s3fs process exists, and the single mounted filesystem is accessible from multiple containers (which would mean that whether the cache directory is shared doesn't actually have any significance).
@sqlbot and @gkiar I'm sorry, I made the same misunderstanding. I will look into the problem in more detail. Regards,
Also noticing this issue.
V1.84
/usr/bin/s3fs mybucket -o use_cache=/tmp -o allow_other -o iam_role=myrole -o mp_umask=022 -o multireq_max=5 -o multipart_size=20 -o ensure_diskfree=10000 /mybucket
I have an inotify watch on the directory that immediately processes each file once it has finished being created, and I see that it is being read from S3 over the network rather than from the cache.
Running natively on Ubuntu 16.04 (+required dependencies). No containers, single process.
Additional Information
s3fs --version
1.83
pkg-config --modversion fuse
2.9.7
uname -r
16.7.0
s3fs command line used
Details about issue
I am running the above command in a terminal session, and then in another I launch a tool which processes some of the data on my S3 bucket. The first time I run my tool, the accessed data is downloaded to the cache and processing occurs as expected. However, subsequent attempts to run processes on the same data do not use the cached version, but re-download it, effectively ignoring the cache. Thanks for your help!