skoobe / riofs

Userspace S3 filesystem
GNU General Public License v3.0

Cache thrashing when multiple users try to read newly added file at about the same time #136

Open ThePythonicCow opened 7 years ago

ThePythonicCow commented 7 years ago

I have been working (when I had the time) on the following problem over the last few months. At some point, I expect to propose some patches to address it.

In my use case for riofs, I upload large (often 10 to 1000 megabyte) files directly to my Amazon AWS store, and then I announce the availability of the files to a large user community, with links to my riofs download and caching server that is a front end for that AWS store. Soon after announcing, multiple users, sometimes dozens at a time, will try to download the latest announced file.

If multiple users try to download the same newly announced file at nearly the same time, then my riofs log file fills with many "invalidating local cached file!" lines, due to "Local and remote file sizes do not match" or to "Failed to get local MD5 sum".

What was happening was that a second user would come asking for the same file that another user had already started to download. The "consistency checking" in src/file_io_ops.c:fileio_read_on_head_cb() would notice that the already partially cached file did not (yet) have a size or MD5 sum matching what Amazon AWS reported for that file, and so would invalidate the partially filled cache for that file. Since the (more difficult) code to compute the multi-part ETag for large files has not been added to riofs, and might never be, the MD5 sum check has no chance of matching on such large files; and if a second download request shows up for a file that is only partially cached, the size check fails as well.
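To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of check described above. The struct and function names are invented for illustration only; this is not the actual riofs code:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the cache and HEAD-response state;
 * these are NOT the actual riofs structures. */
typedef struct {
    long long local_size;    /* bytes written to the local cache so far */
    char local_md5[33];      /* hex MD5 of the local cache file, "" if not yet computed */
} cache_entry_t;

typedef struct {
    long long remote_size;   /* Content-Length reported by the HEAD request */
    char remote_etag[80];    /* ETag header; "...-84" style for multi-part uploads */
} head_response_t;

/* Returns true if a check like the one described above would discard the cached copy. */
static bool should_invalidate(const cache_entry_t *c, const head_response_t *h)
{
    /* A second reader arriving mid-download sees a short local file,
     * so the size check fails and the partial cache is thrown away. */
    if (c->local_size != h->remote_size)
        return true;

    /* For multi-part uploads the ETag is not the MD5 of the whole object
     * (it is an MD5 of the per-part MD5s, suffixed with "-<parts>"), so a
     * plain MD5 comparison can never succeed on such files. */
    if (c->local_md5[0] == '\0' || strcmp(c->local_md5, h->remote_etag) != 0)
        return true;

    return false;
}

int main(void)
{
    /* First reader has cached 40 MB of a 700 MB object when a second reader arrives. */
    cache_entry_t partial = { 40LL * 1024 * 1024, "" };
    head_response_t head = { 700LL * 1024 * 1024, "9bb58f26192e4ba00f01e2e7b136bbd8-84" };

    printf("invalidate partial cache? %s\n",
           should_invalidate(&partial, &head) ? "yes (thrashing)" : "no");
    return 0;
}
```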

I ended up racking up about fifty times more AWS download fees in a few days than it would have cost to download all the big files into the riofs cache one time. I have had to stop adding and announcing more big files, until I could resolve this issue of severe cache thrashing, as the AWS download fees were exceeding my budget for this project.

I now have a fix for this working (as of yesterday), using AWS ETags. I anticipate that this fix will significantly improve the "consistency checking" in src/file_io_ops.c:fileio_read_on_head_cb().
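Roughly, the idea is to remember which remote ETag a cache entry was populated from, and to keep a still-filling entry as long as that ETag is unchanged, rather than comparing the size or MD5 sum of an incomplete local file. A minimal sketch of that idea (the names below are illustrative only, not the actual patch):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cache entry that records the ETag it was populated from;
 * not the actual riofs data structure. */
typedef struct {
    char etag_when_cached[80]; /* ETag reported by S3 when this entry started filling */
    long long bytes_cached;    /* how much of the object is cached so far (may be partial) */
} etag_cache_entry_t;

/* Invalidate only when the object on S3 has actually changed, i.e. when its
 * current ETag differs from the one this entry was populated from. A partially
 * filled entry with a matching ETag is kept, so concurrent readers of a large,
 * newly added file no longer throw each other's work away. */
static bool should_invalidate_by_etag(const etag_cache_entry_t *c, const char *remote_etag)
{
    return strcmp(c->etag_when_cached, remote_etag) != 0;
}

int main(void)
{
    etag_cache_entry_t partial = { "9bb58f26192e4ba00f01e2e7b136bbd8-84", 40LL * 1024 * 1024 };

    printf("same object, still downloading: invalidate? %s\n",
           should_invalidate_by_etag(&partial, "9bb58f26192e4ba00f01e2e7b136bbd8-84") ? "yes" : "no");
    printf("object replaced on S3: invalidate? %s\n",
           should_invalidate_by_etag(&partial, "1c8bfd1f4b2b5d2f2a1f0e9d8c7b6a55") ? "yes" : "no");
    return 0;
}
```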

My current plan is to complete the initial development and testing of the patch set providing this work, and to upload it to my clone of riofs, at https://github.com/ThePythonicCow/riofs . Then I will offer it to Paul Jonkins (https://github.com/wizzard) to pull into the main Riofs repository https://github.com/wizzard/riofs, though this may well be a significant enough change that he will prefer to let it "bake in the oven" for a while, until he and others find time to consider it carefully.

Initially, I had expected that I would need to make this change "persistent", preserving the contents of a local riofs cache across riofs restarts, in order to avoid the AWS charges to rebuild my riofs cache every time I restarted riofs. But I am now of the view that simply making proper use of ETags will avoid the thrashing and the repeated "invalidating local cached file!" messages, and will thereby reduce AWS download charges enough for my needs.

ThePythonicCow commented 7 years ago

I have now finished this change and uploaded it to my clone of riofs, at https://github.com/ThePythonicCow/riofs.

I will issue a pull request.

However, this is a more ambitious change than some, so, as noted above, it would not surprise me if Paul Jonkins delays accepting it until he has time to look at it more closely.

Whether and when Jonkins accepts this (or asks for further changes) matters little to me, as I already have this new version, which uses ETags to decide when to invalidate the cache, running in the application that matters to me.