hasse69 / rar2fs

FUSE file system for reading RAR archives
https://hasse69.github.io/rar2fs/
GNU General Public License v3.0

rar2fs doesn't cache decompressed/decrypted data #169

Open fdegros opened 3 years ago

fdegros commented 3 years ago

rar2fs doesn't seem to cache decompressed data, which is problematic when the application reading a file inside a mounted RAR accesses that file in a non-sequential way (i.e. random access).

Reproduction steps:

  1. Create a RAR containing a large video file (e.g. an MP4 file). Make sure that the RAR archive compresses the MP4 and doesn't just store it without compression (a sample command is shown below).
  2. Mount the created RAR.
  3. Try playing the MP4 file from the mounted RAR, using a video player of your choice.
  4. Try jumping to different points in the video.

Most likely the video player won't be able to play the video at step 3, or it will freeze when you try to jump to a different point of the video at step 4.

Note that if the MP4 is simply stored without compression in the RAR at step 1, then everything works fine and the video player is able to play and jump to different points of the video.
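For reference, step 1 can be done with the rar command-line tool; the -m switch selects the compression level (0 = store, 5 = best compression):

```sh
rar a -m5 compressed.rar video.mp4   # compressed entry: triggers the problem
rar a -m0 stored.rar video.mp4       # stored entry: plays and seeks fine
```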

hasse69 commented 3 years ago

Thanks for the issue report. I took the liberty of changing the subject since I believe it was a typo.

The limitation for compressed/encrypted archives is clearly stated in the README. Because the RAR algorithm does not allow random access and depends on blocks being extracted in sequential order, this is close to impossible to achieve. Imagine the case of, e.g., a 4GB archive in which you try to jump to the end: you would need to wait for the entire file to be extracted before it could be read properly. Even if the user could accept such a delay/freeze, it is also a question of whether the tool used would.

To mitigate some of the problematic use-cases there is a history buffer (call it a cache if you like), which by default is set to half the size of the I/O buffer. Currently this buffer is very small, but it can be tweaked using dedicated mount options. Handling completely random access is, as I see it, not feasible though.
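For illustration, tuning would look something along these lines (a hypothetical invocation; check the README and `rar2fs --help` for the exact option names and valid ranges):

```sh
# Hypothetical example: a 256 MiB I/O buffer with 75% of it kept as
# history behind the current position.  Verify the option names
# (--iob-size / --hist-size) and their limits against your version.
rar2fs --iob-size=256 --hist-size=75 /path/to/archives /mnt/rar
```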

hasse69 commented 3 years ago

So, just to give some context on how this currently works for compressed/encrypted archives: extraction is done on-the-fly (in a separate process) into an intermediate I/O buffer. It is from this I/O buffer that read requests pick up their data (offset + size). But since RAR archives in these modes cannot go "backwards", there is a sliding window placed inside the I/O buffer that can deal with such requests provided the window is not exceeded (negative offset + size). The history buffer is, as stated, by default set to half the size of the I/O buffer, and the default size of that buffer is 8M (thus very small). So the amount of history it can deal with by default is 4M behind the current file position.
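As a rough illustration only (a simplified sketch, not the actual rar2fs code), the decision of whether a read request can be served by the buffer boils down to something like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified sketch, not actual rar2fs code: data is available between
 * 'pos - hist_size' (history kept behind the current file position,
 * 4M by default) and 'extracted' (how far the on-the-fly extraction
 * has progressed).  Anything below that range has already been thrown
 * away; anything above it does not exist yet and must be waited for. */
static bool window_can_serve(uint64_t offset, size_t size,
                             uint64_t pos, uint64_t extracted,
                             size_t hist_size)
{
    uint64_t low = pos > hist_size ? pos - hist_size : 0;

    if (offset < low)                /* fell out of the history window */
        return false;
    if (offset + size > extracted)   /* not yet extracted              */
        return false;
    return true;
}
```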

For context, the reason for this approach is that rar2fs was originally designed for smaller embedded devices with very limited memory resources. You can tweak the buffer to any preferred size, and you can also tweak how much of that buffer is marked as history, from 10% up to 90% AFAIR. So this is something each user needs to tune depending on the most common use-case, balanced against system resource usage.

Of course we could add some more traditional (optional) caching behaviour, in which the history (cache) grows with the amount of data consumed by the reader, or with how much data has currently been extracted. But that would obviously mean that for large media files the required memory could quickly drain the available system resources. It would also be very hard to control when this data is no longer needed; whatever approach you take would probably be the wrong one under some scenario. I don't think anyone would like to keep a cache of 4G+ of data, and if it is kept only for the duration of open/close, then what is the point?

I am also not 100% sure how e.g. the Linux page cache fits into this picture. FUSE should be able to skip calling down to the user-mode file system, but if it did, rar2fs would not be able to maintain the current file position etc., and it would become chaos if some read(s) randomly got through. For that reason rar2fs enforces that caching at that level is disabled, so that each and every access reaches the file system. For uncompressed archives this matters less, since they do not depend on sequential reads at all.
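For reference, this is roughly the general FUSE mechanism involved (a minimal sketch, not the actual rar2fs code):

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>

/* Minimal sketch of the general FUSE mechanism (not actual rar2fs
 * code): setting direct_io in open() bypasses the kernel page cache
 * for this file, so every read() is forwarded to the user-mode file
 * system and the sequential extraction state stays consistent. */
static int sketch_open(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    fi->direct_io  = 1;   /* do not let the page cache answer reads     */
    fi->keep_cache = 0;   /* do not reuse stale pages from a prior open */
    return 0;
}
```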

A lot of thought/time has been spent on this particular issue, and it always boils down to the same result: whatever we do, something will not work. Also, all proposals so far fall short due to the simple fact that you cannot read data that does not yet exist, and it might take a substantial amount of time before it becomes available. By that time either the user or the tool in question has most likely already given up on you.

I do however welcome any ideas in this area.

hasse69 commented 2 years ago

Any updates here? Otherwise I will close this issue due to inactivity.

fdegros commented 2 years ago

This is still an issue. For example, see ChromeOS bug 1313344.

hasse69 commented 2 years ago

I am afraid I have no access to that resource. I understand this is still an issue, but it is a limitation that has been present since day one. You could cache what has been read, but you cannot cache what has not; the RAR algorithm is simply not designed for random access.

fdegros commented 2 years ago

Right.

I had to solve the same problem for mount-zip, and it took quite a few iterations to get the right balance between memory usage, disk usage, and getting it to work with huge files even on 32-bit devices.

Maybe there is something that could be transposed and reused in rar2fs. See the Reader class and subclasses, including CacheFileReader.

hasse69 commented 2 years ago

Yes, caching is doable, and possibly there might be, as you say, some good enough and fair algorithm for resource usage. But that is not really my concern here; my concern is that to be able to cache something from a compressed/encrypted archive you first must have been able to read it. If you access the archive before it has been entirely read, there is nothing to be cached. Believe me, I have looked into this before, applying things like read-ahead and data pre-fetch, but the problem remains that archive data cannot be properly cached unless it has been read in whole at least once. And waiting for that to happen means stalling completely if the random access jumps are significant. It was basically useless for e.g. a 4GB media file in which a user tried to jump towards the end. If the user did not give up, whatever was used to extract the data timed out.

hasse69 commented 2 years ago

So, do not get me wrong, I am not saying something cannot be done, I am only saying I need to carefully think about what the options are here. And in fact, there might actually be a rather simple way forward. All the feedback I have provided here so far has been related to the fact that there are basically two very specific use-cases in rar2fs: the primary use-case, in which users can mount an entire directory tree, and what I would call a secondary use-case, which is the support for mounting individual/single archives. Normally I never spend much time on the latter, because if problems are resolved for the primary use-case, most likely they are solved also for the secondary one. Thus all efforts I have made so far have been made with the primary use-case in mind only. But possibly we might have a slightly different situation here.

As I understand your use-case in ChromeOS, it is the secondary one we are talking about and not the primary one, right? That might actually give us some room for trying out a few poor-man's solutions here. A user mounting an archive under the file manager in ChromeOS very likely does that with the intent to also access data from it, and when unmounting, the user would not expect the data to be available any more "offline", so to speak. That gives us a few distinct points in time where we can pre-fetch and cache data, and also a pretty good clue about when the cache is no longer needed and can be discarded.

Thus the simplistic idea I have in mind would be for rar2fs to (optionally, of course) start extraction of the mounted archive as soon as it is mounted, with the extracted data saved externally rather than in memory. Any access towards the archive would be redirected to the already extracted data, and if the data is not yet available, it needs to stall, waiting for it to arrive. That means that the file manager overlay (or whatever it is called) must be able to handle such "stalls", as well as whatever application is in need of the data itself. This is something completely outside the scope of rar2fs and thus not something it can guarantee either.
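To make the idea concrete, a hypothetical sketch (not a patch; names and details are made up) of what the read path could look like:

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical sketch of the "extract at mount time" idea, not a
 * patch: a background extraction thread appends decompressed data to
 * a cache file on disk and advances 'extracted'; each read request
 * stalls until the range it wants has been written, then serves it
 * with pread() (random access against the cache file is fine). */
struct extract_cache {
    int             fd;         /* cache file on disk, not in memory */
    uint64_t        extracted;  /* bytes written by the extractor    */
    int             done;       /* extraction finished (or failed)   */
    pthread_mutex_t lock;
    pthread_cond_t  more;       /* signalled when 'extracted' grows  */
};

static ssize_t cache_read(struct extract_cache *c, char *buf,
                          size_t size, uint64_t offset)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done && c->extracted < offset + size)
        pthread_cond_wait(&c->more, &c->lock);   /* the "stall" */
    pthread_mutex_unlock(&c->lock);

    return pread(c->fd, buf, size, (off_t)offset);
}
```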

So the implementation would be rather simple, but then what are the drawbacks of this approach? I see only a couple, and possibly they can be accepted/overlooked:

Feel free to chime in here with any questions/doubts etc.

There is another "issue" here as well, and that is related to workload and priority. Is this issue more important to solve than e.g. the issue related to the primary use-case and improving the file cache propagation over potentially slow network connections? The fact remains that I currently have a rather limited amount of time to spend on this project :( I will of course try to do my best to solve both, but one has to go before the other. I need to look into which of the two could be faster to implement.

EDIT: Would it be OK if I attached a draft patch here for you to verify whether a potential solution would be sufficient for your use-case in ChromeOS?