nagyistoce / rar2fs

Automatically exported from code.google.com/p/rar2fs

Feature request: Caching to files #6

Open · GoogleCodeExporter opened this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hi Hans,

I was thinking that caching to files might speed things up a little.
The idea is:
for each RAR archive, build an (md5) hash per file and cache that hash to a
file (.$someprefix-$nameofrar.cache). When reading directories, check for the
cache file; if it exists, verify the md5sum of the file it represents, and if
it matches, use the info from the cache file instead of reading the RAR file
again.

Cheers,
Joris

Original issue reported on code.google.com by wiebel@gmail.com on 30 May 2011 at 9:26
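
For illustration, a minimal sketch in C of the validation step being proposed
(the cache-file layout and helper names are hypothetical, and OpenSSL's MD5 is
used only as an example digest; none of this is existing rar2fs code):

    /* Hypothetical md5-based cache validation (not rar2fs code).
     * The MD5_* API is deprecated in OpenSSL 3 but still available.
     * Build with: cc sketch.c -lcrypto */
    #include <openssl/md5.h>
    #include <stdio.h>
    #include <string.h>

    /* Compute the MD5 digest of a whole file; returns 0 on success. */
    static int md5_of_file(const char *path,
                           unsigned char digest[MD5_DIGEST_LENGTH])
    {
        unsigned char buf[8192];
        size_t n;
        MD5_CTX ctx;
        FILE *f = fopen(path, "rb");
        if (!f)
            return -1;
        MD5_Init(&ctx);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            MD5_Update(&ctx, buf, n);
        MD5_Final(digest, &ctx);
        fclose(f);
        return 0;
    }

    /* Return 1 if the cache file's stored digest matches the archive,
     * meaning the cached listing may be used instead of re-parsing. */
    static int cache_is_valid(const char *rar_path, const char *cache_path)
    {
        unsigned char current[MD5_DIGEST_LENGTH];
        unsigned char stored[MD5_DIGEST_LENGTH];
        FILE *f;

        if (md5_of_file(rar_path, current))
            return 0;
        f = fopen(cache_path, "rb");
        if (!f)
            return 0;            /* no cache yet: parse the archive */
        /* Assume the digest sits in the first 16 raw bytes. */
        if (fread(stored, 1, sizeof stored, f) != sizeof stored) {
            fclose(f);
            return 0;
        }
        fclose(f);
        return memcmp(current, stored, sizeof current) == 0;
    }

Note that the whole archive still has to be hashed on every check, which is
the cost discussed below.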

GoogleCodeExporter commented 9 years ago
Hello. I agree that the idea seems very tempting.

I have had similar thoughts before, about having the contents of each RAR
archive cached in some way. There is already a cache for each file inside the
archive, but nothing that describes the archive itself. However, there were
some reasons why I left this idea behind. There are in fact some options you
should already be using to speed things up: --exclude and --seek-length. Using
these options speeds up loading a folder and the contents of archives
(especially large volumes) dramatically. Read the rar2fs.1 man page for
details.
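
For example, an invocation along these lines (the mount points and exclude
list are placeholders; see rar2fs.1 for the exact option syntax):

    # Probe only the first volume file of multi-part archives, and skip
    # folders known to contain no RAR files (example paths only):
    rar2fs --seek-length=1 --exclude="backup;tmp" /mnt/archives /mnt/rar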

Since the original target for rar2fs was in fact small embedded Linux systems
with no HDD, and writing a lot of information to the underlying flash file
system did not seem like a good idea, the cache needs to be memory resident.
Memory is another limited resource in most typical embedded systems.

Having the cache stored on disk is not going to improve things that much. The
cache file still needs to be opened and closed for reading, and its contents
parsed. Also, the md5 checksum must be recalculated on every access to compare
it with the on-disk copy. It would of course remove the need for the
--seek-length option. On the other hand, keeping a cached copy of each
archive's file structure in memory should be a lot faster, but at the cost of
resources.

I am not saying it is a bad idea. Not at all. But there must be a balance
between speed and resource needs. The md5 approach is good in that it lets
cached information be invalidated when the archive is modified. There is
always the option to make this functionality exactly that: optional. Small
embedded systems would then simply be advised not to use it.
Remember, though, that today's cache makes loading of archives (almost)
equally fast over time. A fully cached approach will carry some initial cost
while the cache is populated, especially if you drop the --seek-length option.

If you are eager to get something going quickly, please feel free to implement
this on your own in a branch off the trunk, and we can merge it back once you
feel it is ready.

Original comment by hans.bec...@gmail.com on 30 May 2011 at 10:52

GoogleCodeExporter commented 9 years ago
Hey,

Thanks for the reply.
I know the caching as it is works well. The thing is, though, that after a
reboot or power-down this cache is gone.
I am not personally very eager for this to be implemented any time soon, but I
thought it might be handy. The options you mentioned did indeed help a lot.
I'm afraid I'm not a C coder, so there is nothing I can do to help build it,
sorry.

Cheers,
Joris

P.S.: I've submitted a FreeBSD port for rar2fs (and libunrar 4.0.7) :)

Original comment by wiebel@gmail.com on 30 May 2011 at 1:15

GoogleCodeExporter commented 9 years ago
Thinking about it some more, I do not really see the md5 approach as feasible.
After all, you need to compare the cached md5 checksum with something, and
computing that something would have to be done at every access.

I think it is better, then, to actually spend some memory (optionally) and add
an additional cache for the archive file paths, similar to what is done today
for the files inside the archive. Each entry in the cache would link to the
already extracted archive file structure, which could then be fed to the
parser instead of the file handle. Today rar2fs does not subscribe to file
system events, so modifications to RAR archives pass unnoticed. I think that
is a reasonable limitation, and it would not change by introducing a second
cache. With this second cache (or first, really), some disk/network IO will be
avoided since the file header no longer needs to be read.
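
As a rough sketch of the shape such a path-keyed cache could take (the struct
members, names, and hashing below are purely illustrative, not actual rar2fs
code):

    /* Illustrative archive-level cache keyed on the RAR path.
     * None of these names come from rar2fs itself. */
    #include <stdlib.h>
    #include <string.h>

    #define ARCHIVE_CACHE_BUCKETS 257

    struct archive_entry_list;              /* the already-parsed structure */

    struct archive_cache_node {
        char *rar_path;                     /* key: path to the archive   */
        struct archive_entry_list *entries; /* value: parsed contents     */
        struct archive_cache_node *next;    /* chaining on collision      */
    };

    static struct archive_cache_node *buckets[ARCHIVE_CACHE_BUCKETS];

    /* Simple djb2 string hash mapped onto the bucket array. */
    static size_t hash_path(const char *s)
    {
        size_t h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % ARCHIVE_CACHE_BUCKETS;
    }

    /* Return the cached parse result for an archive, or NULL on a miss. */
    struct archive_entry_list *archive_cache_get(const char *rar_path)
    {
        struct archive_cache_node *n;
        for (n = buckets[hash_path(rar_path)]; n; n = n->next)
            if (strcmp(n->rar_path, rar_path) == 0)
                return n->entries;
        return NULL;
    }

    /* Remember a parse result so later lookups can skip the file header. */
    void archive_cache_put(const char *rar_path,
                           struct archive_entry_list *entries)
    {
        size_t h = hash_path(rar_path);
        struct archive_cache_node *n = malloc(sizeof *n);
        if (!n)
            return;                 /* out of memory: just skip caching */
        n->rar_path = strdup(rar_path);
        n->entries = entries;
        n->next = buckets[h];
        buckets[h] = n;
    }

A lookup hit hands the already extracted structure to the parser directly;
only a miss falls back to opening the archive and reading its header.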

Original comment by hans.bec...@gmail.com on 30 May 2011 at 1:18

GoogleCodeExporter commented 9 years ago

Original comment by hans.bec...@gmail.com on 1 Jun 2011 at 8:48