ibizaman / pcachefs

FUSE filesystem that presents a mirror of other filesystems, with transparent caching.
Apache License 2.0

pCacheFS freezes during a file copy to cache... #4

Open hradec opened 7 years ago

hradec commented 7 years ago

I'm trying to use pCacheFS with SSHFS, but I'm experiencing a lot of freezing when opening large files.

With plain SSHFS, filesystem access works normally while a large file is being opened... no problem!

But using pCacheFS over it, everything stops until the file is opened (finished transfer from the remote ssh server). Where "everything" means even a "ls" of a folder in the cached filesystem stays frozen until the transfer is finished.

I'm willing to fix it, so if you guys could explain the code a bit, I can give it a go!

I'm also interested in adding parallel caching of different parts of a file, to increase throughput over SSHFS or any other filesystem used over a WAN.

This would be particularly useful with Google Cloud VMs, since SSHFS seems to be limited to 1.5mb/sec on a single connection.

With multiple transfers over SSHFS, the overall speed is limited only by the remote server's upload bandwidth!

thanks lots! -H

ibizaman commented 7 years ago

Thanks for your interest in this. I in fact scavenged this project when it was threatened with being taken down from http://code.google.com/p/pcachefs/.

I modified the code a bit to make it simpler, although I really think we can do better: simplify the code, write better utilities, rename things, etc.

The main class is PersistentCacheFs. The tests I added are in test/test_all.py (the other ones aren't updated; in fact, I would remove them). To understand what's going on, I would run pcachefs locally with the debug flag; look here for how to specify the directories.

If you can add tests in there for your use cases and add a fix, I'm down for it! Having another thread do the actual caching is definitely necessary. I think that would be the first step toward making ls not hang and toward parallel caching of multiple parts of a file. It would be nice to have an argument specifying how many parallel threads the user wants.
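To make the idea concrete, here is a minimal sketch of chunked, parallel copying driven from a background thread. It is not code from pcachefs: the names `copy_chunk`, `cache_file`, `cache_in_background`, the `workers` argument and the 1 MiB chunk size are all hypothetical, and it only uses the Python standard library.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size (1 MiB)


def copy_chunk(src_path, dst_path, offset, length):
    """Copy one byte range from src_path to the same offset in dst_path."""
    # Each worker opens its own file objects so seeks never interfere.
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        data = src.read(length)
        dst.seek(offset)
        dst.write(data)
    return len(data)


def cache_file(src_path, dst_path, workers=4, chunk_size=CHUNK_SIZE):
    """Copy src_path to dst_path with up to `workers` chunks in flight."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so every worker can write at its own offset.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        copied = pool.map(
            lambda off: copy_chunk(src_path, dst_path, off, chunk_size),
            offsets,
        )
        return sum(copied)  # total number of bytes copied


def cache_in_background(src_path, dst_path, workers=4):
    """Start caching on a separate thread so the caller is not blocked."""
    thread = threading.Thread(
        target=cache_file, args=(src_path, dst_path, workers), daemon=True
    )
    thread.start()
    return thread
```

In a FUSE handler, something like `cache_in_background` would be called from open or read so that the handler itself can return quickly; how that fits into PersistentCacheFs is left open here.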

Although I don't have time right now to make big changes on this project, I still have big plans for it. :sweat_smile: Something I really would like is to be able to say "pcachefs, please cache this file/folder". So the /.pcachefs holds a mirror of the normal filesystem but files are replaced by directories containing metadata about the actual file and what is being cached. Say the normal filesystem holds:

/
    a
    b/
        c

The pcachefs one will hold the same tree plus the .pcachefs directory:

/
    .pcachefs/
        a/
            cached
        b/
            cached   # Note I didn't get to do it on directories yet, it's a TODO for now :/
            c/
                cached

See these tests to follow along. And see Cacher to see what hidden files exist in addition to the cached one above.
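As an illustration of the layout above, this is a small sketch of how a path in the mirrored tree could map to its control file. The helper name `control_path` is hypothetical and not taken from Cacher; only the `.pcachefs` directory and the `cached` file name come from the trees above.

```python
import posixpath

CONTROL_ROOT = "/.pcachefs"  # directory name from the tree above
CACHED_NAME = "cached"       # control file name from the tree above


def control_path(path):
    """Return the control file path for a path in the mirrored filesystem.

    Hypothetical helper: the file or directory itself becomes a directory
    under CONTROL_ROOT, and the control file lives inside it.
    """
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return posixpath.join(CONTROL_ROOT, normalized.lstrip("/"), CACHED_NAME)


print(control_path("/a"))    # /.pcachefs/a/cached
print(control_path("/b/c"))  # /.pcachefs/b/c/cached
```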

The idea is that, for files (so for a and c), the cached file holds a floating-point number in [0, 1] representing the fraction of the file that is cached. If the user wants to cache a file (without reading it), he can simply do echo 1 > cached. The cache mechanism should then kick in and begin caching the file. If the user wants to stop caching the file (without deleting the file itself), he can simply do echo 0 > cached. The same should be allowed for directories too.
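A rough sketch of how the value in cached could be parsed and computed, under the semantics described above. Both function names are hypothetical, and treating an empty file as fully cached is my own assumption, since the spec is still open.

```python
def parse_cached_request(text):
    """Parse what the user wrote to `cached`, e.g. "1\\n" from `echo 1 > cached`."""
    value = float(text.strip())  # raises ValueError on non-numeric input
    if not 0.0 <= value <= 1.0:  # also rejects NaN
        raise ValueError("cached must be a number in [0, 1]")
    return value


def cached_fraction(cached_bytes, total_bytes):
    """Return the fraction of a file that is cached, as a float in [0, 1]."""
    if total_bytes == 0:
        return 1.0  # assumption: an empty file counts as fully cached
    return min(cached_bytes / total_bytes, 1.0)


print(parse_cached_request("1\n"))  # 1.0
print(cached_fraction(256, 1024))   # 0.25
```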

Note that I would want the user's preference to be "sticky": it should take precedence over automatic caching triggered by reading the file. Or maybe not, eh, the spec is still in progress. If you have any ideas about how you would want to move this forward, please let me know! If you also want to be a maintainer, I'm looking forward to it.

ibizaman commented 7 years ago

I added a makefile and a linter to make things easier.