mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License

Removing different subdirectory prefixes for all TARs in a folder? #107

Closed frankang closed 1 year ago

frankang commented 1 year ago

Hello~ First, I would like to express my appreciation for the usefulness of this repository. It saved me a lot of time and effort in working with TAR files.

I have a folder that contains multiple TAR files, each with a different subdirectory prefix (e.g. path1, path2, etc.). I would like to mount this folder using ratarmount, but I want to strip out the subdirectory prefix (path1/, path2/, ... ) for all of them. Is there a way to achieve this? My TAR folder looks like this:

tar_file1.tar
   - path1/1-file1
   - path1/1-file2
   - path1/1-file3
   - ...

tar_file2.tar
   - path2/2-file1
   - path2/2-file2
   - path2/2-file3
   - ...

Any help would be greatly appreciated. Thank you again for this amazing tool!

mxmlnkn commented 1 year ago

I do have a hacky solution:

# Generate test case
mkdir foo1
mkdir foo2
echo a > foo1/a
echo b > foo1/b
echo c > foo2/c
echo d > foo2/d
tar cf foo1.tar foo1
tar cf foo2.tar foo2

# Use (deprecated) --prefix or -o subdir... to get rid of the prefix for each archive
ratarmount -o modules=subdir,subdir=foo1 foo1.tar mounted-foo1
ratarmount -o modules=subdir,subdir=foo2 foo2.tar mounted-foo2

# Unite both mounts
ratarmount mounted-foo1 mounted-foo2 mounted

# Check results
tree mounted

Output:

mounted
├── a
├── b
├── c
└── d

0 directories, 4 files

It could be nice to generalize the --transform-recursive-mount-point into a --transform-paths option.

frankang commented 1 year ago

Thank you. The above workaround works; however, there is a noticeable latency increase (at least 5x) when opening files in the final "merged" mountpoint. Any suggestions for getting around the latency? There are over 100,000 files in my merged folder, and the CPU usage of the ratarmount process is very high during the file-open wait time.

mxmlnkn commented 1 year ago

Are the 100k files all in the top-level folder? One problem with the union mount is that it has to find out which archive a requested file belongs to. A cache exists to deal with this case, but it has some thresholds. If the cache cannot be used, then on each file access it has to check every union-mounted archive (or folder) for the requested file. Because of this, the lookup time increases with the number of union mount sources.
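To illustrate why the lookup cost grows with the number of sources, here is a minimal toy sketch. It is not ratarmount's actual code: ToySource and find_source are hypothetical names, and the "last source wins" order is an assumption made for the example.

```python
from typing import Dict, List, Optional, Set


class ToySource:
    """Hypothetical stand-in for one union-mounted archive or folder."""

    def __init__(self, name: str, paths: Set[str]) -> None:
        self.name = name
        self.paths = paths
        self.queries = 0  # counts how often this source was asked

    def exists(self, path: str) -> bool:
        self.queries += 1
        return path in self.paths


def find_source(
    path: str,
    sources: List[ToySource],
    cache: Optional[Dict[str, List[ToySource]]] = None,
) -> Optional[ToySource]:
    """Return the last source containing path (assumed precedence for this toy)."""
    # Fall back to asking every source when the path has no cache entry.
    candidates = cache.get(path, sources) if cache is not None else sources
    for source in reversed(candidates):
        if source.exists(path):
            return source
    return None


sources = [ToySource(f"archive{i}", {f"/file{i}"}) for i in range(100)]

# Without a cache entry, all sources are queried until the file is found.
find_source("/file0", sources)
uncached_queries = sum(s.queries for s in sources)

# With a cache entry for the path, only the listed source is queried.
for s in sources:
    s.queries = 0
cache = {"/file0": [sources[0]]}
find_source("/file0", sources, cache)
cached_queries = sum(s.queries for s in sources)

print(uncached_queries, cached_queries)
```

In this toy, the uncached lookup touches all 100 sources, while the cached lookup touches a single one.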

Is there any output regarding the cache when mounting with -d 3, i.e., ratarmount -d 3 -f mounted-foo1 mounted-foo2 mounted?

frankang commented 1 year ago

Thank you, yes, the files are all in the top-level folder. The output of running the mount command with the -d 3 option is:

Building cache for union mount (timeout after 60s)...
Cached mount sources for 1 folders up to a depth of 0 in 4.2e-05s for faster union mount.

Looks like it hasn't built any cache yet? Any option to force building this cache?

mxmlnkn commented 1 year ago

Unfortunately not (yet). I have attached an AppImage of the current development version that has these command line options:

  --union-mount-cache-max-depth UNION_MOUNT_CACHE_MAX_DEPTH
                        Maximum number of folder levels to descend for building the union mount cache. (default: 1024)
  --union-mount-cache-max-entries UNION_MOUNT_CACHE_MAX_ENTRIES
                        Maximum number of paths before stopping to descend into subfolders when building the union mount cache. (default: 100000)
  --union-mount-cache-timeout UNION_MOUNT_CACHE_TIMEOUT
                        Timeout before stopping to build the union mount cache. (default: 60)

Try it with something like ratarmount.AppImage --union-mount-cache-timeout 3600 --union-mount-cache-max-entries 10000000 ... (10 million files maximum, and maximum 1h).

ratarmount-manylinux2014_x86_64.AppImage.zip

frankang commented 1 year ago

Running the AppImage gives the error below:

fusermount: mount failed: Operation not permitted
[Error] FUSE mountpoint could not be created. See previous output for more information.

Whatever, I will try the develop branch later and see if it works. Thanks.

mxmlnkn commented 1 year ago

Weird... For the development branch, try:

python3 -m pip install --user --force-reinstall \
    'git+https://github.com/mxmlnkn/ratarmount.git@develop#egginfo=ratarmountcore&subdirectory=core' \
    'git+https://github.com/mxmlnkn/ratarmount.git@develop#egginfo=ratarmount'

frankang commented 1 year ago

Looks like the isImmutable property in the FolderMountSource.py file is always false for a FolderMountSource object, causing an empty folderCache in the UnionMount process.

https://github.com/mxmlnkn/ratarmount/blob/e61800c4d2ddedc46a2937fce2701ec3039788e5/core/ratarmountcore/FolderMountSource.py#L54-L58

https://github.com/mxmlnkn/ratarmount/blob/4ef96f038bf1096c5b9441c4d00e6a43964efbde/core/ratarmountcore/UnionMountSource.py#L55-L57

mxmlnkn commented 1 year ago

Ah, I didn't think of that. But then why was there a timeout in your output :/. It shouldn't even start to build the cache then. It should also be possible to detect folders residing on read-only filesystems and flag those as immutable. It might even be possible to use something like inotify to detect file creations and deletions inside the folders so that the cache can be updated accordingly. With that the immutability check wouldn't be necessary anymore.
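A rough sketch of the read-only check mentioned above, using only the Python standard library. The function name is hypothetical, it is POSIX-only, and it is not part of ratarmount; it merely shows one way such a check could look.

```python
import os


def is_on_read_only_filesystem(path: str) -> bool:
    """Return True if the filesystem containing path is mounted read-only (POSIX only)."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)


result = is_on_read_only_filesystem(".")
print(result)
```

Note that a read-only mount flag only says that this mount does not accept writes. The contents could still change, e.g., on a network filesystem or after a remount, so treating such folders as immutable would remain a heuristic.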

frankang commented 1 year ago

Does the “/” sign in self.folderCache = {"/": [m for m in self.mountSources if m.isImmutable()]} mean an operation after timeout?

https://github.com/mxmlnkn/ratarmount/blob/4ef96f038bf1096c5b9441c4d00e6a43964efbde/core/ratarmountcore/UnionMountSource.py#L55-L57

I also tried setting the isImmutable() to be always True in the foldermountsource.py file, then it does require more time to build the cache, but the file access latency only decreased marginally.

mxmlnkn commented 1 year ago

Does the “/” sign in self.folderCache = {"/": [m for m in self.mountSources if m.isImmutable()]} mean an operation after timeout?

No. This is the cache dictionary: the / key is a path, and its value lists all mount sources in which that path exists. Ratarmount checks the cache for a given path and then queries only the mount sources that actually contain that path.

But this explanation and your observation about slower speed make me realize that this cache might only cache folders :/. So if you have 100k files directly inside /, then this cache shouldn't even grow larger than one element because there are no subfolders.

I also tried setting the isImmutable() to be always True in the foldermountsource.py file, then it does require more time to build the cache, but the file access latency only decreased marginally.

That's a good idea for testing! I'm sorry it didn't improve performance. My first thought was that the cache dictionary simply isn't suited for 100k+ entries, but after the realization above I think it might be that the cache never actually grows because it only caches folders, not files.

The next step, in my opinion, would be to check how large the cache actually grows, e.g., with a simple print(len(self.folderCache)). If it indeed only holds a few elements, then the following step could be to try to also cache file paths.

For example, I see this line if not fileInfo or not stat.S_ISDIR(fileInfo.mode):. It might already suffice to change it to if not fileInfo: and to remove lines 123-128, i.e., the whole if condition starting with elif self.folderCache and self.folderCacheDepth > 0 and path.startswith('/'):.
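The difference between caching only folders and also caching file paths can be sketched with a toy model. build_cache and the listings dictionary are hypothetical and only mimic the idea; they are not the UnionMountSource implementation.

```python
from typing import Dict, List


def build_cache(listings: Dict[str, Dict[str, bool]], include_files: bool) -> Dict[str, List[str]]:
    """Map each cached path to the names of the sources that contain it.

    listings maps a source name to {path: is_directory}.
    """
    cache: Dict[str, List[str]] = {}
    for source_name, entries in listings.items():
        for path, is_directory in entries.items():
            # A folder-only cache skips every regular file.
            if is_directory or include_files:
                cache.setdefault(path, []).append(source_name)
    return cache


# Two flat sources: everything sits directly in the root folder.
listings = {
    "mounted-foo1": {"/": True, "/a": False, "/b": False},
    "mounted-foo2": {"/": True, "/c": False, "/d": False},
}

folders_only = build_cache(listings, include_files=False)
with_files = build_cache(listings, include_files=True)

print(len(folders_only))  # only "/" is cached
print(with_files["/c"])   # the single source that has to be queried for /c
```

With a flat layout, the folder-only cache contains just the "/" entry, which lists every source, so it cannot narrow down a file lookup at all.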

And in the long term, it might be preferable to implement this cache in an SQLite database to support really large use cases.
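As a rough idea of what an SQLite-backed cache could look like, here is a minimal sketch using Python's sqlite3 module. The table name, the column names, and the helper functions are all assumptions for illustration; nothing like this exists in ratarmount as far as this thread shows.

```python
import sqlite3
from typing import Iterable, List


def create_cache(database: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database; a real cache would probably live in a file."""
    connection = sqlite3.connect(database)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS path_sources ("
        " path TEXT NOT NULL, source INTEGER NOT NULL,"
        " PRIMARY KEY (path, source))"
    )
    return connection


def add_paths(connection: sqlite3.Connection, source_index: int, paths: Iterable[str]) -> None:
    """Record that the mount source with the given index contains these paths."""
    connection.executemany(
        "INSERT OR IGNORE INTO path_sources VALUES (?, ?)",
        ((path, source_index) for path in paths),
    )
    connection.commit()


def sources_for(connection: sqlite3.Connection, path: str) -> List[int]:
    """Return the indexes of all mount sources known to contain path."""
    rows = connection.execute(
        "SELECT source FROM path_sources WHERE path = ? ORDER BY source", (path,)
    )
    return [row[0] for row in rows]


cache = create_cache()
add_paths(cache, 0, ["/a", "/b"])
add_paths(cache, 1, ["/a", "/c", "/d"])
print(sources_for(cache, "/a"), sources_for(cache, "/c"), sources_for(cache, "/missing"))
```

The primary key on (path, source) gives an index on path, so a lookup does not have to scan all rows even with millions of entries.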

mxmlnkn commented 1 year ago

I want to create some benchmarks to look into it in depth. I need some further information about your setup:

I have created 10 tar files with 4k files each and, as expected, simply mounting them causes no noticeable slowdown even though all of those 100k files are in the root. However, when mounting them separately and then union mounting those mountpoints, the latency for a simple cat of one of those 4 B files increases from 2 ms to 2 s. After that, I was unable to unmount that mountpoint for a while because it was busy. Very weird. It definitely should not be this much slower.
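For anyone who wants to reproduce a similar setup, test archives with many tiny files in the archive root can be generated roughly like this. This is not the script used for the measurement above; the function name, the file naming, and the small counts are placeholders.

```python
import io
import os
import tarfile
import tempfile
from typing import List


def create_test_archives(folder: str, archive_count: int, files_per_archive: int) -> List[str]:
    """Create archive_count TARs, each holding files_per_archive 4-byte files in its root."""
    paths = []
    for i in range(archive_count):
        archive_path = os.path.join(folder, f"archive{i}.tar")
        with tarfile.open(archive_path, "w") as archive:
            for j in range(files_per_archive):
                data = b"abc\n"
                info = tarfile.TarInfo(name=f"{i}-file{j}")
                info.size = len(data)
                archive.addfile(info, io.BytesIO(data))
        paths.append(archive_path)
    return paths


with tempfile.TemporaryDirectory() as folder:
    # Small numbers for a quick check; increase them for an actual benchmark.
    archives = create_test_archives(folder, archive_count=3, files_per_archive=5)
    with tarfile.open(archives[0]) as archive:
        member_names = archive.getnames()
    print(len(archives), len(member_names))
```

Writing the members from memory avoids creating tens of thousands of temporary files on disk first.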

But, this also very likely means that offering a --transform option like GNU tar should fix that performance problem too. I'll try to push this onto the develop branch this weekend.

mxmlnkn commented 1 year ago

I've pushed a quick implementation of a --transform option to the develop branch. This option is only applied during index creation, so you should use it with -c or remove the index beforehand.

Create test archives:

mkdir folder1
mkdir folder2
echo hi > folder1/a
echo there > folder2/b
tar cf folder1.tar folder1
tar cf folder2.tar folder2

Use like this to remove the top-level folder:

> ratarmount -c folder1.tar folder2.tar mountpoint
> tree mountpoint

mountpoint
├── folder1
│   └── a
└── folder2
    └── b

2 directories, 2 files

> ratarmount -u mountpoint
> ratarmount -d 3 -c --transform '^/[^/]*/' '' folder1.tar folder2.tar mountpoint
> tree mountpoint

mountpoint
├── a
└── b

0 directories, 2 files

I have not tested the performance of this yet and there is still much to do to implement this correctly and consistently in ratarmount. Currently, it only works for TAR archives.
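To see what the pattern '^/[^/]*/' with an empty replacement does to individual paths, here is a small sketch using Python's re.sub. It assumes that the paths start with a slash, as the pattern suggests, and that the option behaves like a plain regex substitution; transform_path is a hypothetical helper, and how ratarmount normalizes the result afterwards is not shown.

```python
import re


def transform_path(path: str, pattern: str, replacement: str) -> str:
    """Apply a single regex substitution to one path."""
    return re.sub(pattern, replacement, path)


pattern = r"^/[^/]*/"  # leading slash, first path component, and the following slash
examples = ["/folder1/a", "/folder2/b", "/folder1/sub/c", "/top-level-file"]
transformed = [transform_path(path, pattern, "") for path in examples]

for before, after in zip(examples, transformed):
    print(before, "->", after)
```

Paths without a second slash, such as /top-level-file, do not match the pattern and are left unchanged.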

frankang commented 1 year ago

Oops, sorry for my late reply, will test the --transform option ASAP. Thanks again for the quick fix!

frankang commented 1 year ago

I tried the --transform option last week and tested it for a few days. I think it did solve the problem. Thank you for the solution!

mxmlnkn commented 1 year ago

Thank you for the feedback. I'll aim to include it in the next release.

lacek commented 9 months ago

May I know if the --transform option has been released yet?

It is included in the README.md of version 0.14.0 (https://github.com/mxmlnkn/ratarmount/blob/v0.14.0/README.md?plain=1#L303). But I'm not seeing the option in the help message (ratarmount -h) of the same version.

Output of ratarmount --version:

ratarmount 0.14.0
ratarmountcore 0.6.1

System Software:

Python 3.11.6
FUSE 2.9.9
libsqlite3 3.40.1

Compression Backends:

indexed_bzip2 1.4.1
rapidgzip 0.10.4
indexed_gzip 1.8.7
xz 0.4.0
indexed_zstd 1.1.3
rarfile 4.1

Edit:

I just found that the commit 815dbe3e0cecffcece546f9d58319f4c7952ec5a is only in the develop branch, and the commit message indicates that it's a WIP. So I guess the README.md was updated accidentally. I'll look for another workaround for my use case then.

mxmlnkn commented 9 months ago

You solved your question yourself, but as there is more demand, I'll take that as a cue to finish that feature with higher priority. The commit message also lists the open issues that I want to fix before merging it. I guess I could also live without it working for RAR files if I don't find time to implement this feature there.