mxmlnkn / ratarmount

Access large archives as a filesystem efficiently, e.g., TAR, RAR, ZIP, GZ, BZ2, XZ, ZSTD archives
MIT License
709 stars 36 forks

Possible to mount a remote archive? #138

Closed: thompsonmj closed this issue 3 months ago

thompsonmj commented 3 months ago

I can create quick access to contents in a .tar.gz file with ratarmount --index-file /path/to/file.tar.index.sqlite /path/to/file.tar.gz /path/to/mnt, where the index file was created with ratarmount.

Would it be possible to use the local index.sqlite to create a mount point for the .tar.gz file if the archive were only available in a remote location, not locally?

The aim would be indexed random access to remote compressed archive contents in an environment where we cannot have the full .tar.gz dataset maintained locally.

I'm not sure whether there is a plausible workaround with current methods, e.g., using requests with byte ranges, together with the offsets and other metadata from the ratarmount-generated index file.
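The core idea behind such a workaround can be sketched with a purely local stand-in for the remote archive (all names here are illustrative, and a real .tar.gz would additionally need gzip seek points, which ratarmount's index also stores): an index records each member's byte offset and size inside the archive, so a single ranged read retrieves one file without fetching the whole archive.

```python
import io
import tarfile

# Build a small TAR archive in memory (stand-in for the remote archive).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"hello from inside the tar"
    info = tarfile.TarInfo(name="greeting.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# An index such as ratarmount's records, per file, where its data lives in
# the archive; here we recover offset and size with tarfile instead.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    member = tar.getmember("greeting.txt")
    offset, size = member.offset_data, member.size

# With offset and size known, one ranged read suffices. Over HTTP this
# would be a GET with a "Range: bytes=<offset>-<offset+size-1>" header.
raw = buf.getvalue()
print(raw[offset:offset + size].decode())  # -> hello from inside the tar
```

The same pattern is what a remote-capable backend would do under the hood: translate a read of an archive member into one byte-range request against the remote file.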

mxmlnkn commented 3 months ago

It's not (yet) possible, but I'd like to add it by incorporating fsspec. In the meantime, you can combine other mounting tools with ratarmount, for example: fsspec (which had a work-in-progress FUSE binding at one point, though I cannot find it right now), httpfs, or httpfs2.
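As an illustrative sketch of that combination (the URL, file names, and mount points below are hypothetical): httpfs2 exposes a single remote file as if it were local, and ratarmount can then mount that file using a pre-built index.

```shell
# Expose one remote file as a local file via FUSE (hypothetical URL).
mkdir -p http_mnt archive_mnt
httpfs2 https://example.com/file.tar.gz http_mnt/

# Mount the now-"local" archive, reusing a pre-generated ratarmount index.
ratarmount --index-file file.tar.index.sqlite http_mnt/file.tar.gz archive_mnt/
```

Reads through archive_mnt/ then translate into byte-range HTTP requests against the remote file, so only the accessed portions are downloaded.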

thompsonmj commented 3 months ago

Excellent, fsspec seems to have worked for my needs with HfFileSystem, e.g.:

import os

from fsspec.fuse import run
from huggingface_hub import HfFileSystem

remote_path = "hf://datasets/imageomics/TreeOfLife-10M/dataset/EOL/"
fuse_mount_point = "./fuse_mnt/"

os.makedirs(fuse_mount_point, exist_ok=True)

# Expose the remote Hugging Face dataset directory as a local FUSE mount.
fs = HfFileSystem()
run(fs, remote_path, fuse_mount_point, foreground=True)

Then, using the pre-processed index files with:

ratarmount ./fuse_mnt/image_set_01.tar.gz ./mnt_01 --index-file ./dataset_index/image_set_01.tar.index.sqlite

I am able to work with the contents as if they were on the local drive!