thompsonmj closed 3 months ago
Excellent, `fsspec` seems to have worked for my needs with `HfFileSystem`, e.g.

```python
import os

from fsspec.fuse import run
from huggingface_hub import HfFileSystem

# Remote directory within the Hugging Face Hub filesystem
remote_path = "hf://datasets/imageomics/TreeOfLife-10M/dataset/EOL/"
fuse_mount_point = "./fuse_mnt/"
os.makedirs(fuse_mount_point, exist_ok=True)

fs = HfFileSystem()
# Expose the remote directory as a local FUSE mount (blocks while mounted)
run(fs, remote_path, fuse_mount_point, foreground=True)
```
Then, using the pre-processed index files with

```shell
ratarmount ./fuse_mnt/image_set_01.tar.gz ./mnt_01 --index-file ./dataset_index/image_set_01.tar.index.sqlite
```

I am able to work with the contents as if they were on the local drive!
I can create quick access to the contents of a `.tar.gz` file with

```shell
ratarmount --index-file /path/to/file.tar.index.sqlite /path/to/file.tar.gz /path/to/mnt
```

where the index file was created with `ratarmount`.

Would it be possible to use the local `index.sqlite` to create a mount point for the `.tar.gz` file if the archive were only available in a remote location and not available locally? The aim would be indexed random access to the contents of a remote compressed archive in an environment where we cannot keep the full `.tar.gz` dataset locally.

I'm not sure whether there is a plausible workaround with current methods, e.g. using `requests` with byte ranges, offsets, and other metadata from the `ratarmount`-generated index file.
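For what it's worth, the byte-range idea could be sketched roughly as below. This is an untested sketch with several assumptions: that the ratarmount index stores per-member `offset` and `size` columns in a `files` table (verify against your own index with `sqlite3 file.tar.index.sqlite .schema`), and that the archive is an *uncompressed* tar, so member offsets map directly onto remote byte positions. For a `.tar.gz`, the index offsets refer to the decompressed stream, so plain HTTP ranges would not be enough on their own; you would also need the gzip seek points that ratarmount keeps.

```python
# Sketch: look up a member's byte span in a ratarmount-style SQLite index,
# then fetch just that span from a remote file with an HTTP Range request.
# The "files" table layout (path, name, offset, size) is an assumption.
import sqlite3
import urllib.request


def lookup_member(index_path, directory, name):
    """Return (offset, size) for one archive member from the index."""
    con = sqlite3.connect(index_path)
    try:
        row = con.execute(
            "SELECT offset, size FROM files WHERE path = ? AND name = ?",
            (directory, name),
        ).fetchone()
    finally:
        con.close()
    if row is None:
        raise FileNotFoundError(f"{directory}/{name} not found in index")
    return row


def range_header(offset, size):
    """Build an HTTP Range header for `size` bytes starting at `offset`."""
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


def fetch_range(url, offset, size):
    """Fetch one member's raw bytes from the remote archive."""
    req = urllib.request.Request(url, headers=range_header(offset, size))
    with urllib.request.urlopen(req) as resp:  # expects a 206 Partial Content
        return resp.read()
```

This only covers the raw byte fetch; reassembling a usable file would still require parsing the tar header at that offset, which is why mounting through FUSE is so much more convenient when it is available.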