khanlab / snakebids

Snakemake + BIDS
https://snakebids.readthedocs.io
MIT License
17 stars 14 forks source link

Support for Cloud Storage URLs (with universal_pathlib) #451

Closed pvandyken closed 1 month ago

pvandyken commented 2 months ago

Discussed in https://github.com/khanlab/snakebids/discussions/450

Originally posted by **@Karl5766** July 17, 2024 Ali and I are considering using Google Cloud buckets (with a access url like "gs://path/to/file") to read and write OME_ZARR files in snakemake workflows. We would like something as follows: ``` from snakebids.core.input_generation import generate_inputs inputs = generate_inputs( 'gs://khanlab-lightsheet/data/onuska_lifecanvas/bids', # instead of a local path, we pass in a string starting with "gs://" {'spim': {'filters': {'suffix': 'SPIM', 'subject': 'o21', 'extension': 'ome.zarr'}, 'wildcards': ['subject', 'sample', 'acquisition', 'staining']}} ) print(inputs) ``` while both pybids and this repo are currently depending on Python's Path class, Ali's idea is to use [universal_pathlib's UPath](https://github.com/fsspec/universal_pathlib) objects as drop-in replacement for Path. For pybids, We made a [fork](https://github.com/bids-standard/pybids/compare/master...Karl5766:pybids:master) of pybids which supports UPath and passes the unit tests of the main repository. For this repo, after further modify the `os.path.isabs(path)` call in [input_generation.py](https://github.com/khanlab/snakebids/blob/main/snakebids/core/input_generation.py) to make sure `gs://...` is not recognized as a local relative path, the above code seems to work. Depending on a pybids' fork is going to break dependency for other users of this repo (We are considering a PR to the main branch of pybids for UPath support. Right now, however, it's a fork and it's not finalized), but I think a commit replacing os.path.isabs() with an equivalent expression compatible with cloud storage paths is possible (so we can use a UPath compatible version of pybids in our own project if we need UPath support in snakebids). @pvandyken What do you think? Edit: I modified [input_generation.py](https://github.com/khanlab/snakebids/blob/main/snakebids/core/input_generation.py)'s line 726 to use `path.absolute() != path` to test for relative path which is a workaround but I think there is probably a better way to do it.
Response from @pvandyken July 17, 2024 UPath is a subclass of path, so should be sufficient to change that line to `path.is_absolute()` (after making sure the path is a pathlib). I would coerce to path immediately in the `generate_inputs()` function. `generate_inputs()` should then be typed with the `StrPath` fake-imported from typeshed: ```py from typing import TYPE_CHECKING if TYPE_CHECKING: from _typeshed import StrPath def generate_inputs( root: StrPath, ... ): if not isinstance(path, Path): path = Path(path) ... ``` This should make the function work with any subtype of Path (e.g. UPath), otherwise for strings etc it will coerce into a path. Then just make sure we don't use any of the old `os.path` functions anywhere. Should be easy enough to implement, are you willing? It would additionally need a test testing that both `PathLike` objects work (e.g. objects with `__fspath__()` implemented) and subclasses of `Path` work as expected.
Karl5766 commented 2 months ago

This more complicated than I thought, for pytest whenever fakefs_tmp is used it seems that the global pathlib.Path class is mocked by a pyfakefs' fake_pathlib.FakePathlibPathModule object, which causes issues since UPath.__init__() uses some hacks to get its superclass Path to initialize a new UPath instance. This will throw a TypeError with the following code I added to e.g. test_generate_inputs.py

def test_upath(fakefs_tmpdir: Path):
    from upath import UPath
    UPath('/some/linux/path')  # raises an exception where this shouldn't

Any idea how to resolve this issue?

I also see many tests failing with Data generation is extremely slow: Only produced 9 valid examples in 1.02 seconds (5 invalid ones and 0 exceeded maximum size). Try decrea…, is this expected? This happens when I run (OS: on both Windows and WSL)

poetry run poe test -n 6 -m "not docker"

Edit: Note this problem only exists if we have replaced all Path as UPath in the pybids repo. I'm also opening a PR for changes allowing the repo to support UPath.

pvandyken commented 1 month ago

This more complicated than I thought, for pytest whenever fakefs_tmp is used it seems that the global pathlib.Path class is mocked by a pyfakefs' fake_pathlib.FakePathlibPathModule object, which causes issues since UPath.init() uses some hacks to get its superclass Path to initialize a new UPath instance. This will throw a TypeError with the following code I added to e.g. test_generate_inputs.py

I wouldn't test with UPath specifically. At most, just make a very light subclass of Path that doesn't change anything (actually, Path itself has some "hacks" so subclassing it may not be straightforward. Python 3.12 introduces the with_segments method that makes this much easier, so possibly just run the test in python 3.12 and higher.

I also see many tests failing with Data generation is extremely slow: Only produced 9 valid examples in 1.02 seconds (5 invalid ones and 0 exceeded maximum size). Try decrea…, is this expected? This happens when I run (OS: on both Windows and WSL)

Unfortunately this is rather normal. On the gh-actions testing I have all the time limits disabled so it doesn't pop up. At some point it probably would be good to go through and address some of these more specifically, but it's not high priority. For now, I recommend not running the entire test suite on your machine but just the tests you need, and let the rest run on the server.

Karl5766 commented 1 month ago

Closing this issue as PR has been merged.