jpmorganchase / jupyter-fs

A filesystem-like contents manager for multiple backends in Jupyter
Apache License 2.0

Use `fsspec` instead of `pyfilesystem2` #102

Open dhirschfeld opened 3 years ago

dhirschfeld commented 3 years ago

Before fsspec existed I used pyfilesystem2 and was very happy with it; it's a great library. However, it (apparently) didn't meet all the requirements for dask, so fsspec was built, primarily to support dask, but it's also used in intake and as a generic filesystem API. As such, it has a robust community around it and is continually improving and maturing.

Coming from the distributed computing world, fsspec has first-class support for cloud storage and, in particular (for my use case), Azure Data Lake.

I haven't actually used the cloud storage plugins in pyfilesystem2 but they don't seem to have a lot of development momentum behind them, unlike fsspec.

To better support cloud filesystems, I think it would be great if jupyter-fs could make use of fsspec rather than pyfilesystem2.

dhirschfeld commented 3 years ago

TBF, I think fsspec still isn't quite as mature as pyfilesystem2 and doesn't have quite as polished an API; however, it does seem to have much better support for the use cases I care about.

dhirschfeld commented 3 years ago

xref: #7

telamonian commented 3 years ago

I've been talking to @martindurant about fsspec for a while now (he's the creator). My current preference is to not throw the baby out with the pyfilesystem bathwater, and instead include some kind of support for both pyfilesystem2 and fsspec. Martin has actually been kind enough to get an implementation of fsspec for jupyter-fs started in his changes branch here.

@dhirschfeld I don't have much bandwidth to work on jupyter-fs right now, and most of my effort is currently going towards the new tree-finder based filebrowser. But if you want to take a crack at it, I would not say no to an fsspec PR.

dhirschfeld commented 3 years ago

It's an itch I'd like to scratch, but realistically won't have time to look at any time soon.

I'm using fsspec to access data on cloud storage from JupyterLab and I thought it would be nice to be able to browse that same storage from within JupyterLab to e.g. check if my f.write(data) call really worked. There's a slight friction having to switch to the Azure Portal to check if the files that should have been written to cloud storage really were written.

Unfortunately, since it's a "nice-to-have" rather than a "can't live without" I won't be able to invest time into it in the medium term - I can't even keep up with my can't live without's :/

reoono commented 2 years ago

Is there any update on this?

If not, I would like to work on this issue, to use fsspec for protocols not supported by pyfilesystem2.

> My current preference is to not throw the baby out with the pyfilesystem bathwater, and instead include some kind of support for both pyfilesystem2 and fsspec.

Based on the above comments, I am considering either of the following policies, but would appreciate comments if you have a preference.

  1. Select the backend for each resource through a `backend` setting, as in the following example. (For backward compatibility, use pyfilesystem2 if it is not set.)

    {
      "resources": [
        {
          "name": "explicit_pyfilesystem2_resource",
          "url": "osfs:///Users/foo/test",
          "backend": "pyfilesystem2"
        },
        {
          "name": "implicit_pyfilesystem2_resource",
          "url": "osfs:///Users/foo/test"
        },
        {
          "name": "fsspec_resource",
          "url": "s3://test",
          "backend": "fsspec"
        }
      ]
    }
  2. Check if the protocol is supported by pyfilesystem2, and if so, use pyfilesystem2. Otherwise, use fsspec. https://github.com/PyFilesystem/pyfilesystem2/blob/master/fs/opener/registry.py#L93

If there is no preference, I would like to proceed with option 1, since it leaves more room for future expansion. Any comments or suggestions would be appreciated.
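
To make the two policies concrete, here is a rough sketch of how the selection could work. It is only an illustration: `choose_backend`, `choose_backend_by_protocol`, and the `pyfs_protocols` set are hypothetical names and values, not existing jupyter-fs code. In practice the protocol list would presumably be read from pyfilesystem2's opener registry rather than hard-coded.

```python
from urllib.parse import urlsplit

DEFAULT_BACKEND = "pyfilesystem2"  # keeps existing configs working unchanged
KNOWN_BACKENDS = {"pyfilesystem2", "fsspec"}


def choose_backend(resource):
    """Policy 1: honor an explicit "backend" key, defaulting to pyfilesystem2."""
    backend = resource.get("backend", DEFAULT_BACKEND)
    if backend not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r} for resource {resource.get('name')!r}"
        )
    return backend


def choose_backend_by_protocol(url, pyfs_protocols):
    """Policy 2: use pyfilesystem2 if it knows the URL's protocol, else fsspec."""
    protocol = urlsplit(url).scheme
    return "pyfilesystem2" if protocol in pyfs_protocols else "fsspec"


resources = [
    {
        "name": "explicit_pyfilesystem2_resource",
        "url": "osfs:///Users/foo/test",
        "backend": "pyfilesystem2",
    },
    {"name": "implicit_pyfilesystem2_resource", "url": "osfs:///Users/foo/test"},
    {"name": "fsspec_resource", "url": "s3://test", "backend": "fsspec"},
]

policy1 = [choose_backend(r) for r in resources]

# Assumed protocol set, for illustration only; the real list could come from
# pyfilesystem2's opener registry (e.g. fs.opener.registry.protocols).
pyfs_protocols = {"osfs", "mem", "ftp", "zip", "tar", "temp"}
policy2 = [choose_backend_by_protocol(r["url"], pyfs_protocols) for r in resources]
```

With this sample config both policies pick the same backends, but they diverge once a protocol is supported by both libraries: policy 2 would always pick pyfilesystem2 there, while policy 1 lets the user decide.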

martindurant commented 2 years ago

Note that fsspec instances generally need more configuration. Whilst it is possible to set the default values for any particular protocol, it is very conceivable to want different configurations for, e.g., an owned bucket, a public bucket and a requester-pays bucket on S3 (or even a different S3-compatible service).
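
As a rough illustration of what that could mean for the config, the sketch below layers per-resource options over per-protocol defaults. The `kwargs` key, the `PROTOCOL_DEFAULTS` table, the bucket names, the endpoint URL, and the `effective_options` helper are all hypothetical; `anon`, `requester_pays`, and `client_kwargs` are option names accepted by s3fs.

```python
# Hypothetical per-protocol defaults, applied to every resource of that protocol.
PROTOCOL_DEFAULTS = {"s3": {"anon": False}}

# Four S3 resources that need different settings despite sharing one protocol.
resources = [
    {"name": "owned", "url": "s3://my-bucket"},
    {"name": "public", "url": "s3://open-data", "kwargs": {"anon": True}},
    {
        "name": "requester_pays",
        "url": "s3://paid-data",
        "kwargs": {"requester_pays": True},
    },
    {
        "name": "compatible_service",
        "url": "s3://other",
        "kwargs": {"client_kwargs": {"endpoint_url": "https://s3.example.com"}},
    },
]


def effective_options(resource):
    """Merge protocol defaults with the resource's own overrides (overrides win)."""
    protocol = resource["url"].split("://", 1)[0]
    return {**PROTOCOL_DEFAULTS.get(protocol, {}), **resource.get("kwargs", {})}


options = {r["name"]: effective_options(r) for r in resources}
```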

reoono commented 2 years ago

Thank you for your comment. I believe that the feature will be worthwhile even with default values at first, since it will also support protocols that are not yet supported by pyfilesystem2. Therefore, I would like to proceed initially with default values, as in the current usage of pyfilesystem2. As for more detailed configurations, I would be willing to consider them if necessary.

(Not related to the issue, but I also find fsspec useful on a daily basis. Thank you for developing a very cool and useful product)

reoono commented 2 years ago

I have started to implement the addition of fsspec.

Since fsspec.core.url_to_fs() is used internally to create instances, I began to think that making 'kwargs' configurable in addition to 'backend' would solve the problem you mentioned. (For example, I would like to be able to pass options such as client_kwargs.)

Of course, as an interface to JupyterLab's settings, this would be somewhat redundant. However, this is not a big problem, because the option is only for users who want to do complicated things. (Basic users will still be able to use it with the same settings.)
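
A minimal sketch of the idea, assuming a resource entry carries an optional `kwargs` dict that is forwarded unchanged to `fsspec.core.url_to_fs()`. The helper names `url_to_fs_call` and `open_resource` and the endpoint URL are hypothetical, and fsspec is imported lazily so that only `open_resource` requires it (plus s3fs for `s3://` URLs).

```python
def url_to_fs_call(resource):
    """Split a resource entry into the URL and keyword arguments for fsspec."""
    return resource["url"], dict(resource.get("kwargs", {}))


def open_resource(resource):
    """Create an fsspec filesystem for a resource (needs fsspec installed)."""
    from fsspec.core import url_to_fs  # lazy import: only needed when opening

    url, kwargs = url_to_fs_call(resource)
    fs, path = url_to_fs(url, **kwargs)
    return fs, path


# Basic users: same settings as before, no "kwargs" key at all.
basic = {"name": "fsspec_resource", "url": "s3://test", "backend": "fsspec"}

# Advanced users: extra options passed through, e.g. a custom S3 endpoint.
advanced = {
    "name": "custom_endpoint_resource",
    "url": "s3://test",
    "backend": "fsspec",
    "kwargs": {"client_kwargs": {"endpoint_url": "http://localhost:9000"}},
}

basic_call = url_to_fs_call(basic)
advanced_call = url_to_fs_call(advanced)
```

Calling `open_resource(advanced)` would then hand `client_kwargs` to the S3 filesystem, while `open_resource(basic)` falls back to whatever defaults fsspec applies for the protocol.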

martindurant commented 2 years ago

Thanks @reoono , let me know if I can help.

tharwan commented 3 weeks ago

Is the effort here related to https://github.com/fsspec/jupyter-fsspec?

martindurant commented 3 weeks ago

jupyter-fsspec is "inspired" by this repo, and is only in early stages so far. If you would like to port any functionality or otherwise help develop it, that would be cool.

timkpaine commented 3 weeks ago

I do still plan on moving to fsspec eventually. There were some issues detailed here (as well as some others not written here) that were a problem, but they should be OK now.