pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Clarify what subset of fsspec's interface we use #137

Open TomAugspurger opened 3 years ago

TomAugspurger commented 3 years ago

I'm trying to push some data through pangeo-forge-recipes, and local testing is working fine. When I go to write to blob storage, I'm running into a convoluted set of bugs conspiring to make things fail (partially written files, stale caches, differences in implementations, plain bugs, ...)

To make some progress on my problem, I wrote a very thin wrapper around the azure.storage.blob interface that implements a MutableMapping, essentially replacing adlfs.AzureBlobFileSystem.get_mapper().

Given the extremely difficult task fsspec is trying to solve, I wonder if we can pin down the subset of the fsspec interface that we're actually using, in addition to Zarr's reliance on the MutableMapping interface. Currently, FSSpecTarget takes an instance of fsspec's AbstractFileSystem, so in theory a target would need to implement the entire interface. But I think we use much less of it. So far, I have a fake "AbstractFileSystem" that implements just get_mapper and isdir (see the snippet in my next comment), and with that I've been able to process some data using the XarrayToZarr recipe.

So this issue would be to identify

  1. If this is worth doing (I'll continue to investigate the issues I'm seeing with adlfs / fsspec, but like I said, it's convoluted)
  2. If so, what pangeo-forge needs out of a target filesystem.
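
To make the target concrete, here is a sketch of what such a narrowed contract could look like as a typing.Protocol. This is purely illustrative, not pangeo-forge-recipes code; the method names come from the shim in my next comment.

from collections.abc import MutableMapping
from typing import Protocol


class MinimalTargetFS(Protocol):
    """Hypothetical minimal contract for an FSSpecTarget filesystem."""

    def get_mapper(self, root: str) -> MutableMapping:
        # The MutableMapping view of ``root`` that Zarr writes into.
        ...

    def isdir(self, path: str) -> bool:
        # Used when checking / initializing the target location.
        ...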
TomAugspurger commented 3 years ago

FYI @ciaranevans @sharkinsspatial

import os
from collections.abc import MutableMapping

from azure.core.exceptions import ResourceNotFoundError


class AzureBlobStorageStore(MutableMapping):
    def __init__(self, container_client, root=""):
        # ``root`` is a blob-name prefix; the trailing "/" requirement keeps
        # the prefix stripping in __iter__ correct.
        if len(root):
            assert root[0] != "/"
            assert root[-1] == "/"
        self.container_client = container_client
        self.root = root

    def __getitem__(self, key):
        key = os.path.join(self.root, key)
        with self.container_client.get_blob_client(key) as bc:
            try:
                stream = bc.download_blob()
            except ResourceNotFoundError as e:
                # Missing blobs must surface as KeyError to satisfy the
                # MutableMapping contract that Zarr relies on.
                raise KeyError(key) from e
            data = stream.readall()
        return data

    def __setitem__(self, key, value):
        key = os.path.join(self.root, key)
        # Single-element ndarrays sometimes arrive here (bug in zarr? xarray?);
        # the blob client needs bytes, so convert.
        if hasattr(value, "size") and value.size == 1 and hasattr(value, "tobytes"):
            value = value.tobytes()

        with self.container_client.get_blob_client(key) as bc:
            bc.upload_blob(value, overwrite=True)

    def __delitem__(self, key):
        key = os.path.join(self.root, key)
        with self.container_client.get_blob_client(key) as bc:
            bc.delete_blob()

    def __iter__(self):
        # Strip the root prefix so keys are relative to the store root,
        # matching what Zarr expects from a MutableMapping.
        prefix_len = len(self.root)
        return (
            x["name"][prefix_len:] for x in self.container_client.list_blobs(self.root)
        )

    def __len__(self):
        return len(list(self.container_client.list_blobs(self.root)))

class AzureBlobStorageFS:
    """Minimal stand-in for fsspec's AbstractFileSystem: only the two
    methods that FSSpecTarget actually calls."""

    def __init__(self, container_client):
        self.container_client = container_client

    def get_mapper(self, root):
        return AzureBlobStorageStore(self.container_client, root)

    def isdir(self, path):
        # Blob storage has no real directories, so pretend they all exist.
        return True

Usage:

    target = FSSpecTarget(fs_remote, f"{frequency}/{region}.zarr/")
    recipe.target = target
rabernat commented 3 years ago

When I go to write to blob storage, I'm running into a convoluted set of bugs conspiring to make things fail (partially written files, stale caches, differences in implementations, plain bugs, ...)

Yes I feel your pain! This is partly what prompted https://github.com/pangeo-data/pangeo-integration-tests/issues/1.

It is relatively easy to write your own limited cloud storage adaptor. On the other hand, we have @martindurant on contract to help us with all fsspec-related issues. I would like to see if we can't fix things upstream first, rather than forking.

martindurant commented 3 years ago

This sort of circles back to the specific mappers that already exist in zarr. Yes, you can certainly make things simpler by specialising your storage backend to be mappers only. I do worry about multiple implementations, though - whenever we learn how to do things well for a particular backend, we would need to port the fix to both places. Also, this mechanism skips other things fsspec might offer, such as the discussion about non-local caching: pangeo-forge could implement this itself, but I would rather it was upstream and useful to a wider community. Obviously I am not impartial here.

rabernat commented 3 years ago

Zarr also has its own bespoke ABS store: https://github.com/zarr-developers/zarr-python/blob/master/zarr/storage.py#L2185
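
For reference, using it directly would look roughly like this (a sketch against the zarr-python 2.x API from memory; check the ABSStore docstring for the exact signature, and the container / credential values are placeholders):

import zarr

# Roughly how Zarr's built-in Azure Blob Storage store would be used;
# parameter names are assumptions, credentials are placeholders.
store = zarr.storage.ABSStore(
    container="my-container",
    prefix="my-dataset.zarr",
    account_name="myaccount",
    account_key="...",
)
group = zarr.open_group(store, mode="a")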

ciaransweet commented 3 years ago

@TomAugspurger what is fs_remote? Can I just drop the above classes into https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/add-k8s-cluster/flow_test/transform_flow.py for the time being? Or where would you put them?

@rabernat @martindurant I've no real understanding of the underlying fsspec/adlfs stuff, but I'm definitely also experiencing some 'fun' with Azure Blob Storage and recipes: https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/4#issuecomment-840517564

rabernat commented 3 years ago

Regardless of what route we choose here, I agree that formalizing the fsspec interface we use is a good idea. Since #133, all fsspec-related stuff has lived exclusively in storage.py, so it should be relatively easy.

TomAugspurger commented 3 years ago

Agreed we don't want to fork away from fsspec or reimplement stuff. @ciaranevans was also hitting issues in adlfs / fsspec and I wanted to get them unblocked by sharing that snippet. @ciaranevans the full context is at https://gist.github.com/TomAugspurger/c1773efa06d71443f135eb1af8832b82, which includes defining fs_remote.
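
For anyone not clicking through, a hypothetical construction of fs_remote using the shim above might look something like the following (the real definition, including credentials, is in the gist; the connection string and container name here are placeholders):

from azure.storage.blob import ContainerClient

# Hypothetical definition of fs_remote; the real one lives in the gist.
container_client = ContainerClient.from_connection_string(
    "...", container_name="my-container"
)
fs_remote = AzureBlobStorageFS(container_client)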

The potential action item for this issue is to decide if we can rely on just Zarr's usage of MutableMapping (+ maybe listdir and rmdir), which fsspec implements. That would maybe limit the API surface that pangeo-forge is exposed to. Currently it looks like we use isdir and mkdir to initialize the Bucket / Storage account. Maybe that could be done ahead of time by the bakery.
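
Concretely, a narrowed target could look something like the sketch below. This is hypothetical, simplified code, not the real FSSpecTarget in storage.py; it only assumes fsspec's public get_mapper / ls / rm calls.

import os


class NarrowTarget:
    """Hypothetical stand-in for FSSpecTarget with a minimal surface."""

    def __init__(self, fs, root_path):
        self.fs = fs  # any fsspec filesystem, or a shim like the one above
        self.root_path = root_path

    def get_mapper(self):
        # The MutableMapping Zarr writes into (fsspec returns an FSMap here).
        return self.fs.get_mapper(self.root_path)

    def listdir(self, subpath=""):
        # Optional: only needed if a recipe inspects an existing store.
        return self.fs.ls(os.path.join(self.root_path, subpath))

    def rmdir(self, subpath=""):
        # Optional: only needed to clean up partial writes.
        self.fs.rm(os.path.join(self.root_path, subpath), recursive=True)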

That said, just limiting the API surface wouldn't have fixed things in this case, as the bugs I'm hitting are in the MutableMapping API codepath.