polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

`polaris.utils.errors.PolarisHubError: Error opening Zarr store` at dataset upload #147

Closed · fteufel closed this 1 month ago

fteufel commented 1 month ago

Polaris version

dev

Python Version

3.10

Operating System

Linux

Installation

pip

Description

I'm trying to upload a Zarr dataset, and I'm not doing anything special as far as I can tell. I think the Zarr upload fails, but the dataset gets created on the Hub anyway. I'm not sure what's going wrong:

2024-07-21 17:05:44.769 | INFO     | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1027/1027 [00:00<00:00, 2670.07it/s]
💥 ERROR: Failed to upload dataset. 
Traceback (most recent call last):
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 330, in open_zarr_file
    return zarr.open(store, mode=mode)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/convenience.py", line 123, in open
    return open_group(_store, mode=mode, **kwargs)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/hierarchy.py", line 1581, in open_group
    init_group(store, overwrite=True, path=path, chunk_store=chunk_store)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/storage.py", line 682, in init_group
    _init_group_metadata(store=store, overwrite=overwrite, path=path, chunk_store=chunk_store)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/storage.py", line 704, in _init_group_metadata
    rmdir(store, path)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/storage.py", line 212, in rmdir
    store.rmdir(path)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/storage.py", line 1548, in rmdir
    if self.fs.isdir(store_path):
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/fsspec/spec.py", line 705, in isdir
    return self.info(path)["type"] == "directory"
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/fsspec/spec.py", line 665, in info
    out = self.ls(self._parent(path), detail=True, **kwargs)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/polarisfs.py", line 94, in ls
    response.raise_for_status()
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://polarishub.io/api/v1/storage/dataset/mlls/BEND_chromatin_accessibility/ls'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/novo/users/fegt/BEND/scripts/upload_polaris_datasets.py", line 236, in <module>
    dataset.upload_to_hub(owner='mlls')
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/dataset/_dataset.py", line 372, in upload_to_hub
    self.client.upload_dataset(self, access=access, owner=owner)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 587, in upload_dataset
    dest = self.open_zarr_file(
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 333, in open_zarr_file
    raise PolarisHubError("Error opening Zarr store") from e
polaris.utils.errors.PolarisHubError: Error opening Zarr store

Steps to reproduce

# I have a `df` DataFrame and a `multihot_labels` numpy array
import zarr

from polaris.dataset import ColumnAnnotation, Dataset
from polaris.utils.types import HubOwner

root = zarr.open('chromatin.zarr', "w")
root.array("labels", multihot_labels)  # np array of shape (n_samples, 125)
zarr.consolidate_metadata('chromatin.zarr')  # this seems necessary, not sure why

df['label'] = [f'labels#{i}' for i in range(len(df))]

annotations = {
    "sequence": ColumnAnnotation(
        # modality="dna",
        description="The nucleotide sequence of the DNA region",
        # user_attributes={"unit": "mL/min/kg"},
    ),
    "strand": ColumnAnnotation(
        description="The strand of the DNA region",
    ),
    "chromosome": ColumnAnnotation(
        description="The chromosome of the DNA region",
    ),
    "start": ColumnAnnotation(
        description="The start coordinate of the DNA region",
    ),
    "end": ColumnAnnotation(
        description="The end coordinate of the DNA region",
    ),
    "label": ColumnAnnotation(
        description="The labels indicating the chromatin accessibility of the DNA region in the cell lines",
        is_pointer=True
    ),
}

dataset = Dataset(
    # The table is the core data-structure required to construct a dataset
    table=df.loc[:, ["sequence", "strand", "chromosome", "start", "end", "label"]],
    # Additional meta-data on the dataset level.
    name="BEND_chromatin_accessibility",
    description="Multilabel classification of chromatin accessibility in cell lines from the BEND benchmark",
    source="https://doi.org/10.1038/nature11247",
    annotations=annotations,
    curation_reference="https://arxiv.org/abs/2311.12570",
    owner=HubOwner(user_id="fteufel", slug="fteufel"),
    user_attributes={"year": "2023"},
    zarr_root_path="chromatin.zarr",
    license="CC-BY-4.0"
)

print(dataset.get_data(row=1, col='label'))

dataset.upload_to_hub(owner='mlls')

Additional output

No response

cwognum commented 1 month ago

Thanks for reporting, @fteufel!

I think this would be fixed by #146. It's because we use the dataset name rather than its slug (i.e., the name converted to only lowercase letters and dashes) when building the storage URL; you can see the raw name `BEND_chromatin_accessibility` in the failing URL above.
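
For illustration, slugification along these lines is what's involved. This helper is just a sketch to show the idea, not our actual implementation:

import re

def slugify(name: str) -> str:
    # Lowercase the name, collapse every run of other characters into a
    # single dash, and strip leading/trailing dashes.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

print(slugify("BEND_chromatin_accessibility"))  # bend-chromatin-accessibility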

cwognum commented 1 month ago

Uploading Zarr is a multi-stage process, so it can happen that you do see the dataset on the Hub already, even though a later step in the upload process fails. However, you should always see a banner at the top stating that the dataset upload has not completed yet.

cwognum commented 1 month ago

Thanks to @zhu0619, #146 is now merged and this fix was included in release 0.7.3!

Could you try upgrading Polaris to the latest version and let me know if the issue persists?

fteufel commented 1 month ago

It worked now for another dataset.

The one that failed at upload stayed broken, with the banner still there after 16+ hours. I deleted it in the UI in order to upload it again. Now I get:

  "message": "Dataset 'bend-chromatin-accessibility', with slug 'bend-chromatin-accessibility', already exists"

Do I need to refresh something to make the delete register?

zhu0619 commented 1 month ago

@fteufel The reason is that the metadata was still registered in the database. I just removed it from the database, so you can try again.

fteufel commented 1 month ago

Something is still hanging:

2024-07-22 16:36:47.580 | INFO     | polaris._mixins:md5sum:27 - Computing the checksum. This can be slow for large datasets.
Finding all files in the Zarr archive: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1027/1027 [00:00<00:00, 2661.58it/s]
💥 ERROR: Failed to upload dataset. 
Traceback (most recent call last):
  File "/novo/users/fegt/BEND/scripts/upload_polaris_datasets.py", line 243, in <module>
    dataset.upload_to_hub(owner='mlls')
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/dataset/_dataset.py", line 377, in upload_to_hub
    self.client.upload_dataset(self, access=access, owner=owner)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 581, in upload_dataset
    hub_response.raise_for_status()
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/httpx/_models.py", line 761, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '405 Method Not Allowed' for url 'https://polarishub.io/storage/dataset/mlls/bend-chromatin-accessibility/table.parquet'

zhu0619 commented 1 month ago

@fteufel We are working on a solution to enable dataset updates and aim to make it available as soon as possible. For now, a quick workaround is to change your dataset name.

cwognum commented 1 month ago

Hey there! I'm going to close this issue because it's no longer related to the original issue that was raised.

But before I do, I wanted to provide some context and hopefully solve your issue in the process @fteufel: We made the decision to implement deletion on the Hub as a soft-delete: You won't see deleted content on the Hub anymore, but the actual files still exist. That way we could restore deleted artifacts if needed.

This does imply, however, that you cannot create a dataset with the same name as a previously deleted artifact. In an effort to unblock you quickly, @zhu0619 manually hard-deleted the artifact from our database (solving the "already exists" error), but did not yet delete the associated files that had already been uploaded (causing the 405 error above).
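
To illustrate the idea with a simplified sketch (the record structure and helper names here are illustrative, not our actual schema):

from datetime import datetime, timezone

# Soft delete: mark the record instead of removing it, so the artifact
# and its files can be restored later if needed.
def soft_delete(record: dict) -> None:
    record["deleted_at"] = datetime.now(timezone.utc)

# The uniqueness check still sees soft-deleted records, which is why a
# new dataset cannot reuse the name of a previously deleted one.
def slug_is_taken(records: list[dict], slug: str) -> bool:
    return any(record["slug"] == slug for record in records)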

So where do we go next?

  1. For cases like yours where an upload only partially completes, we want to make it easier to retry uploading the files that failed. I created an issue for this: https://github.com/polaris-hub/polaris/issues/151. Please use that issue to share any thoughts or ideas on how this should(n't) work.
  2. We need to do a better job of clearly communicating that a dataset name is unique (and thus explaining the consequences of deleting an artifact). Any suggestions on where and how you would have expected to find such information?

This situation is an exceptional case. Moving forward, we never want to manually delete content because such a manual process is too error-prone and risks our data integrity. For just this once, however, given that we already deleted the entry from our database, we decided to also manually delete the associated, orphaned files from our storage backend. @fteufel This implies that uploading your dataset should work now!

If it doesn't, however, please reach out over Discord! Given that it's such an exceptional case, that's the better place to get personal support.

fteufel commented 1 month ago

OK, I understand now. It would be great to have that spelled out in the confirmation popup you get when deleting.

But it failed again :(

⠹ Uploading dataset...2024-07-23 09:55:01.849 | INFO     | polaris.hub.client:upload_dataset:602 - Copying Zarr archive to the Hub. This may take a while.
💥 ERROR: Failed to upload dataset. 
Traceback (most recent call last):
  File "/novo/users/fegt/BEND/scripts/upload_polaris_datasets.py", line 243, in <module>
    dataset.upload_to_hub(owner='mlls')
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/dataset/_dataset.py", line 377, in upload_to_hub
    self.client.upload_dataset(self, access=access, owner=owner)
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 603, in upload_dataset
    zarr.copy_store(
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/convenience.py", line 756, in copy_store
    dest[dest_key] = data
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/zarr/storage.py", line 1470, in __setitem__
    self.map[key] = value
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/fsspec/mapping.py", line 175, in __setitem__
    self.fs.pipe_file(key, maybe_convert(value))
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/polarisfs.py", line 223, in pipe_file
    raise PolarisHubError("Could not get signed URL from Polaris Hub.")
polaris.utils.errors.PolarisHubError: Could not get signed URL from Polaris Hub.

cwognum commented 1 month ago

@fteufel I see some data made it to our storage backend. I'm not sure why the upload failed midway through, but my best guess is that your login token expired during the upload or that you hit some timeout. You're helping us stress test the system here. The downside of working mostly with small molecules is that most datasets I'm used to are small! 😅

Lacking a formal retry mechanism, could you try completing the upload manually as follows?

First, refresh your login token:

polaris login --overwrite

Then:

import zarr
from loguru import logger

from polaris.hub.client import PolarisHubClient

# Create the exact same dataset locally again
dataset = ...

with PolarisHubClient() as client:
    # Increase the timeout so a long upload doesn't get cut off
    client.settings.default_timeout = (100, 2000)

    # Open the destination Zarr archive again
    dest = client.open_zarr_file(
        owner="mlls",
        name="bend-chromatin-accessibility",
        path="polarisfs://data.zarr",
        mode="w",
        as_consolidated=False,
    )

    # Copy the files to the destination, skipping any files that already exist. 
    # With this code, you will also see additional print output during the process that may help us debug if it fails again.
    logger.info("Copying Zarr archive to the Hub. This may take a while.")
    zarr.copy_store(
        source=dataset.zarr_root.store.store,
        dest=dest.store,
        log=print,
        if_exists="skip",
    )
fteufel commented 1 month ago

    dest = client.open_zarr_file(
  File "/novo/users/fegt/miniconda3/envs/bend/lib/python3.10/site-packages/polaris/hub/client.py", line 333, in open_zarr_file
    raise PolarisHubError("Error opening Zarr store") from e
polaris.utils.errors.PolarisHubError: Error opening Zarr store

But I can also just wait until there is a retry mechanism.

cwognum commented 1 month ago

Thanks! #144 will also add more informative error messages for this case, helping us to investigate.

cwognum commented 1 month ago

I added myself temporarily to the mlls organization and I see the issue! It's because of mode="w". From the Zarr docs:

‘w’ means create (overwrite if exists);

Since we haven't implemented the delete operation on our storage backend, overwriting the archive fails.

This would be fixed by changing it to mode="a" (read/write, create if it doesn't exist).
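
Applied to the retry snippet above, it's a one-line change:

    # Open the destination Zarr archive without overwriting it
    dest = client.open_zarr_file(
        owner="mlls",
        name="bend-chromatin-accessibility",
        path="polarisfs://data.zarr",
        mode="a",  # read/write, create if it doesn't exist (instead of "w")
        as_consolidated=False,
    )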