hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/

[Bug]: `[0.7.0, 0.8.0]` Fails to open file with consolidated metadata from S3 #205

Open bjhardcastle opened 4 months ago

bjhardcastle commented 4 months ago

What happened?

#185 highlights the need for a test for this, so maybe you're already aware of this issue.

The problem stems from self.is_remote() incorrectly returning False. Adding ConsolidatedMetadataStore to the type check here is enough to get it working again: https://github.com/hdmf-dev/hdmf-zarr/blob/afaab5fbca03a0979d9f234e13e43027f545c366/src/hdmf_zarr/backend.py#L185
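
A minimal sketch of what I mean, assuming is_remote() keeps its current isinstance-based check (the exact code at the linked line may differ):

from zarr.storage import ConsolidatedMetadataStore, FSStore

def is_remote(self):
    """Return True if the file is remote, e.g. stored on an S3 bucket."""
    # Treat a ConsolidatedMetadataStore like an FSStore so that remote files
    # opened with consolidated metadata are still recognized as remote.
    if isinstance(self.file.store, (FSStore, ConsolidatedMetadataStore)):
        return True
    else:
        return False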

Steps to Reproduce

import hdmf_zarr

hdmf_zarr.NWBZarrIO(
    path="s3://codeocean-s3datasetsbucket-1u41qdg42ur9/320c7f2a-f145-4a44-8421-2b2fe1d1acbc/nwb/ecephys_712815_2024-05-21_13-01-23_experiment1_recording1.nwb", 
    # not publicly accessible
    mode="r",
).read()

Traceback

Traceback (most recent call last):
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 492, in __open_file_consolidated
    return zarr.open_consolidated(store=store,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\convenience.py", line 
1335, in open_consolidated
    meta_store = ConsolidatedStoreClass(store, metadata_key=metadata_key)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\storage.py", line 2974, in __init__
    meta = json_loads(self.store[metadata_key])
                      ~~~~~~~~~~^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\storage.py", line 1115, in __getitem__
    raise KeyError(key)
KeyError: '.zmetadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\utils.py", line 668, in func_call
    return func(args[0], **pargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\backends\io.py", line 
56, in read
    f_builder = self.read_builder()
                ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf\utils.py", line 668, in func_call
    return func(args[0], **pargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1327, in read_builder
    f_builder = self.__read_group(self.__file, ROOT_NAME)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1392, in __read_group
    sub_builder = self.__read_group(sub_group, sub_name)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1401, in __read_group
    self.__read_links(zarr_obj=zarr_obj, parent=ret)
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 1420, in __read_links
    target_name, target_zarr_obj = self.resolve_ref(link)
                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 730, in resolve_ref
    target_zarr_obj = self.__open_file_consolidated(store=source_file,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\hdmf_zarr\backend.py", line 497, in __open_file_consolidated
    return zarr.open(store=store,
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ben.hardcastle\github\npc_sessions\.venv\Lib\site-packages\zarr\convenience.py", line 
132, in open
    raise PathNotFoundError(path)
zarr.errors.PathNotFoundError: nothing found at path ''

Operating System

Windows

Python Executable

Python

Python Version

3.11

Package Versions

No response

Code of Conduct

mavaylon1 commented 4 months ago

I am able to open a file with consolidated metadata over S3. I'll look into this.

mavaylon1 commented 4 months ago

Opening works, but reading does not for a certain file; however, the error is not the same, so I do not believe it is related.

mavaylon1 commented 3 months ago

@bjhardcastle I'm not able to reproduce the same error. Is it possible for you to create a public example? (a zarr file I can use)

mavaylon1 commented 3 months ago

We have passing tests that read consolidated-metadata files, so I am not sure what you are facing.

alejoe91 commented 3 months ago

@bjhardcastle I think this is due to some export functions on our side being run without write_args={'link_data': False}, as reported in https://github.com/hdmf-dev/hdmf-zarr/issues/179, rather than to a specific problem with reading consolidated data from S3.

On our end, all export functions have now been updated to force link_data: False. Can you try to generate a new NWB file with the pipeline and see if the problem persists? Once a fix in hdmf-zarr is in place, we can drop the write_args.
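
For reference, the export pattern we use is roughly the following (paths are placeholders; this follows the documented hdmf-zarr export usage):

from hdmf_zarr.nwb import NWBZarrIO

src_path = "ecephys_session_raw.nwb.zarr"     # placeholder input path
dst_path = "ecephys_session_export.nwb.zarr"  # placeholder output path

with NWBZarrIO(src_path, mode="r") as read_io:
    with NWBZarrIO(dst_path, mode="w") as export_io:
        # link_data=False forces datasets to be copied into the exported file
        # instead of being written as links back to the source file
        export_io.export(src_io=read_io, write_args=dict(link_data=False))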

bjhardcastle commented 3 months ago

@mavaylon1 Are you talking about these tests? https://github.com/hdmf-dev/hdmf-zarr/commit/2ba6bc5cc83a68b5d5a4ec1e9816cd2913f8e7db

They aren't sufficient:

  1. they don't read the NWB file (as you said yourself, opening it is not actually the problem: reading it is)
  2. the zarr file is empty, so the reading fails for a different reason

Here's an MRE with reading added to your test:

# python -m venv .venv-hdmf-test
# .venv-hdmf-test/scripts/activate (Windows)
# source .venv-hdmf-test/bin/activate (Linux)
# python -m ensurepip
# python -m pip install hdmf-zarr fsspec s3fs

import hdmf_zarr 
import zarr 

s3_path = "https://dandiarchive.s3.amazonaws.com/zarr/ccefbc9f-30e7-4a4c-b044-5b59d300040b/"

with hdmf_zarr.NWBZarrIO(s3_path, mode='r') as read_io:
    read_io.open()
    assert isinstance(read_io.file.store, zarr.storage.ConsolidatedMetadataStore)
    try:
        # this fails:
        nwb = read_io.read()
    except Exception as exc:
        print(repr(exc))
        # ValueError: No data_type found for builder root

with hdmf_zarr.NWBZarrIO(s3_path, mode='-r') as read_io:
    read_io.open()
    assert isinstance(read_io.file.store, zarr.storage.FSStore)
    try:    
        # this fails:
        nwb = read_io.read()
    except Exception as exc:
        print(repr(exc))
        # hdmf.backends.errors.UnsupportedOperation: Cannot build data. There are no values.

# the zarr file is empty:
z = zarr.open(s3_path, mode='r')
assert not tuple(z.keys())

mavaylon1 commented 3 months ago

@bjhardcastle I am talking about our roundtrip tests, such as https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/tests/unit/base_tests_zarrio.py#L331C9-L331C40

In hdmf-zarr our default is to consolidate metadata, so that test does read a file through a zarr.storage.ConsolidatedMetadataStore, and it passes.

Thanks for the code. I will look into this.

bjhardcastle commented 3 months ago

Is that writing to an S3 bucket though?

mavaylon1 commented 3 months ago

@bjhardcastle No, but my point is that consolidated metadata works locally. We don't have a read test, and I couldn't reproduce it without an S3 file.

But you gave one, so this should be enough for me to reproduce the error and fix it. :)

bjhardcastle commented 3 months ago

Yes, it works locally because is_remote() returns False, which is correct.

When the file is remote, is_remote() also returns False, because ConsolidatedMetadataStore is not a subclass of FSStore. This messes up the paths to the underlying data.

The file I gave you is the one from your tests, but it's empty. I'm trying to upload a non-empty Zarr NWB file via the dandi CLI now, but following the docs hasn't been successful yet. If you have any working code to do that, I'd appreciate it.

mavaylon1 commented 3 months ago

I have to reproduce the error to comment on your findings regarding is_remote(). I will see what I can do to help on the DANDI side.

mavaylon1 commented 3 months ago

@bjhardcastle At any point did you export?

bjhardcastle commented 3 months ago

> @bjhardcastle At any point did you export?

Export what?

mavaylon1 commented 3 months ago

> @bjhardcastle At any point did you export?
>
> Export what?

Did you export a file to get to your current file?

bjhardcastle commented 3 months ago

> @bjhardcastle At any point did you export?
>
> Export what?
>
> Did you export a file to get to your current file?

Sorry, I don't know what you mean. Could you be more specific?

bjhardcastle commented 3 months ago

> I have to reproduce the error to comment on your findings regarding is_remote().

Can't you see that the code doesn't make sense? The type of zarr object is not an indicator of whether the data is remote or not.

As a result, lines like this make paths that are completely wrong: https://github.com/hdmf-dev/hdmf-zarr/blob/8ca578733db5078d1ff6d2dfb47407a680c58caf/src/hdmf_zarr/backend.py#L719-L721
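
To illustrate (just a sketch, not hdmf-zarr's actual code): a ConsolidatedMetadataStore only wraps another store, so it's the wrapped store that tells you whether the data is remote:

from zarr.storage import ConsolidatedMetadataStore, FSStore

def store_is_remote(store) -> bool:
    # A ConsolidatedMetadataStore is just cached .zmetadata layered on top of
    # another store; unwrap it and inspect the store that holds the actual data.
    if isinstance(store, ConsolidatedMetadataStore):
        store = store.store
    return isinstance(store, FSStore)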

mavaylon1 commented 3 months ago

That could be the case. I have not taken a deep look at what the issue is. I usually like to reproduce the error so that any changes I make to address that error can be tested.

In any case, I will find some time next week to look at this as I have higher priority issues I've already committed to. Let me know when you figure out uploading your file. I will look into this as well.

Within NWB you can edit files and export them into a new file. If you did not make the file and are unsure, that is fine. It could be helpful to know, and I would suggest finding out if possible.

I invite you to also review our Code of Conduct.

oruebel commented 3 months ago

@bjhardcastle @mavaylon1 @alejoe91 thank you for the discussion to help us better understand the issue. We'll review it in our team meeting next week, make a plan to develop a fix, and get back to you. I appreciate everyone's patience and hope you have a great weekend.

alejoe91 commented 3 months ago

@mavaylon1 if it helps, I got the same error on an NWB file produced by a double export (first adding electrical series, then adding units), even when opening the file locally.

Here's a link with the file so you can download it: https://drive.google.com/drive/folders/1f8_92cXrEvOdJWJvghNVNR_sDjpejmEO?usp=sharing

alejoe91 commented 3 months ago

Another update: switching the units export to an append mechanism seems to produce NWB files without the issue.
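
The append mechanism looks roughly like this (path and unit data are placeholders; this assumes the usual read-modify-write pattern works the same with NWBZarrIO):

from hdmf_zarr.nwb import NWBZarrIO

nwb_path = "ecephys_session.nwb.zarr"  # placeholder path to an existing file

# Read the existing file in append mode, add the Units table in place,
# and write back to the same file instead of exporting to a new one.
with NWBZarrIO(nwb_path, mode="a") as io:
    nwbfile = io.read()
    nwbfile.add_unit(spike_times=[0.1, 0.2, 0.3])  # toy example unit
    io.write(nwbfile)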

To summarize, we currently produce NWB files in multiple steps:

Here are the NWB zarr files produced:

So it seems that the issue is with adding the Units table with export. I hope this helps!

oruebel commented 3 months ago

Thank you @alejoe91 for debugging this issue further and providing these additional details. We'll take a look.

mavaylon1 commented 3 months ago

@bjhardcastle we are actively investigating adjacent issues that may fix your ticket. Our goal is to either have a solution or a clear path to one by the end of next week.

alejoe91 commented 3 months ago

@mavaylon1 any lead on this?

mavaylon1 commented 1 month ago

@alejoe91 @bjhardcastle The issue that is stalling this is our implementation of export. I would find out who generated/exported the files, as the files in general might not be valid.