hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 7 forks source link

[Bug]: Cannot overwrite existing HDF5 .nwb file with mode="w" #117

Open CodyCBakerPhD opened 11 months ago

CodyCBakerPhD commented 11 months ago

What happened?

Ran into this during testing on NeuroConv

Steps to Reproduce

from pynwb import NWBHDF5IO
from hdmf_zarr import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.base import mock_TimeSeries

with NWBHDF5IO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:
    nwbfile = mock_NWBFile()
    nwbfile.add_acquisition(mock_TimeSeries())
    io.write(nwbfile)

with NWBZarrIO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:
    nwbfile = mock_NWBFile()
    nwbfile.add_acquisition(mock_TimeSeries())
    io.write(nwbfile)

Traceback

Traceback (most recent call last):

  Cell In[8], line 1
    with NWBZarrIO(path="C:/Users/Raven/Downloads/test_hdf5.nwb", mode="w") as io:

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\nwb.py:52 in __init__
    super(NWBZarrIO, self).__init__(path,

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\backend.py:124 in __init__
    super().__init__(manager, source=source_path)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\utils.py:644 in func_call
    return func(args[0], **pargs)

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf\backends\io.py:41 in __init__
    self.open()

  File C:\Anaconda3\envs\neuroconv_3_10\lib\site-packages\hdmf_zarr\backend.py:156 in open
    self.__file = zarr.open(store=self.path,

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\convenience.py:102 in open
    _store: BaseStore = normalize_store_arg(

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:197 in normalize_store_arg
    return normalize_store(store, storage_options, mode)

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:177 in _normalize_store_arg_v2
    return DirectoryStore(store)

  File ~\AppData\Roaming\Python\Python310\site-packages\zarr\storage.py:1062 in __init__
    raise FSPathExistNotDir(path)

FSPathExistNotDir: path exists but is not a directory: %r

Operating System

Windows

Python Executable

Conda

Python Version

3.10

Package Versions

No response

Code of Conduct

oruebel commented 11 months ago

Thanks for issue and helpful example. Using the same file with different backends (here creating test_hdf5.nwb with HDF5 and then overwriting with Zarr) is not possible in most cases. I think the best we can do here is to check if the existing file is readable with the I/O backend and raise a ValueError to indicate that the existing file uses a different backend. What do you think?

CodyCBakerPhD commented 11 months ago

I don't plan on using the same file with a different backend; the mode="w" to me, indicates the behavior 'delete if existing' which shouldn't matter if it is one backend or another

CodyCBakerPhD commented 11 months ago

Example: I tried writing my NWB file using DANDI filename convention ./sub-{subject_id}_ses-{session_id}.nwb first using the HDF5 backend, then decided I wanted to switch to Zarr

Slightly annoying if I have to manually delete the file file if it exists before writing it again; I've always utilized the auto-overwrite of mode="w" to achieve this

oruebel commented 11 months ago

the mode="w" to me, indicates the behavior 'delete if existing' which shouldn't matter if it is one backend or another

In this case we would need to manually delete the file if it from a different backend before opening it. I.e., the logic would need to be something like

if self.mode == 'w':
  if os.path.exists(filename) and not self.can_read(filename):
      if os.path.isdir(filename):
          shutil.rmtree(filename)
      else:
         os.remove(filename)

This would probably need to be done in all backends (HDF5 and Zarr). Since this is deleting files that are not from the particular backend, I think it would be best to make this configurable, e.g., by specifing and additional parameter, e.g., force_overwrite=True, to make sure that this is something the user explicitly decides to do and avoid users accidentally deleting files.

rly commented 11 months ago

This is ultimately an issue with Zarr (and h5py) -- both libraries raise an error and do not overwrite if the filepath is of a different type than expected (see below). But we can improve on that user experience. I think @oruebel 's suggested fix above is good. By the way, the TAB is thinking of recommending using the ".nwb.zarr" suffix for NWB Zarr files. Alessio/AIND is already doing that.

When Zarr tries to write to a directory that exists as a non-zarr directory (OK):

>>> import zarr
>>> import os
>>> os.mkdir("temp")
>>> z1 = zarr.open("temp", mode="w", shape=(2, 2), dtype="i4")
>>> z1[:] = 42  # creates zarr structure with data

When Zarr tries to write to a directory that exists as a file (error):

>>> import zarr
>>> with open("temp.txt", "w"):
...     pass
... 
>>> zarr.open("temp.txt", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/convenience.py", line 96, in open
    _store: BaseStore = normalize_store_arg(
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 181, in normalize_store_arg
    return normalize_store(store, storage_options, mode)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 163, in _normalize_store_arg_v2
    return DirectoryStore(store)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/zarr/storage.py", line 1036, in __init__
    raise FSPathExistNotDir(path)
zarr.errors.FSPathExistNotDir: path exists but is not a directory: %r

When h5py tries to write to a file that exists as a non-hdf5 file (OK):

>>> import h5py
>>> import os
>>> os.mkdir("temp")
>>> with open("temp.txt", "w"):
...     pass
... 
>>> with h5py.File("temp.txt", "w"):  # creates empty hdf5 file
...     pass
... 

When h5py tries to write to a file that exists as a directory (error):

>>> import h5py
>>> import os
>>> os.mkdir("temp")
>>> h5py.File("temp", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/Users/rly/mambaforge/envs/hdmfx/lib/python3.10/site-packages/h5py/_hl/files.py", line 237, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 126, in h5py.h5f.create
IsADirectoryError: [Errno 21] Unable to create file (unable to open file: name = 'temp', errno = 21, error message = 'Is a directory', flags = 13, o_flags = 602)