hdmf-dev / hdmf-zarr

Zarr I/O backend for HDMF
https://hdmf-zarr.readthedocs.io/
Other
7 stars 6 forks source link

Writing a name with colon `:` fails in zarr in windows #219

Open h-mayorquin opened 1 month ago

h-mayorquin commented 1 month ago

This produces an error on windows (see the name of the TimeSeries)

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries

nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name: name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

from hdmf_zarr.nwb import NWBZarrIO
nwbfile_path = "test.nwb"

with NWBZarrIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)

Error trace:

Click here for error trace { "name": "KeyError", "message": "'acquisition/Name: name/.zgroup'", "stack": "--------------------------------------------------------------------------- NotADirectoryError Traceback (most recent call last) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\storage.py:1136, in DirectoryStore.__setitem__(self, key, value) 1135 try: -> 1136 os.makedirs(dir_path) 1137 except OSError as e: File :225, in makedirs(name, mode, exist_ok) NotADirectoryError: [WinError 267] The directory name is invalid: 'c:\\\\Users\\\\heberto\\\\development\\\ euroconv\\\\test.nwb\\\\acquisition/Name: name' During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) Cell In[5], line 13 10 nwbfile_path = \"test.nwb\" 12 with NWBZarrIO(path=nwbfile_path, mode=\"w\") as io: ---> 13 io.write(nwbfile) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\utils.py:668, in docval..dec..func_call(*args, **kwargs) 666 def func_call(*args, **kwargs): 667 pargs = _check_args(args, kwargs) --> 668 return func(args[0], **pargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf_zarr\\backend.py:277, in ZarrIO.write(self, **kwargs) 267 cache_spec, number_of_jobs, max_threads_per_process, multiprocessing_context = popargs( 268 \"cache_spec\", \"number_of_jobs\", \"max_threads_per_process\", \"multiprocessing_context\", kwargs 269 ) 271 self.__dci_queue = ZarrIODataChunkIteratorQueue( 272 number_of_jobs=number_of_jobs, 273 max_threads_per_process=max_threads_per_process, 274 multiprocessing_context=multiprocessing_context, 275 ) --> 277 super(ZarrIO, self).write(**kwargs) 278 if cache_spec: 279 self.__cache_spec() File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\utils.py:668, in docval..dec..func_call(*args, **kwargs) 666 def func_call(*args, **kwargs): 667 pargs = _check_args(args, kwargs) --> 668 return func(args[0], **pargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\backends\\io.py:99, in HDMFIO.write(self, **kwargs) 97 \"\"\"Write a container to the IO source.\"\"\" 98 f_builder = self.__manager.build(container, source=self.__source, root=True) ---> 99 self.write_builder(f_builder, **kwargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\utils.py:668, in docval..dec..func_call(*args, **kwargs) 666 def func_call(*args, **kwargs): 667 pargs = _check_args(args, kwargs) --> 668 return func(args[0], **pargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf_zarr\\backend.py:440, in ZarrIO.write_builder(self, **kwargs) 436 f_builder, link_data, exhaust_dci, export_source, consolidate_metadata = getargs( 437 'builder', 'link_data', 'exhaust_dci', 'export_source', 'consolidate_metadata', kwargs 438 ) 439 for name, gbldr in f_builder.groups.items(): --> 440 self.write_group( 441 parent=self.__file, 442 builder=gbldr, 443 link_data=link_data, 444 exhaust_dci=exhaust_dci, 445 export_source=export_source, 446 ) 447 for name, dbldr in f_builder.datasets.items(): 448 self.write_dataset( 449 parent=self.__file, 450 builder=dbldr, (...) 453 export_source=export_source, 454 ) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\utils.py:668, in docval..dec..func_call(*args, **kwargs) 666 def func_call(*args, **kwargs): 667 pargs = _check_args(args, kwargs) --> 668 return func(args[0], **pargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf_zarr\\backend.py:527, in ZarrIO.write_group(self, **kwargs) 525 if subgroups: 526 for subgroup_name, sub_builder in subgroups.items(): --> 527 self.write_group( 528 parent=group, 529 builder=sub_builder, 530 link_data=link_data, 531 exhaust_dci=exhaust_dci, 532 ) 534 datasets = builder.datasets 535 if datasets: File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf\\utils.py:668, in docval..dec..func_call(*args, **kwargs) 666 def func_call(*args, **kwargs): 667 pargs = _check_args(args, kwargs) --> 668 return func(args[0], **pargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\hdmf_zarr\\backend.py:522, in ZarrIO.write_group(self, **kwargs) 520 group = parent[builder.name] 521 else: --> 522 group = parent.require_group(builder.name) 524 subgroups = builder.groups 525 if subgroups: File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\hierarchy.py:1025, in Group.require_group(self, name, overwrite) 1000 def require_group(self, name, overwrite=False): 1001 \"\"\"Obtain a sub-group, creating one if it doesn't exist. 1002 1003 Parameters (...) 1022 1023 \"\"\" -> 1025 return self._write_op(self._require_group_nosync, name, overwrite=overwrite) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\hierarchy.py:952, in Group._write_op(self, f, *args, **kwargs) 949 lock = self._synchronizer[group_meta_key] 951 with lock: --> 952 return f(*args, **kwargs) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\hierarchy.py:1032, in Group._require_group_nosync(self, name, overwrite) 1030 # create terminal group if necessary 1031 if not contains_group(self._store, path): -> 1032 init_group( 1033 store=self._store, path=path, chunk_store=self._chunk_store, overwrite=overwrite 1034 ) 1036 return Group( 1037 self._store, 1038 path=path, (...) 1043 zarr_version=self._version, 1044 ) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\storage.py:677, in init_group(store, overwrite, path, chunk_store) 674 store[\"zarr.json\"] = store._metadata_class.encode_hierarchy_metadata(None) # type: ignore 676 # initialise metadata --> 677 _init_group_metadata(store=store, overwrite=overwrite, path=path, chunk_store=chunk_store) 679 if store_version == 3: 680 # TODO: Should initializing a v3 group also create a corresponding 681 # empty folder under data/root/? I think probably not until there 682 # is actual data written there. 683 pass File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\storage.py:738, in _init_group_metadata(store, overwrite, path, chunk_store) 736 key = _prefix_to_group_key(store, _path_to_prefix(path)) 737 if hasattr(store, \"_metadata_class\"): --> 738 store[key] = store._metadata_class.encode_group_metadata(meta) 739 else: 740 store[key] = encode_group_metadata(meta) File c:\\Users\\heberto\\miniconda3\\envs\ euro\\Lib\\site-packages\\zarr\\storage.py:1139, in DirectoryStore.__setitem__(self, key, value) 1137 except OSError as e: 1138 if e.errno != errno.EEXIST: -> 1139 raise KeyError(key) 1141 # write to temporary file 1142 # note we're not using tempfile.NamedTemporaryFile to avoid restrictive file permissions 1143 temp_name = file_name + \".\" + uuid.uuid4().hex + \".partial\" KeyError: 'acquisition/Name: name/.zgroup'" }

Whereas this works prefectly:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries

nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

from hdmf_zarr.nwb import NWBZarrIO
nwbfile_path = "test.nwb"

with NWBZarrIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)
oruebel commented 1 month ago

Thanks for reporting this issue. Unfortunately I don't think there is much we can do about this. ":" is not allowed in file names on Windows, and since Zarr translates the TimeSeries to a folder on disk, the name must be a valid name.

h-mayorquin commented 1 month ago

Perhaps we can make user experience better?

A similar error is hdf5 limitation about slashes in names:

from pynwb.testing.mock.file import mock_NWBFile
from pynwb.image import TimeSeries

nwbfile = mock_NWBFile()
time_series = TimeSeries(name="Name/name", rate=1.0, data=[1, 2, 3], unit="unit")
nwbfile.add_acquisition(time_series)

nwbfile_path = "test.nwb"

from pynwb import HDMFIO

with HDMFIO(path=nwbfile_path, mode="w") as io:
    io.write(nwbfile)

Which outputs a value error:

ValueError: name 'Name/name' cannot contain '/'

Perhaps we could have something similar?

More deeply this seems like a leaky abstraction where an implementation detail (the naming scheme in the backend) is leaked to something that the user can control directly. It is an interesting coupling.

oruebel commented 1 month ago

Perhaps we can make user experience better?

A similar error is hdf5 limitation about slashes in names:

Adding an error check in ZarrIO to improve reporting sounds reasonable. We would need to add a check for both groups and datasets, so we'd probably need a helper function to check if a path-name is permitted.

More deeply this seems like a leaky abstraction where an implementation detail (the naming scheme in the backend) is leaked to something that the user can control directly. It is an interesting coupling.

There are certain limitations of storage backends that are tricky to work around. One approach could be to have docval validate names when creating new objects (i.e., before we call write) to prevent creation of bad names.

h-mayorquin commented 1 month ago

Adding an error check in ZarrIO to improve reporting sounds reasonable. We would need to add a check for both groups and datasets, so we'd probably need a helper function to check if a path-name is permitted.

Make sense.

Thinking about this, this raises other questions: 1) Should we disallow the use of ":" in names for other OSes as well (not just windows)? Currently, you can build a zarr file in Linux that can't be moved to Windows, right? that seems like undesirable. 2) Should it be disallowed for other backends? Otherwise, you can build an NWBFile with hdf5 backend that can't be repacked to zarr.

oruebel commented 1 month ago

Thinking about this, this raises other questions:

1. Should we disallow the use of ":" in names for other OSes as well (not just windows)? Currently,  you can build a zarr file in Linux that can't be moved to Windows, right? that seems like undesirable.

2. Should it be disallowed for other backends? Otherwise, you can build an NWBFile with hdf5 backend that can't be repacked to zarr.

@rly @bendichter thoughts on this?

bendichter commented 1 month ago

I think for 1 yes, we should prohibit ":" on other operating systems, as I would argue cross-OS compatibility is a core feature of NWB.

For 2, I think maybe not. I don't think interoperability between backends is necessarily a core feature, though perhaps we should put this as an NWB Best Practice.

there's another question that comes up for me:

  1. Should we disallow extensions that have ":" in the name? I think the answer is that ":" should be disallowed from pinned names in extensions, because I think we would like these extensions to be compatible with all the supported backends
rly commented 1 month ago

I agree with @bendichter on 1 and 3 and am ambivalent on 2. Allowing but discouraging people from putting ":" in the name feels a little in conflict with preventing people from requiring the name to have a ":". I understand the reason.

A simpler rule is to just not allow ":" in the name for all backends moving forward. I would say most, if not all, people do not care much about the names of their NWB objects.

h-mayorquin commented 1 month ago

I think that one advantage of disallowing :, / in general is that it would be easier to maintain and communicate. Just a plain rule instead of diffs and code spread across the two backends.

bendichter commented 1 month ago

My thoughts on (2): I'm thinking of a use-case where the user wants to write HDF5-based NWB files. They want to use object names that reflect their source data structure in some specific way that includes the : character. Then we say they can't and they ask why and we say "because it won't work for Zarr." They might be frustrated by that answer because they don't care about using Zarr anyway. I would consider cross-language and cross-OS support to be core features of NWB, so it makes sense to constrain for those, but cross-backend support doesn't seem as essential.

On the other hand, we may down the line decide it's better to store all NWB data on DANDI as Zarr objects, and it might then be much nicer if we already had constraints in place that allowed us to do that more easily.

I see some advantages and disadvantages. I tend to err on the side of freedom, since you can never fully anticipate every use-case and you are potentially causing a big headache for someone by making constraints you don't need to.