saeyslab / napari-sparrow


Problem reading SpatialData on Windows #127

Closed julienmortier closed 1 year ago

julienmortier commented 1 year ago

I ran into an issue when trying to read, on Windows, a SpatialData object that was created on the VSC (the problem does not occur on Mac). There seems to be a problem with the parquet file. When I manually remove the points folder, I am able to read in the sdata.

sdata = sd.read_zarr(r'E:\benchmarking_mouse_brain\results\Resolve\0/sdata.zarr')


ArrowInvalid                              Traceback (most recent call last)
File c:\Users\julienm\miniconda3\envs\napari-sparrow\lib\site-packages\dask\backends.py:136, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    135 try:
--> 136     return func(*args, **kwargs)
    137 except Exception as e:

File c:\Users\julienm\miniconda3\envs\napari-sparrow\lib\site-packages\dask\dataframe\io\parquet\core.py:543, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
    541     blocksize = None
--> 543 read_metadata_result = engine.read_metadata(
    544     fs,
    545     paths,
    546     categories=categories,
    547     index=index,
    548     use_nullable_dtypes=use_nullable_dtypes,
    549     dtype_backend=dtype_backend,
    550     gather_statistics=calculate_divisions,
    551     filters=filters,
    552     split_row_groups=split_row_groups,
    553     blocksize=blocksize,
    554     aggregate_files=aggregate_files,
    555     ignore_metadata_file=ignore_metadata_file,
    556     metadata_task_size=metadata_task_size,
    557     parquet_file_extension=parquet_file_extension,
    558     dataset=dataset_options,
...
    141     f"Original Message: {e}"
    142 ) from e

ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend. Original Message: Error creating dataset. Could not read schema from 'E:/benchmarking_mouse_brain/results/Resolve/0/sdata_copy.zarr/points/transcripts/points.parquet/._part.0.parquet': Could not open Parquet input source 'E:/benchmarking_mouse_brain/results/Resolve/0/sdata_copy.zarr/points/transcripts/points.parquet/._part.0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

SilverViking commented 1 year ago

There is no problem with the SpatialData zarr files created on the HPC. Those exact files can also be opened successfully on Windows.

However, what happened is that a Mac was used to copy the zarr files to a Windows-formatted drive. During this copy, macOS saves Mac-specific file metadata that cannot be stored on the target file system (in this case, for the parquet files) into separate AppleDouble files. These files start with ._ but keep the .parquet file extension. When the zarr store is then read, the parquet library encounters these files and, because of the .parquet extension, assumes they are valid parquet files. It then tries to read them and fails, because they are not actually parquet files: they do not contain the parquet magic bytes, but start with the AppleDouble header bytes instead.
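One way to confirm this is to look at the magic bytes directly: valid parquet files start and end with the 4-byte marker PAR1, while AppleDouble files start with the bytes 00 05 16 07. A minimal sketch (not part of sparrow; the store path is assumed to be the one from the traceback above):

import pathlib

# Flag every *.parquet entry under the zarr store that does not carry the parquet magic bytes.
zarr_root = pathlib.Path(r"E:\benchmarking_mouse_brain\results\Resolve\0\sdata.zarr")
for p in zarr_root.rglob("*.parquet"):
    if not p.is_file():
        continue  # e.g. points.parquet itself is a directory
    with open(p, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # last 4 bytes, where the parquet footer magic lives
        tail = f.read(4)
    status = "OK " if head == b"PAR1" and tail == b"PAR1" else "BAD"
    print(status, p)

The ._ companion files show up as BAD here, with an AppleDouble header instead of PAR1.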

In the error described above, the .zarr file has a parquet folder ...\Resolve\0\sdata.zarr\points\transcripts\points.parquet with contents:

._part.0.parquet (4 KB)    <-- resource fork, not actually a parquet file
part.0.parquet (31 MB)     <-- actual parquet file
...

These ._*.parquet files must be deleted in order to be able to read the .zarr on Windows. Or better still, they should not have been created there in the first place.
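A minimal cleanup sketch (my own, not part of sparrow; again assuming the store path from the traceback) that removes every AppleDouble companion inside the store before reading it:

import pathlib

# Delete macOS AppleDouble companion files ("._*") anywhere inside the zarr store.
zarr_root = pathlib.Path(r"E:\benchmarking_mouse_brain\results\Resolve\0\sdata.zarr")
removed = 0
for p in zarr_root.rglob("._*"):
    if p.is_file():
        p.unlink()
        removed += 1
print(f"removed {removed} AppleDouble files from {zarr_root}")

On macOS itself, the bundled dot_clean utility can reportedly strip these ._ files as well, and copying with a tool that does not export extended attributes avoids creating them in the first place.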

See also the AppleDouble documentation for why macOS creates these ._ files.

SilverViking commented 1 year ago

Closing this issue. It's a side effect of some copy operations on macOS, not a bug in sparrow. The problem also does not occur when running sparrow on a single platform, and there is an easy workaround if it does occur.