hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0

Using zspy with database-format? #249

Open magnunor opened 8 months ago

magnunor commented 8 months ago

The go-to file format for saving large files in HyperSpy is currently .zspy. It uses the Zarr library, which (by default) saves the individual chunks in a dataset as individual files, via zarr.NestedDirectoryStore. Since the data is stored in individual files, Python can both write and read the data in parallel. This makes it much faster than, for example, HDF5 files (.hspy).
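
To illustrate what the default store does, here is a minimal sketch using zarr directly (the file name and array sizes are just examples):

import numpy as np
import zarr

# Each 10x10 chunk of the array ends up as its own file on disk,
# nested in directories (one level per axis)
store = zarr.NestedDirectoryStore("example.zarr")
z = zarr.zeros((100, 100), chunks=(10, 10), store=store, dtype="f8")
z[:] = np.random.random((100, 100))

# 10x10 = 100 chunk files, plus the .zarray metadata file
print(sum(1 for _ in store.keys()))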

However, one large downside with this way of storing the data is that one can end up with several thousand individual files nested within a large number of folders. Sharing this with other people directly is tricky. While it is possible to zip the data, the default zip reader/writer in Windows seems to struggle if the number of files becomes too large. In addition, it is tedious if the receiver has to uncompress the data before they can visualize it.

Zarr has support for several database formats, some of which can handle parallel reading and/or writing. With these, it should be possible to keep the parallel read/write while ending up with only one or two files.

I am not at all familiar with these types of database formats, so I wanted to see how they performed, and whether they could be useful for working on and sharing large multidimensional datasets.

File saving

Making the dataset

import dask.array as da
import hyperspy.api as hs

# 400 x 400 x 200 x 200 of float64 is ~51 GB nominally, but most of it
# is zeros, so the compressed chunks on disk stay much smaller
dask_data = da.zeros(shape=(400, 400, 200, 200), chunks=(50, 50, 50, 50))
dask_data[:, :, 80:120, 80:120] = da.random.random((400, 400, 40, 40))
s = hs.signals.Signal2D(dask_data).as_lazy()

Saving the datasets:

import zarr

########################## LMDB database (needs the optional lmdb package)
store = zarr.LMDBStore('001_test_save_lmdb.zspy')
s.save(store)
store.close()  # flush pending writes to the database

########################## the default chunk-per-file layout, for comparison
store = zarr.NestedDirectoryStore('001_test_save_nested_dir.zspy')
s.save(store)

########################## SQLite: a single-file database (stdlib sqlite3)
store = zarr.SQLiteStore('001_test_save_sqldb.zspy')
s.save(store)
store.close()  # commit and close the database connection

File loading

Then loading the same datasets:

from time import time
import zarr
import hyperspy.api as hs

Note: run these separately, since the files are pretty large.

t0 = time()
store = zarr.LMDBStore("001_test_save_lmdb.zspy")
s = hs.load(store)  # consider lazy=True to avoid reading the whole array into memory
print("LMDB {0}".format(time() - t0))

t0 = time()
store = zarr.NestedDirectoryStore("001_test_save_nested_dir.zspy")
s = hs.load(store)
print("NestedDirectory {0}".format(time() - t0))

t0 = time()
store = zarr.SQLiteStore('001_test_save_sqldb.zspy')
s = hs.load(store)
print("SQLite {0}".format(time() - t0))

Results:

ericpre commented 8 months ago

Can you be more specific about the issue on Windows? Does it have to do with the number of files per directory, the nested structure of the directories, or the specific software being used on Windows? I had issues with path length on Windows, but this can be fixed easily with a Windows setting.

I usually copy the folder without zipping, and it works fine when synchronising using Dropbox, OneDrive, Nextcloud, etc. What are you using to share the data?

CSSFrancis commented 8 months ago

@magnunor Another thing to consider is whether Windows is trying to compress the data even further. I think on Linux systems the zip tooling checks whether the underlying data is already compressed and won't "double" compress it, but it's quite possible that Windows doesn't handle that case nearly as well.

CSSFrancis commented 8 months ago

Something that I've been meaning to try is using an S3-like file system with the FSStore class. People seem to really like that for partial reads over a network, which might be of interest here.
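
A minimal sketch of that idea, assuming the s3fs package is installed, that the bucket name (made up here) is publicly readable, and that HyperSpy accepts this store like the ones above:

import hyperspy.api as hs
import zarr

# fsspec-backed store pointing at S3-style object storage; keyword
# arguments are forwarded to s3fs (anon=True assumes a public bucket)
store = zarr.storage.FSStore("s3://some-bucket/001_test.zspy", anon=True)

# lazy load: only the chunks that are actually accessed get fetched
s = hs.load(store, lazy=True)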

Another thing to consider is that the v3 specification includes support for "sharding", which should be quite interesting as well, and I think improves performance on Windows machines.
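
Roughly what that could look like once zarr-python implements the v3 sharding spec (a sketch, untested; zarr.create_array and the shards argument are the zarr-python 3 API, and the shapes are just examples):

import zarr  # zarr-python >= 3

# Chunks stay the unit of reading/decompression, but 2x2x2x2 = 16 of
# them are packed into each shard file, cutting the file count by 16x
z = zarr.create_array(
    store="sharded_example.zarr",
    shape=(400, 400, 200, 200),
    chunks=(50, 50, 50, 50),
    shards=(100, 100, 100, 100),
    dtype="float64",
)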

magnunor commented 6 months ago

> I usually copy the folder without zipping, and it works fine when synchronising using Dropbox, OneDrive, Nextcloud, etc. What are you using to share the data?

Internal sharing is fine, but, for example, Zenodo or our website-based filesender can't handle folder structures (at least not easily).


I tested this a bit more, and the ZipStore seems to perform pretty well:


The code

Saving the data:

from time import time
import zarr
import dask.array as da
import hyperspy.api as hs

dask_data = da.zeros(shape=(400, 400, 200, 200), chunks=(50, 50, 50, 50))
dask_data[:, :, 80:120, 80:120] = da.random.random((400, 400, 40, 40))

s = hs.signals.Signal2D(dask_data).as_lazy()

###########################
t0 = time()
store = zarr.NestedDirectoryStore('001_test_save_nested_dir.zspy')
s.save(store)
print("NestedDirectory store, save-time: {0}".format(time() - t0))

##########################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy')
s.save(store)
store.close()  # finalises the zip archive; harmless if HyperSpy already closed the store
print("ZIP store, save-time: {0}".format(time() - t0))

Loading the data:

from time import time
import zarr
import hyperspy.api as hs

##############################
t0 = time()
store = zarr.NestedDirectoryStore("001_test_save_nested_dir.zspy")
s = hs.load(store)
print("NestedDirectory {0}".format(time() - t0))
"""

##############################
t0 = time()
store = zarr.ZipStore('001_test_save_zipstore.zspy', mode='r')  # read-only; the default mode would open the archive for appending
s = hs.load(store)
print("ZIP {0}".format(time() - t0))