uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License

Speed issue #47

Closed esantamariavazquez closed 2 years ago

esantamariavazquez commented 2 years ago

Hi guys!

I just came across deepdish and I really love it. Thank you very much for this great work!

However, I've noticed a speed problem compared to h5py. Here is a very simple piece of code that shows it:

import numpy as np
import deepdish as dd
import h5py
import time

# Create some random data
data = np.random.rand(100000, 128, 8)

# Deepdish
start_dd = time.time()
dd.io.save('test.h5', {'data': data})
finish_dd = time.time()

# H5py
start_h5py = time.time()
hf = h5py.File('test2.h5', 'w')
hf.create_dataset('data', data=data)
hf.close()
finish_h5py = time.time()

print('Time deepdish = %.2f' % (finish_dd - start_dd))
print('Time h5py = %.2f' % (finish_h5py - start_h5py))

On my computer:

I know that the strong point of deepdish is its ability to save complex data (dicts and so on), and perhaps performance is not the key goal here. Still, I think it would be great to achieve similar speed, especially when no complex data is involved.

What do you think? Could it be done with some optimization?

Cheers, Eduardo

esantamariavazquez commented 2 years ago

I just discovered that I can speed up deepdish by disabling compression:

dd.io.save('test.h5', {'data': data}, None)

Sorry for the inconvenience, issue closed!

gustavla commented 2 years ago

Glad you discovered the solution! Yes, the compression used by default is quite slow. In <=3.2.0, the default was the much faster Blosc (CHANGELOG.rst). As the changelog suggests, it had some interoperability issues. It was a key requirement that files saved can (by default) be loaded across any combination of Python 2/3 and Mac/Linux/Windows (e.g. saved with Python 3 on Mac, loaded with Python 2 on Linux). I don't remember exactly which combinations Blosc failed at, but I do remember that it did. It's too bad, because it is much faster. If interoperability is not a concern for you, Blosc is an excellent choice.

More info here: https://www.pytables.org/usersguide/optimization.html