uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License

Very slow when creating an LMDB database #6

Closed. Mottotime closed this issue 8 years ago

Mottotime commented 8 years ago

Hi, I've followed the instructions in "Creating an LMDB database in Python", which is a very helpful post. However, I found that it takes more than 10 minutes to write fewer than 10,000 images to an LMDB file.

The map_size was set to 1 TB.

Is there any way to accelerate the processing?

gustavla commented 8 years ago

I ran some benchmarks and there is one thing you can do to speed up the example. Write multiple entries as part of the same transaction. That is, move with env.begin(write=True) as txn: outside of the for-loop:

with env.begin(write=True) as txn:
    for i in range(N):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = X.shape[1]
        datum.height = X.shape[2]
        datum.width = X.shape[3]
        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
        datum.label = int(y[i])
        str_id = '{:08}'.format(i)

        # The encode is only essential in Python 3
        txn.put(str_id.encode('ascii'), datum.SerializeToString())

Saving 10,000 images (3×224×224 bytes each) took 6 seconds, compared with 110 seconds for the original version. I'm not sure whether it's a good idea to make transactions huge, so you might want to buffer the writes and commit them in batches, one transaction per batch.
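
A minimal sketch of what batched writes could look like, under some assumptions. The helper names batched and write_in_batches and the default batch size of 1000 are hypothetical and were not benchmarked. The only things assumed about env are that env.begin(write=True) returns a context manager with a put(key, value) method and that it commits on a clean exit, as an lmdb environment does.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_in_batches(env, items, batch_size=1000):
    """Write (key, value) byte pairs using one write transaction per batch.

    Assumes env behaves like an lmdb environment: env.begin(write=True)
    returns a context manager that exposes put(key, value) and commits
    when the with-block exits without an exception.
    Returns the number of transactions that were used.
    """
    transactions = 0
    for batch in batched(items, batch_size):
        # One transaction (and so one commit) per batch, not per entry.
        with env.begin(write=True) as txn:
            for key, value in batch:
                txn.put(key, value)
        transactions += 1
    return transactions
```

With the example above, items could be a generator that yields (str_id.encode('ascii'), datum.SerializeToString()) for each i, so only one batch of serialized entries is held in memory at a time. Which batch size works best would have to be measured.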

Thanks for pointing this out, I will update the blog post with this information.

Mottotime commented 8 years ago

Thank you very much.