jnwatson / py-lmdb

Universal Python binding for the LMDB 'Lightning' Database
http://lmdb.readthedocs.io/

Speed benchmark of lmdb reads? #274

Closed 25thbamofthetower closed 3 years ago

25thbamofthetower commented 3 years ago

Affected Operating Systems: Linux

Affected py-lmdb Version: 0.98

py-lmdb Installation Method: pipenv install lmdb

Other important machine info: Running in a Docker container

Describe Your Problem

I was wondering if there are read (and write) benchmarks that you have and expect to be representative of the Python wrapper? I tried the following code: for 41k words it took about 8 seconds, and for 600k words it took almost a minute. Perhaps I'm not reading from LMDB in the most efficient way?

# lmdb files set to dupsort; values are json,
# e.g. b'foo': [{...}, {...}, ...]
words = [...]
start = time.time()
for word in words:
    for txn in [...]:
        if txn.get(word.encode('utf-8')) is not None:
            cursor = txn.cursor()
            cursor.set_key(word.encode('utf-8'))
            try:
                values = [json.loads(v) for v in cursor.iternext_dup()]
            except Exception:
                pass
            cursor.close()

print(f'Total time {time.time() - start}')
ddorian commented 3 years ago

@25thbamofthetower can you profile the app and see where it's slow? Can you try using fewer transactions?
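
A minimal sketch of what that profiling could look like (lookup_words() here is a hypothetical wrapper around the loop from the original post, not an existing function):

import cProfile
import pstats

# Profile the lookup loop to see whether the time goes to the lmdb reads,
# the utf-8 encoding, or the json decoding.
cProfile.run('lookup_words(words)', 'lookup.prof')
pstats.Stats('lookup.prof').sort_stats('cumulative').print_stats(20)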

25thbamofthetower commented 3 years ago

@25thbamofthetower can you profile the app and see where it's slow? Can you try using fewer transactions?

I can't use fewer transactions because I have multiple LMDB files that I need to look up the word in; each transaction corresponds to one LMDB file. I used that small block of code for the timing test mentioned in the original post.

vEpiphyte commented 3 years ago

You're also measuring string encoding and JSON decoding.

Encoding your keys into binary up front would avoid the overhead of re-encoding them inside the loop multiple times:

bwords = [word.encode() for word in words]

The json.loads() call is also going to be highly variable from a benchmarking perspective, since it's unknown what your actual input is.
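
As a rough sketch of that suggestion (words and txns below stand in for the word list and the per-file read transactions that were elided in the original post), you could encode the keys once and keep the JSON decoding outside the timed section:

import json
import time

# words: the word list from the original post; txns: one read transaction
# per lmdb file (both are placeholders, assumed to exist already)
bwords = [word.encode('utf-8') for word in words]

start = time.time()
raw = []
for txn in txns:
    cursor = txn.cursor()
    for bword in bwords:
        # set_key() returns False when the key is absent, so the separate
        # txn.get() probe in the original loop isn't needed
        if cursor.set_key(bword):
            raw.extend(cursor.iternext_dup())
    cursor.close()
print(f'lmdb read time {time.time() - start}')

# decode outside the timed section so the benchmark measures only the reads
values = [json.loads(v) for v in raw]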