jnwatson / py-lmdb

Universal Python binding for the LMDB 'Lightning' Database
http://lmdb.readthedocs.io/

Duplicated keys fail with append=True but work fine with append=False #279

Closed rickbeeloo closed 3 years ago

rickbeeloo commented 3 years ago

Affected Operating Systems

Affected py-lmdb Version

py-lmdb Installation Method

Describe Your Problem

Short Description

Actually, I was chasing a different problem, but while building a reproducible example I came across another confusing thing. When I create an array, sort it, and then insert the values using append=True, only the first occurrence of each key gets inserted correctly; when I change it to append=False it works fine. So this looks similar to #235. Or am I missing something here?

We really require append=True due to speed issues (millions of inserts)

Code

import lmdb
import numpy as np
np.random.seed(1)

# Generate random keys and values
arr  = np.random.randint(10, 15, 8)
vals = np.random.randint(0, 100000, 8)

# Sort the keys, keeping the values aligned with them
sort_idx = np.argsort(arr)
arr = arr[sort_idx]
vals = vals[sort_idx]

# Set up a dupsort database with integer keys and integer duplicates
env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(dupsort=True, integerkey=True, integerdup=True)

# integerkey compares keys as native-order integers,
# hence 'little' on little-endian machines
pack_item = lambda x: x.to_bytes(4, 'little')
with env.begin(db=db, write=True) as t:
    for i in range(len(arr)):
        print(arr[i])
        key = pack_item(int(arr[i]))
        val = pack_item(int(vals[i]))
        print(t.put(key, val, append=True))

This would give the following output:

10
True
10
False
10
False
11
True
11
False
13
True
13
False
14
True

When I set append=False:

10
True
10
True
10
True
11
True
11
True
13
True
13
True
14
True

Describe What You Expected To Happen

That put with append=True would work fine on already-sorted keys

Describe What Happened Instead

See above

jnwatson commented 3 years ago

It might not make sense, but at least it is documented as such:

Appending a key that is not greater than the highest existing key will fail and return False.

rickbeeloo commented 3 years ago

I did read that; that's why I was confused about #235, where you stated:

The problem is that for append=True and tuples with duplicate keys, one must pass a separate flag called MDB_APPENDDUP

and

I'll fix it. I may have to add a new argument to all the put methods.

That implies it should be possible? If not, I don't understand what was fixed in https://github.com/jnwatson/py-lmdb/commit/7c646047b1e96b51d945d19cf976f6750dc734a4 ?

jnwatson commented 3 years ago

You raise a good point. I fixed putmulti but not put. I'll fix it.

jnwatson commented 3 years ago

Upon further analysis, there's still an issue with your test: you must sort the vals together with your array. Here's an example that uses putmulti to get around the issue with put. If you are optimizing for performance, you should probably use putmulti (with more than one pair at a time) anyway.

import lmdb
import numpy as np
np.random.seed(1)

arr  = np.random.randint(10, 15, 8)
vals = np.random.randint(0, 100000, 8)

sort_idx = np.argsort(arr)
arr = arr[sort_idx]
vals = vals[sort_idx]

# Sort keys and values together so each value stays attached to its key
# (duplicates must also arrive in order when appending)
both = sorted(zip(arr, vals))

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

# integerkey expects native byte order: 'little' on most machines
pack_item = lambda x: x.to_bytes(4, 'little')
with env.begin(db=db, write=True) as t:
    for tup in both:
        print(tup)
        key = pack_item(int(tup[0]))
        val = pack_item(int(tup[1]))
        with t.cursor() as curs:
            print(curs.putmulti([(key, val)], append=True))
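
Why this works, per the #235 discussion quoted above: appending duplicates into a dupsort database needs LMDB's separate MDB_APPENDDUP flag, and the earlier fix wired that flag through putmulti but not put, so routing even single pairs through putmulti sidesteps the put limitation for now.
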
rickbeeloo commented 3 years ago

1.

you should probably use putmulti (with more than one pair at a time) anyways if you are optimizing for performance.

For the insertion performance, yes. However, in our case arr and vals are enormous (>200 million items), so sorted(zip(arr, vals)) would at some point hold not only arr and vals in RAM but also the materialized list of pairs, using >2x the memory of inserting directly with put (around 2.5% RAM vs 10% in our test). So it would be awesome if this were fixed for put too? ;)

2. How can I properly use get here? This is probably a stupid question... but I can't figure out how to retrieve the value for a key here:

env = lmdb.open('test.lmdb')
with env.begin() as txn:
    print(txn.get(pack_item(10)))

This returns None; perhaps I should specify the db somewhere?

3. Do we need to open the cursor over and over again? I see that with t.cursor() as curs: is called inside the for loop; is this required? In our case we would then open the cursor millions of times; wouldn't that introduce a lot of overhead?

jnwatson commented 3 years ago

I will definitely fix the put. I'm just trying to provide a workaround. You can structure the array so you're keeping the keys and vals together. Inserting 10000 key/vals at a time might provide a 100x perf increase.
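
For illustration, a minimal sketch of that batching pattern; the batches helper, the 10,000 batch size, the lexsort step, and the 4-byte packing are illustrative assumptions, not from the original comment:

import lmdb
import numpy as np
np.random.seed(1)

arr  = np.random.randint(10, 15, 8)    # stand-in for the huge key array
vals = np.random.randint(0, 100000, 8)

# Sort by key, then by value within equal keys (appended duplicates must
# also be in order), without materializing a zipped list of all pairs
order = np.lexsort((vals, arr))
arr, vals = arr[order], vals[order]

def batches(keys, values, size=10_000):
    # Yield one chunk of packed (key, value) pairs at a time
    for start in range(0, len(keys), size):
        yield [(int(k).to_bytes(4, 'little'), int(v).to_bytes(4, 'little'))
               for k, v in zip(keys[start:start + size], values[start:start + size])]

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

with env.begin(db=db, write=True) as t:
    with t.cursor() as curs:                   # one cursor reused across batches
        for batch in batches(arr, vals):
            curs.putmulti(batch, append=True)  # per-pair loop runs in native code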

See how I open the txn above with env.begin(db=db, write=True). You must specify the database (essentially, a separate table) when you open the transaction.
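
For instance, a minimal sketch of the lookup with the database handle passed to the transaction (the db name and the 4-byte little-endian packing are carried over from the earlier example):

import lmdb

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

with env.begin(db=db) as txn:                  # read-only txn bound to the named db
    val = txn.get((10).to_bytes(4, 'little'))  # returns None if the key is absent
    if val is not None:
        print(int.from_bytes(val, 'little'))   # first duplicate stored under key 10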

rickbeeloo commented 3 years ago

Integer insert does not work with to_bytes(5, 'little'). I'm checking this against our real dataset and noticed that the code above only works with pack_item = lambda x: x.to_bytes(4, 'little'), not with pack_item = lambda x: x.to_bytes(5, 'little'), which we need because our integers are bigger. Notably, this apparently was also an issue in https://github.com/jnwatson/py-lmdb/issues/280. What could be the problem here?

It does seem to work when we convert the integer to a string and use that as the key, i.e. changing this line in the code above: key = str(tup[0]).encode('utf-8')

jnwatson commented 3 years ago

All entries in that database must be either 4 or 8 bytes in length.
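
So, assuming integers that don't fit in 4 bytes, a minimal sketch would widen the packing to the 8-byte (size_t) width rather than 5:

# Hypothetical adjustment: pack as 8-byte native-order integers;
# 5-byte entries are rejected because integerkey/integerdup accept only 4 or 8 bytes
pack_item = lambda x: int(x).to_bytes(8, 'little')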

rickbeeloo commented 3 years ago

You closed this issue (again), but I don't think put is fixed yet?

Aaah thanks! I was not aware of that; is it documented somewhere?

jnwatson commented 3 years ago

integerkey: If True, indicates keys in the database are C unsigned or size_t integers encoded in native byte order. Keys must all be either unsigned or size_t, they cannot be mixed in a single database.
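
As an illustration of that native-order encoding, a minimal sketch (reusing the database layout from the examples above and assuming a little-endian machine) that reads back every duplicate stored under one key:

import lmdb

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

with env.begin(db=db) as txn:
    with txn.cursor() as curs:
        # Keys compare as native-order unsigned ints, so encode the probe the same way
        if curs.set_key((10).to_bytes(4, 'little')):
            for val in curs.iternext_dup():      # walk all duplicates under key 10
                print(int.from_bytes(val, 'little'))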