It might not make sense, but at least it is documented as such:
Appending a key that is not greater than the highest existing key will fail and return False.
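A minimal sketch of that documented behavior (the environment name here is made up):
import lmdb

env = lmdb.open('append-demo.lmdb')
with env.begin(write=True) as txn:
    print(txn.put(b'b', b'1', append=True))  # True: b'b' is the highest key so far
    print(txn.put(b'a', b'2', append=True))  # False: b'a' is not greater than b'b'
    print(txn.put(b'c', b'3', append=True))  # True: keys appended in increasing order succeed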
I indeed read that; that's why I was confused about #235, where you stated:
The problem is that for append=True and tuples with duplicate keys, one must pass a separate flag called MDB_APPENDDUP
and
I'll fix it. I may have to add a new argument to all the put methods.
Implying that this did seem possible? If not, I don't get what was fixed in https://github.com/jnwatson/py-lmdb/commit/7c646047b1e96b51d945d19cf976f6750dc734a4?
You raise a good point. I fixed putmulti but not put. I'll fix it.
Upon further analysis, there's still an issue with your test. You must sort the vals with your array. Here's an example that uses putmulti to get around the issue with put. You should probably use putmulti (with more than one pair at a time) anyways if you are optimizing for performance.
import lmdb
import numpy as np

np.random.seed(1)
arr = np.random.randint(10, 15, 8)
vals = np.random.randint(0, 100000, 8)
sort_idx = np.argsort(arr)
arr = arr[sort_idx]
vals = vals[sort_idx]
both = sorted(zip(arr, vals))

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)
pack_item = lambda x: x.to_bytes(4, 'little')

with env.begin(db=db, write=True) as t:
    for tup in both:
        print(tup)
        key = pack_item(int(tup[0]))
        val = pack_item(int(tup[1]))
        with t.cursor() as curs:
            print(curs.putmulti([(key, val)], append=True))
1.
you should probably use putmulti (with more than one pair at a time) anyways if you are optimizing for performance.
For the insertion performance, yes. However, in our case arr and vals are enormous (>200 million items), so zipping them would at some point hold not only arr and vals in RAM but also the zipped list, i.e. more than 2x the memory of inserting directly with put (around 2.5% vs 10% RAM in our test). Therefore it would be awesome if this could be fixed for put too? ;)
2. How can I properly use get here? This is probably a stupid question... but I can't figure out how to get the value based on the key here:
env = lmdb.open('test.lmdb')
with env.begin() as txn:
    print(txn.get(pack_item(10)))
This will return None; perhaps I should specify the db somewhere?
3. Open cursor over and over again?
I see that with t.cursor() as curs: is called within the for loop. Is this required? In our case we would then open the cursor millions of times; wouldn't that introduce lots of overhead?
I will definitely fix the put. I'm just trying to provide a workaround. You can structure the array so you're keeping the keys and vals together. Inserting 10000 key/vals at a time might provide a 100x perf increase.
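To make that concrete, a sketch of batched inserts (the batch size and the toy data are assumptions; with real data, keys and vals would be the large arrays, and the target database is assumed to be empty so append stays in order):
import lmdb
import numpy as np

# Same toy data as above; in practice keys/vals would be the >200M-element arrays.
np.random.seed(1)
keys = np.random.randint(10, 15, 8)
vals = np.random.randint(0, 100000, 8)

# Sort by key, then by value, without building a Python list of tuples
# (append=True on a dupsort database needs the duplicates in sorted order too).
order = np.lexsort((vals, keys))
keys, vals = keys[order], vals[order]

pack_item = lambda x: int(x).to_bytes(4, 'little')
BATCH = 10_000  # arbitrary batch size; tune for your workload

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

with env.begin(db=db, write=True) as txn:
    # One cursor for the whole transaction, reused across all batches.
    with txn.cursor() as curs:
        for start in range(0, len(keys), BATCH):
            batch = [(pack_item(k), pack_item(v))
                     for k, v in zip(keys[start:start + BATCH],
                                     vals[start:start + BATCH])]
            curs.putmulti(batch, append=True)
Keeping one cursor open for the whole transaction, as in this sketch, should also cover question 3: a cursor stays valid for the lifetime of its transaction and does not need to be reopened inside the loop.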
See how I open the txn above with env.begin(db=db, write=True). You must specify the database (essentially, a separate table) when you open the transaction.
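For question 2, a minimal sketch of a lookup against the named database (reusing the pack_item helper and the same open_db flags as above; an existing named database must be reopened with matching flags):
import lmdb

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)
pack_item = lambda x: int(x).to_bytes(4, 'little')

# Passing db= (to begin(), or to get()) points the lookup at the named database
# instead of the unnamed default one, which is why the snippet above returned None.
with env.begin(db=db) as txn:
    print(txn.get(pack_item(10)))  # first value stored under key 10, or None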
Integer insert does not work with to_bytes(5, 'little')
I'm checking this for our "real dataset" and noticed that the above code also only works with pack_item = lambda x: x.to_bytes(4, 'little') but not with pack_item = lambda x: x.to_bytes(5, 'little'), which we require as our integers are bigger. Notably, this apparently was also an issue in https://github.com/jnwatson/py-lmdb/issues/280. What could be the problem here?
Does seem to work when converting the integer to a string and using that as the key...
Interestingly, it does not work when we use integer keys, but it does when we change from integers to strings, so in the above code change this line:
key = str(tup[0]).encode('utf-8')
All entries in that database must either be 4 or 8 bytes in length.
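So for integers that do not fit in 4 bytes, packing them as 8 bytes should work; a sketch, assuming a little-endian machine so that 'little' matches the native byte order integerkey expects:
# 8-byte keys satisfy the 4-or-8-byte constraint of integerkey/integerdup databases.
# (struct.pack('=Q', x) would be the byte-order-portable equivalent.)
pack_item = lambda x: int(x).to_bytes(8, 'little')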
You closed this issue (again), but I don't think put is fixed yet?
Aaah thanks! Was not aware of that, is it somewhere in the doc?
integerkey: If True, indicates keys in the database are C unsigned or size_t integers encoded in native byte order. Keys must all be either unsigned or size_t, they cannot be mixed in a single database.
Affected Operating Systems
Affected py-lmdb Version
py-lmdb Installation Method
Describe Your Problem
Short Description
Actually, I had another problem; however, while making the reproducible example I came across another confusing thing. When I create an array, sort it, and then insert the values using append=True, only the first occurrence gets inserted correctly. When I change it to append=False it works fine, so this is similar to #235. Or am I missing something here? We really require append=True due to speed issues (millions of inserts).
Code
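A minimal sketch of the reproduction described above (array sizes, key ranges, and the database flags are assumptions, mirroring the workaround example earlier in the thread):
import lmdb
import numpy as np

np.random.seed(1)
arr = np.sort(np.random.randint(10, 15, 8))      # sorted keys, with duplicates
vals = np.random.randint(0, 100000, 8)
pack_item = lambda x: int(x).to_bytes(4, 'little')

env = lmdb.open('test.lmdb', max_dbs=1)
db = env.open_db(b'db', dupsort=True, integerkey=True, integerdup=True)

with env.begin(db=db, write=True) as txn:
    for k, v in zip(arr, vals):
        # With append=True, put returns False for every duplicate key after the
        # first one; with append=False all pairs are stored.
        print(txn.put(pack_item(k), pack_item(v), append=True))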
This would give the following output:
When I set append=False:
Describe What You Expected To Happen
That put with append=True would work fine on already sorted values.
Describe What Happened Instead
See above