Seems like py-lmdb doesn't support Chinese keys across named databases?

zyh3826 commented 2 years ago

Affected Operating Systems

Linux

Affected py-lmdb Version

1.3.0/1.0.0

py-lmdb Installation Method

sudo pip install lmdb

Machine "free -m" output

              total        used        free      shared  buff/cache   available
Mem:         257421       46163      116679        2694       94578      207699
Swap:         32767       23075        9692

Describe Your Problem

I have several named databases and some key-value pairs with Chinese keys. When I insert them into the named databases, the result is incorrect: every named database ends up with the same data. Code:

import lmdb
import pickle

# insert
d = {
    '1999': [['19990012', '动画片']],
    '1114': [['11140004', '动画片'], ['11140011', '冒险']],
    '1101': [['11010020', '冒险']]
}
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(pickle.dumps(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for item in items:
            tag_id, tag_name = item
            key = tag_name.encode()
            val = tag_id.encode()
            sub_type2id_txn.put(key, val)
        print('{} -> {} -> {}'.format(main_type, len(items), sub_type2id_txn.stat()['entries']))
sub_type_env.close()

# iterate
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(pickle.dumps(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for i, j in sub_type2id_txn.cursor():
            print('{}->{}->{}'.format(main_type, i.decode(), j.decode()))
sub_type_env.close()

output:

1999->冒险->11010020
1999->动画片->11140004
1114->冒险->11010020
1114->动画片->11140004
1101->冒险->11010020
1101->动画片->11140004

I tried another encoding method, such as pickle, but got the same results. What should I do? Thanks a lot.

zyh3826 commented 2 years ago

Changing the data to:

d = {
    1999: [[19990012, '动画片']],
    1114: [[11140004, '动画片'], [11140011, '冒险']],
    1101: [[11010020, '冒险']]
}
also gives the wrong output:
1999->冒险->11010020
1999->动画片->11140004
1114->冒险->11010020
1114->动画片->11140004
1101->冒险->11010020
1101->动画片->11140004

Changing the data to:

d = {
    1: [[1, '2'], [3, '4'], [5, '6']],
    2: [[3, '4'], [5, '6']],
    3: [[5, '6']]
}
gives the correct output:
1->2->1
1->4->3
1->6->5
2->4->3
2->6->5
3->6->5

vEpiphyte commented 2 years ago

It appears that the \x00 bytes in the pickle output interact badly with mdb.c's use of strlen to compute the length of the database name: the name is cut off at the first \x00, so different pickled names can end up pointing at the same named database. Using str.encode() instead of pickle.dumps produces the expected output.

import lmdb
import pickle

def en(s: str) -> bytes:
    # pickle.dumps fails here: its output contains \x00 bytes
    # ret = pickle.dumps(s)
    # str.encode works
    ret = s.encode()
    parts = ret.split(b"\x00")
    print(f'en ({s}) -> {ret} | {parts}')
    return ret

d = {
    '1999': [['19990012', '动画片']],
    '1114': [['11140004', '动画片'], ['11140011', '冒险']],
    '1101': [['11010020', '冒险']]
}

sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(en(main_type))
    with sub_type_env.begin(write=True, db=db) as sub_type2id_txn:
        for item in items:
            tag_id, tag_name = item
            key = tag_name.encode()
            val = tag_id.encode()
            sub_type2id_txn.put(key, val)
        print('{} -> {} -> {}'.format(main_type, len(items), sub_type2id_txn.stat()['entries']))

sub_type_env.close()

# iterate
sub_type_env = lmdb.open('./test_lmdb', map_size=1000000, max_dbs=1000)
for main_type, items in d.items():
    db = sub_type_env.open_db(en(main_type))
    with sub_type_env.begin(write=False, db=db) as sub_type2id_txn:
        for i, j in sub_type2id_txn.cursor():
            print('{}->{}->{}'.format(main_type, i.decode(), j.decode()))

sub_type_env.close()
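
The collision can also be reproduced without LMDB. The following is a minimal sketch, not the library's actual code path: `c_string_view` is a hypothetical helper that mimics what a strlen-based reader would keep of a name (everything before the first \x00), and the pickle protocol is pinned to 4 (the default on Python 3.8+) as an assumption so the bytes are predictable. It suggests why the '1999'/'1114'/'1101' names collide, why the integer keys 1/2/3 do not, and why str.encode() avoids the problem.

```python
import pickle


def c_string_view(name: bytes) -> bytes:
    """Hypothetical helper: the bytes a strlen-based C reader would keep,
    i.e. everything before the first NUL byte."""
    return name.split(b"\x00", 1)[0]


def effective_names(keys, to_bytes):
    """Map each key to the truncated name a strlen-based reader would see."""
    return {key: c_string_view(to_bytes(key)) for key in keys}


def pickled(key) -> bytes:
    # Assumption: protocol 4, the default on Python 3.8+.
    return pickle.dumps(key, protocol=4)


# Four-character strings: pickle writes a frame header whose 8-byte length
# field contains NUL bytes, so every name is cut to the same short prefix.
str_names = effective_names(['1999', '1114', '1101'], pickled)

# Integers 1999/1114/1101 get the same kind of frame header and also collide.
int_names = effective_names([1999, 1114, 1101], pickled)

# Small integers pickle to fewer than 4 payload bytes; on CPython 3.7+ no
# frame header is written for them, so no NUL byte appears and names differ.
small_names = effective_names([1, 2, 3], pickled)

# Plain UTF-8 encoding of these strings contains no NUL bytes at all.
utf8_names = effective_names(['1999', '1114', '1101'], str.encode)

for label, names in [
    ('pickled str', str_names),
    ('pickled int', int_names),
    ('pickled small int', small_names),
    ('utf-8 str', utf8_names),
]:
    print(label, '->', len(set(names.values())), 'distinct name(s)')
```

If this reading is right, the Chinese keys are unrelated to the problem: only the bytes passed to open_db matter, and any name containing \x00 is at risk.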
