wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/
Other
531 stars 76 forks source link

Documentation Enhancement: Custom comparators are required to accept all bytestrings #117

Open trenta3 opened 4 years ago

trenta3 commented 4 years ago

Hi! First of all I thank you for the great project.

Summary

I was trying to use plyvel with custom comparators to build another project on top of LevelDB. I found that custom comparators are passed "all" bytestrings when used in a prefix iterator, and this can give problems if they expect a custom format for keys. I think it is interesting to report to the project and some mention of it should be added in the documentation.

Detailed Description

To give more structure to the keys I wanted to have them as tuples, that are ordered lexicographically in python. In this way I could have a multi-level index thanks to the custom ordering.

I'm using msgpack to pack the tuples into bytestrings and the bytestrings back to tuples inside of the custom comparator to unpack them.

def comparator(a: bytes, b: bytes) -> int:
    """
    We compare data as msgpack structures.
    We deserialize lists which have the lexicographic order.
    """
    at, bt = unpackb(a), unpackb(b)
    if at < bt:
        return -1
    if at > bt:
        return 1
    return 0

If I use a prefix iterator (in certain cases) on a database with such comparator I get the traceback:

Traceback (most recent call last):
  File "test3.py", line 15, in comparator
    at, bt = unpackb(a), unpackb(b)
  File "msgpack/_unpacker.pyx", line 206, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
FATAL ERROR: Exception raised from custom Plyvel comparator
Aborting to avoid database corruption...
Aborted (core dump creato)

It took me quite a bit to get what the problem is, but I finally found it.

Minimal Code Sample

My virtualenv (python 3.7.7):

attrs==19.3.0
hypothesis==5.19.1
msgpack==1.0.0
plyvel==1.2.0
sortedcontainers==2.2.2

The code responsible for the error, with debugging print statements inserted:

import plyvel
from hashlib import sha256
from msgpack import unpackb, packb
from unittest import TestCase, main
from tempfile import mkdtemp
from shutil import rmtree

def comparator(a: bytes, b: bytes) -> int:
    """
    We compare data as msgpack structures.
    We deserialize lists which have the lexicographic order.
    """
    print("COMPARE", a, b)
    at, bt = unpackb(a), unpackb(b)
    if at < bt:
        return -1
    if at > bt:
        return 1
    return 0

class TestDBInsertion(TestCase):
    def setUp(self):
        super().setUp()
        self.dbroot = mkdtemp(prefix="plyvel")
        self.db = plyvel.DB(self.dbroot, create_if_missing=True, comparator=comparator,
                            comparator_name=b"msgpack_tuple_comparator")

    def tearDown(self):
        super().tearDown()
        self.db.close()
        rmtree(self.dbroot)

    def test_exists_after_insertion(self, bstring: bytes = b"\xfe\"\xfe"):
        print("BSTRING", bstring)
        # Generate the key using an hash function and a random number
        hashfun = sha256()
        hashfun.update(bstring)
        stringhash = hashfun.digest()[24:]
        print("HASH", stringhash)
        key = packb([stringhash])
        self.db.put(key, bstring)
        print("INSERTED!")
        # Now we test that there is something under the first path
        subobjects = list(self.db.iterator(prefix=packb([stringhash])))
        self.assertEqual(len(subobjects), 1)

if __name__ == "__main__":
    main()

What it prints when executed:

BSTRING b'\xfe"\xfe'
HASH b'\x89=\xcf\xf9\x90q\x9b\xff'
INSERTED!
COMPARE b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9b\xff' b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9b\xff'
COMPARE b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9b\xff' b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9b\xff'
COMPARE b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9b\xff' b'\x91\xc4\x08\x89=\xcf\xf9\x90q\x9c'
Traceback (most recent call last):
  File "test3.py", line 15, in comparator
    at, bt = unpackb(a), unpackb(b)
  File "msgpack/_unpacker.pyx", line 206, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
FATAL ERROR: Exception raised from custom Plyvel comparator
Aborting to avoid database corruption...
Aborted (core dump creato)

Notice that the hash of the bytestring ends in \xff. It is then clearly visible in the sequence of compare how the second member is the "end of the range" in the last line, which is however invalid msgpack format, so that the comparator function crashes.

Questions

I would like to ask a few questions, about the possible fixes that come to my mind about the issue (my usecase in particular):

1. The ideal approach

I understand the need for the iterator to compare the first string outside of the defined range to check if other elements remain to be iterated, even if it was not obvious at first and I think it should be at least mentioned in the documentation.

That said, I think that a good approach would be to let the user define also a custom next function which returns the first key after the end of the prefix-range so that one is allowed to use only a subset of keys and can then use custom encodings in them. I understand that there may be issues in the implementation of LevelDB, since this is only a wrapper.

So the question: Do you think the above is feasible?

2. The practical approach

I would like to be assured that this is the only case in which a key outside my range is presented to comparison (of course assuming that only valid keys are inserted into the database). If it is so, then I would simply capture the exception and know that this is just a key following a bytestring ending in \xff, recover the real next tuple and compare this with the other key.

The question here is: Is this the only case in which I can be given a key outside my range?

Suggested changes in documentation

I think that mentioning that when calling iterators, keys outside of the user inserted coding can occur in comparison. In particular it would be great to add something to the Custom Comparator Documentation, when the actual occurrences of this behaviour have been clarified.

Thank you in advance for your time.

wbolster commented 4 years ago

thanks for the detailed report. i understand your issue. the "next value" for byte strings is plyvel, not leveldb, and used only for the prefix= iterator.

a pluggable function could be one option indeed.

another one is to not use a custom comparator at all, since those come at write a performance cost due to the many crossings between python and c++, including outside the python program's control since leveldb compacts in a background thread. (hence the fatal error + abort, there is no python stack to raise from in all cases).

perhaps another key design can achieve the same? what do your exactly need?

(also i'm travelling without a computer currently so didn't reproduce your samples)

trenta3 commented 4 years ago

Thanks for the quick reply. I know about the performance problems with custom comparators, but they provide rapid feedback on some ideas which is incredibly valuable for rapid prototyping.

Concerning the key design I'm using pack and unpack because I need to be able to use arbitrary-length tuples of data (strings, integers, floats and bools) and I'm using the lexicographic ordering for tuples to build indexes (I'm trying to build a more complex database on top of LevelDB).

For example, if our records are posts (with fields author, year, title, content) coded as dictionaries we can insert them under the tuple-key ("posts", post_uuid). Then we can build an index that stores both authors and years using tuple-keys of the form ("index", author, year, post_uuid) so that since the index is ordered lexicographically it is efficient to search by author, and also by year if the author is fixed (not the best example for it but in some cases you create just one index instead of two).

I'm currently thinking of a way to encode keys to preserve their lexicographic ordering and I think I will take that route at the end, but I still think that being able to pass a pluggable function for next value would be valuable in other cases in which not all bytestrings are valid key encodings.

If you agree on the solution, I can also send a Pull Request that adds a pluggable function for the next value, but I will have to learn a bit more cython before :)