wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/
Other
530 stars 75 forks source link

Consider adding lower-level iterator controls available in LevelDB #17

Closed tommetge closed 10 years ago

tommetge commented 10 years ago

Specifically:

This is a completely different model of iteration from standard python. As mentioned here:

https://github.com/tommetge/plyvel/commit/25364fc6e8395aee84143f5733a0fc98e1a72000

... we have an existing code base whose logic depends on the outlined features. It may not be appropriate for the general Plyvel consumer.

Let me know what you think.

jtolio commented 10 years ago

also, we need to support closing iterators, so that we don't have to wait for garbage collection to kick in to free the leveldb snapshot backing the iterator

wbolster commented 10 years ago

The ability to close iterators explicitly is actually an issue on itself. I've just opened #19 for that.

jtolio commented 10 years ago

thx

wbolster commented 10 years ago

I have a few remarks on the initial report.

seek() should actually perform a seek (this is not the case right now)

That is not correct. Iterator.seek() does seek (by calling LevelDB's Iterator.Seek()); it just does not return anything.

allow callers to check if the iterator is currently valid (LevelDB::Iterator::Valid())

This depends on how the iterator is used: next(...) might be valid, but .prev() might not (or any other combination).

allow callers to manually step the iterator forward and backward

It seems to me the same can be achieved using next(...) or .prev() (and ignoring the result).

allow callers to fetch the key or value for a current iterator

Again, this depends on how the iterator is used (see above).

This is a completely different model of iteration from standard python.

Indeed, Python has an (imho very sophisticated) iteration interface. I've spent quite some time transforming the LevelDB iterator interface to conform to Python's conventions. The result is that the underlying code is a state machine instead of a straight-forward API bridge. Mixing this already quite hairy code with additional Python-level access to the underlying LevelDB concepts is, at best, a daunting task. I'm especially worried about all the corner cases combining things like prefixed databases, the .iterator(prefix=...) arg, and all combinations of start, stop, include_start, include_stop, and prefix. Getting this (hopefully completely) right took quite some effort; see e.g. #4, #9, #15.

tommetge commented 10 years ago

Thanks for the response. One correction in my original report: it's not seek() that doesn't move the iterator, it's seek_to_start() and seek_to_stop() that do not. These calls defer actual movement of the iterator to real_next() and real_prev().

The fact that seek() operates as expected (and moves the iterator position in LevelDB) makes the case for exposing Valid() a little stronger because, as you noted, next() and prev() can return entirely different things, depending on the database contents. The ambiguity can lead to confusion when implementing seek() (i.e. should i use next or prev here?).

I think that seek() really doesn't fit the Python iterator idiom in general - there isn't such a concept in the iterator protocol. The Python model would accommodate a seek by range-based iterating, which is already implemented.

Manually stepping with next() and prev() is totally fine. We've implemented it precisely that way, wrapping it in a try/except block to emulate an exception-free step.

I think the core concern here is that Plyvel is an excellent Python interface for LevelDB's C library - but our code makes some assumptions about seek(), seek_to_start(), and seek_to_stop() that aren't respected in Plyvel and are incompatible with the Python concept of iteration. Given this, if we can't find a good fit for these features here, I'm totally fine maintaining a fork or branch of Plyvel that implements these lower-level features that never sees use outside of our (rather unique) code base.

jtolio commented 10 years ago

the only differences i really am aware of between python iterators and leveldb iterators are that leveldb-next advances the iterator once, whereas python-next saves the current key and value, advances the iterator, and then returns the saved key and value. prev is similar but in the opposite order. basically the difference between i++ and ++i, right?

i feel like it should be fine to let both semantics coexist defaulting to python semantics, provided that there's calls to the lower level leveldb semantics-style calls?

maybe i'm just talking out of my rear end again, as i'm not as familiar with plyvel as either of you two. i'm also aware wouter had some questions about the leveldb-py interface (it seemed weird to him), but i think it's possible for both semantics to coexist.

for someone looking for a fast leveldb library in python, it's possible they'll want to use leveldb semantics, in the case they're transliterating code from a different language or something.

wbolster commented 10 years ago

leveldb-next advances the iterator once, whereas python-next saves the current key and value, advances the iterator, and then returns the saved key and value. prev is similar but in the opposite order

Not exactly. next() in Python first calls Next() in C++, then returns the current value. .prev() in Python remembers the current value, calls Prev() in C++, then returns the remembered value.

However, both next() and .prev() have quite some logic to correctly handle all the start, stop, _includestart, _includestop variations. Allowing direct calls to Next() and Prev() on the C++ level could easily mess up this carefully crafted logic.

Example: Next() might succeed on the C++ level, but if it's called on a PrefixedDB instance, the iterator could still be invalid, even though the underlying C++ iterator is still valid.

i'm also aware wouter had some questions about the leveldb-py interface (it seemed weird to him)

Well, not exactly 'weird', but mostly not reallly Pythonic. With Plyvel, all common iteration patterns can be expressed using a common for loop with the right combination of keyword args to DB.iterator(), while in leveldb-py calling code needs to interact with the iterator instance instead of just using a for loop.

wbolster commented 10 years ago

Fwiw, I would not mind adding a "low level iterator" API to Plyvel, but I'd like to keep it separate from the Pythonic iterator API with all its quirks and corner cases, as outlined before.

What do you think of a db.raw_iterator() API returning a RawIterator instance that almost completely mimics the C++ API? It would not have support for the all the fancy flags (and would not support the "prefixed database" feature either).

jtolio commented 10 years ago

that would be fantastic.

wbolster commented 10 years ago

What is your opinion on .key() and .value() methods versus return values from .next() and .prev() (and maybe others)? And should there be a .valid() method, or should the code handle this inside the other methods?

— Wouter

(Sent from my tablet. Please ignore the typos.)

JT notifications@github.comschreef:

that would be fantastic.

— Reply to this email directly or view it on GitHub.

jtolio commented 10 years ago

whoops, i suck at email once again. i'm on freenode as 'jtolds' btw

the benefit to having separate key and value methods is that you don't have to call the underlying c method if you don't need to, reducing memory copies. it also helps makes code more general. you can have some chunk of code decide where to seek the iterator to, and another chunk of code to handle the iterator post seeking.

i also think a valid method is useful, cause try/except blocks around every seek/step seem like worse code to me, but i guess that's more subjective

wbolster commented 10 years ago

I've just implemented a RawIterator (published in the raw-iterator feature branch for now). Commit 8d0bfa600f01d1fef98616739d1f5a681c72d13a lays the foundation by moving some of the shared iterator logic into a base class, and commit 2cd3101ac1f6bcc31132d7c465804540940f6e73 adds the new RawIterator API on top of that.

@tommetge, @jtolds: please review these changes. The tests provide example API usage; the implementation itself shows the exposed API.

wbolster commented 10 years ago

Relevant parts of the commit message:

Implement low-level "raw iterator" API

Added a .raw_iterator() method to DB and Snapshot instances. The
RawIterator instance returned by .raw_iterator() mimics the C++ iterator
API provided by LevelDB. The API is almost a direct mapping, except for
some added error handling, which is required to avoid segfaults or other
aborts from within LevelDB.
jtolio commented 10 years ago

looks good to me

wbolster commented 10 years ago

Okay, so this just needs docs. I'll look into that later.

— Wouter

(Sent from my phone. Please ignore the typos.)

JT notifications@github.com schreef:

looks good to me


Reply to this email directly or view it on GitHub: https://github.com/wbolster/plyvel/issues/17#issuecomment-28438007

wbolster commented 10 years ago

Just released Plyvel 0.7 with this new API. https://plyvel.readthedocs.org/en/latest/news.html#plyvel-0-7

jtolio commented 10 years ago

awesome, thx wouter