mtanski / dbkit

Columnar query processing engine in Rust
31 stars 5 forks source link

Still working on this? #6

Open jonathanstrong opened 6 years ago

jonathanstrong commented 6 years ago

Hey -

I've been looking for examples of how to write fast querying operations in Rust and came across this. I was interested in whether you are still working on the project, since it hasn't seen a lot of activity lately. I hadn't previously heard of the supersonic query engine - but all of the references I can find are quite dated. Is it still in use? Did it become something else? Thanks for any information you can provide.

mtanski commented 6 years ago

@jonathanstrong I'm looking forward to resuming the work here again in dbkit now that rust supports non-lexical lifetimes. The low level nature of a query engine somehow always contributed to bumping into this; it was just so frustrating. Additionally, since I've last work on dbkit I've researched and explored additional implementation ideas about how to structure the query engine (push/pull) and how to build a JIT (and how to simplify to make this practical). I'll probably jump back in starting next weekend.

If you've like to help with this project or just learn, it be great to have you.

As far as Supersonic goes, it's an internal Google project that they stopped publishing in an external fashion. But it's pretty complete so there's no reason you can't use it in a project it fit into. Also over the years my team and I fixed a number of bugs, made small improvements, most recently ported it to C++14, modernized the build system and finally reworked much of the ownership code (to use unique_ptr / shared_ptr explicitly in the API). The work is here: https://github.com/adfin/supersonic/network take a look at the modernize2 branch.

jonathanstrong commented 6 years ago

I'll definitely check in on this. I'm fairly competent at Rust but new to database design. I've been searching for good places to jump in and learn. Would you recommend getting to know the supersonic codebase (caveat: not fluent in c++) or this library to start? Thanks and good luck on the project?

mtanski commented 6 years ago

I think Supersonic is a good example of a database query engine with some pretty big caveats. It's only the query execution engine (relational algebra / calculus) everything else is excluded. It's high performance and vectorized... but the authors traded of performance for code readability; it's very heavy on templates / pre computation so the C++ is hard to read.

If you're trying to learn how to implement a pretty generic database I think the PostgreSQL code is easiest to read and navigate. I will also point you to a reddit sub I'm trying to get off the ground specifically about database research and implementation: https://www.reddit.com/r/dbms/ There's some good resources there, esp the links to the CMU / PSU course videos.

Finally, I've made some progress after the holidays to start getting dbkit into shape. There's a branch creates an outline for expressions and updates to the current unstable alloc API. https://github.com/mtanski/dbkit/pull/7 .

I'm going to pause the work on expression now, while I'm figuring out / experimenting how to implement execution scheduling. After spending a bunch of time researching I'm pretty confident the right approach is the one outlines in the recent Morsel approach versus explicit Push/Pull models of operations.

Again, I'd love to have you participate. If you're trying to implement Operations / Expressions, I have enough knowledge and pointers to guide you in the right direction.

jonathanstrong commented 6 years ago

the subreddit looks awesome, just subscribed.

Where I want to end up is being able to design custom time series database code inside my applications. Like kdb+ or influx except instead of writing q or querying via http I just interface to the engine in Rust. The reasons I'm interested are 1) databases that I've used (open source) seem to always have massive performance pitfalls lurking, which is a pain and suggests that tailored code will be a win, 2) I am always writing code to sort collections by this or that attribute and I realized this is the same thing databases do, 3) I find it interesting.

I'd be happy to try to help but I'm fearful I've got a long ways to go before it would be useful. If you want to point me at something that would be useful, I'll give it a go.

mtanski commented 6 years ago

It sounds like something I've built at my day job; a distributed TS relational databases (in C++).

I think the best help right now would be implementing maybe a handful of expression. With that we can implement Filter (WHERE clause, if expression evaluates to true) and Compute (so computed output columns like addition) operations.

You can look at the supersonic code to see what expressions they support. I'd start with handful of projecting expression (so we can feed in data) then some IsNull, NotNull, and value comparisons. At from there we can imagine a whole expression tree. Since you know Rust much better then me, I imagine you leverage language features (generics, traits, etc..) to simplify a lot of the implementation.

Like I said after that comes Filter & Compute operations which will I think will mostly straightforward. There's also some room for fun with implementing some of the expressions using SIMD.

jonathanstrong commented 6 years ago

hey - just wanted to check in and let you know this is still on my radar screen. I'm doing a lot of work prototyping a fixed-shape, very customized columnar "database" for my day job and learning a lot in the process. I'll be able to circle back in a little while with the benefit of this and have more to offer the project. I'm using a fork of the bit-set crate for bit vector indexes, which is working very well: https://github.com/jonathanstrong/bit-set. Also doing a lot of benchmarking with rayon and starting with the "faster" simd crate.

Going through your code, first off, hats off for putting a lot into this. It's a big codebase and a lot of work went into it. Secondly, it's very interesting to see the c++ patterns applied to rust so thoroughly. I haven't run across anything quite like it yet. I'm kind of amazed you got all the lifetimes to work, it's a lot of hard work to get the compiler happy. As an aside, prior to rust I was coding in python and using mostly functional style. I learned c++ in high school but have never done anything serious with it, so very different background.

One question I wanted to ask you is, is there any place you have outlined the vision for the project in more detail? In particular, how do you expect people to interact with it? Via rust as a library? With a query language? One reason I ask this is the supersonic code includes very many hard-coded expressions, but generally in rust the idiomatic approach is closures. Potentially that could provide a very ergonomic api to code against, but again, not sure what the expected approach will be.