pgcentralfoundation / pgrx

Build Postgres Extensions with Rust!

Inquiry: purpose-built Postgres extensions in the compute-as-a-crate context #1024

Closed tucnak closed 8 months ago

tucnak commented 1 year ago

Hey,

I'm not really a Rust programmer, although I'm a huge Postgres person, and it's becoming harder and harder to ignore pgx these days with the ZomboDB, Timescale, and PostgresML people capturing the spotlight all at once. The more I learn about the Postgres scheduler and the aggregation/materialisation engine, the more fascinated I am with the prospect of extending Postgres, and further benefiting from it within the confines of the relational model. Throughout the year I dealt with quite a few unrelated Postgres projects (all quite interesting, in my opinion!) where we ran into various limitations of the available indexes (which can be addressed by rewriting random-access loads in terms of continuous aggregate tasks), the aggregate engine, and foreign data wrappers, among other things.

My work last year gave rise to some fascinating business-motivated ideas for Postgres extensions; I will name a few.

Frequency analysis in the aggregate-continuous model by means of FFT

Some kind of Fourier custom aggregate / index access method would enable us in numerous ways when it comes to systemic errors in otherwise invariant error streams, helping uncover errors that would occur regardless of the immediate usage patterns; this is super relevant for SQL-based observability backends such as Promscale. In my view, this is something that thousands of analysts and observability engineers around the world could benefit from. Sure, you can always ETL the data out, do the Fourier analysis elsewhere, then push the results back into the relational model to carry on with your analysis. However, things start to get very complicated if you're dealing with a large enough dataset (space-partitioned hypertables say hello) or a high-velocity dataset where some kind of near real-time is a requirement.

In this case, what could do the trick is a PARALLEL SAFE custom aggregate combined with an index method tailored to it for memoization of partial computations; ergo my fft_agg() digests could be set up in the two-step aggregate fashion and could also be rolled up (i.e. I can take an hourly aggregate and easily produce a daily one using rollup(fft_digest) over hourly groups, where the digest is a well-indexed result of the fft_agg() column). Why bother? Continuous aggregates provide a Postgres-native way to treat materialisations using watermark logic, whereas any SELECT on one results, under the hood, in a UNION of the materialised component and live execution of the realtime component. New data can be set to memoize in bucket increments (see add_continuous_aggregate_policy).

The key proposition would be to highlight the aggregate-continuous model, built-in deduplication and CPU-bound scale-out capability that Postgres has to offer. FFT is, in fact, quite a parallel problem in both the intragroup and intergroup sense of the word, meaning that you can leverage both precomputed values (available via custom index access method) as well as scatter-gather logic that comes with distributed tables, so individual CPU-bound tasks are performed at the respective data nodes per partition.
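To make the two-step aggregate idea concrete, here is a minimal sketch in plain Rust of the transition/combine/finalize shape such an fft_agg() would need. The names (FftDigest, fft_agg) are hypothetical; in a real extension these three functions would back a PARALLEL SAFE CREATE AGGREGATE (SFUNC, COMBINEFUNC, FINALFUNC), and the toy per-sample DFT stands in for a real FFT. Because the transform is linear in its samples, partial digests over disjoint slices of the series can simply be added together, which is exactly what both parallel workers and an hourly-to-daily rollup() rely on.

```rust
/// Toy partial-aggregate state for a DFT over a series with a known
/// period (number of frequency bins). Hypothetical sketch, not pgx API.
#[derive(Clone)]
struct FftDigest {
    re: Vec<f64>,
    im: Vec<f64>,
    period: usize,
}

impl FftDigest {
    fn new(period: usize) -> Self {
        FftDigest { re: vec![0.0; period], im: vec![0.0; period], period }
    }

    // SFUNC: fold one sample (with its global index n) into the digest.
    // The DFT is linear, so per-sample contributions just accumulate.
    fn accumulate(&mut self, n: usize, x: f64) {
        for k in 0..self.period {
            let angle = -2.0 * std::f64::consts::PI
                * ((k * n % self.period) as f64)
                / self.period as f64;
            self.re[k] += x * angle.cos();
            self.im[k] += x * angle.sin();
        }
    }

    // COMBINEFUNC: merge two partial digests. This is what makes the
    // aggregate PARALLEL SAFE, and what a rollup() over stored hourly
    // digests would invoke to produce a daily digest.
    fn combine(&mut self, other: &FftDigest) {
        for k in 0..self.period {
            self.re[k] += other.re[k];
            self.im[k] += other.im[k];
        }
    }

    // FINALFUNC: per-bin magnitudes of the accumulated spectrum.
    fn finalize(&self) -> Vec<f64> {
        (0..self.period)
            .map(|k| (self.re[k].powi(2) + self.im[k].powi(2)).sqrt())
            .collect()
    }
}
```

The key property to notice is that accumulating the whole series at once and combining two digests built over the two halves give identical results; that equivalence is what the planner exploits for parallel workers and what the continuous-aggregate rollup would exploit across buckets.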

We are prepared to spend a lot of money building something like this in open source.

PageRank and similar "recursive" algorithms in the aggregate-continuous model

You can write PageRank in SQL. You could probably also do it using recursive CTEs, although I'm not sure about that. Either way, it's not rocket science, and I'm sure many competent analysts did eigenvector algebra in SQL before me and will be doing it long after I'm gone. In any case, I'm always looking for a "minimize" operation that could bring the operational cost of something down to a point where I could easily put it in a subquery, or a view, or perhaps simply ORDER BY it for good measure, so I can have a metric, sanity checks for it, et cetera. In this case, I couldn't help but wonder if there's a better way to do PageRank in the world of Postgres & Timescale? Both in the static case and in the dynamic case, where new rankings would have to be built almost ad hoc for a multitude of features; think of it as a subquery in some materialised view!

I would expect this to be fully possible, but I can't really approach it for lack of perspective on the pgx side.
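For reference, the core computation is small. Below is a minimal power-iteration PageRank in plain Rust (the edge list and damping factor are illustrative). In the aggregate-continuous framing, each iteration is essentially a grouped SUM(rank / out_degree) over an edge table joined to the previous iteration's ranks; this sketch performs the same arithmetic directly:

```rust
/// One-file PageRank power iteration. `edges` holds (source, destination)
/// pairs over `n` nodes; `damping` is the usual 0.85-style factor.
/// Dangling nodes simply leak their mass here, which a production
/// implementation would redistribute.
fn pagerank(n: usize, edges: &[(usize, usize)], damping: f64, iters: usize) -> Vec<f64> {
    let mut out_degree = vec![0usize; n];
    for &(src, _) in edges {
        out_degree[src] += 1;
    }
    // Start from the uniform distribution.
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iters {
        // Teleportation term, then push each node's rank along its edges.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for &(src, dst) in edges {
            if out_degree[src] > 0 {
                next[dst] += damping * rank[src] / out_degree[src] as f64;
            }
        }
        rank = next;
    }
    rank
}
```

The "dynamic" case in the text amounts to re-running only the iterations whose inputs changed, which is where a continuous-aggregate-style incremental materialisation would earn its keep.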

Polars support or some kind of API for table wrangling & function exporting

I've only recently learnt about Polars, but I'm already a big fan, for it completely resolves my grievances with pandas from back in the day when I was working with data in Python. In my use case, which is PostgresML, what I care about and where I'd like to see improvements is feature engineering work. For example, I realised early on that hypertables are perfect for storing historical predictions and training sets; with continuous aggregates you also get natural downsampling of past data for continuous training, and many benefits like it.

This means that I'm basically able to do so much more in SQL, including but not limited to training and (batch) inference jobs.

I would also like to do feature engineering using SQL functions, the internal kind. Or perhaps in the context of custom access methods: being able to write a lazy, high-performance implementation of some algorithm for some expression and, hopefully, being able to leverage it using internal functions and/or operators. Having the Python option (in the case of Polars) would be a game-changer for quick prototyping, too.

There are a few missing pieces before I can transfer my complete machine learning stack to Postgres, and this is one of them.

Top-k similarity search of chess games and positions

Lichess offers gigabytes upon gigabytes of chess games, and if you were to search for individual positions, you would first have to write them down in something like FEN, enumerate said text representations in a position table, and put a trigram index on them with a considerable siglen=256 for the necessary speed and precision.

This would in effect allow you to do a top-k similarity search.
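To see what that search is actually measuring, here is a rough sketch of the trigram similarity that pg_trgm computes, applied to FEN strings. This is a simplification: pg_trgm pads and normalises the input and, for GiST, hashes trigrams into a fixed-size signature (hence siglen); comparing raw trigram sets is nonetheless enough to see why textual similarity tracks positional similarity only loosely.

```rust
use std::collections::HashSet;

// Collect the set of character trigrams of a string (no padding or
// normalisation, unlike real pg_trgm; illustrative only).
fn trigrams(s: &str) -> HashSet<[char; 3]> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

// Jaccard-style similarity over the two trigram sets, in [0, 1].
fn similarity(a: &str, b: &str) -> f64 {
    let ta = trigrams(a);
    let tb = trigrams(b);
    let inter = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    if union == 0.0 { 0.0 } else { inter / union }
}
```

For example, the starting position and the position after 1.e4 differ in one rank plus the metadata fields, so their FEN strings share most of their trigrams and score high; a positionally crucial but textually tiny change (say, one king square) moves the score barely at all, which is exactly the counterpoint below.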

Counterpoint: you can have positions that are textually quite different from one another despite being virtually equivalent from the chess logic / opening theory standpoint. The opposite is also true: some positions may be only slightly different textually, yet radically different in principle. Trigram search is clearly not going to yield optimal results here. I briefly considered implementing a custom index access method based on internal Stockfish data structures and the evaluations that arise during alpha-beta pruning. Unfortunately, this project didn't gain traction in the company, so they decided not to commit the necessary resources.

I adore the idea still and would love to sponsor an open source project like it.

Follow-up

Please let me know what you think! I have been accumulating these ideas over time despite the fact that I'm not well-equipped to validate them properly. Guilty as charged... I know it's kind of a lot; however, if there's anyone who should be competent in the matter, it's you guys. I know this having talked to Timescale and PostgresML people, who've managed to convince me (perhaps unintentionally) that all the nice things in the space will probably begin and end with pgx :-)

Best regards Ilya

tucnak commented 1 year ago

To be clear, regardless of the self-conceit in the OP, I'm currently only able to sponsor pgx in a personal capacity, to the best of my ability. Yet I am absolutely invested in extracting as much value for my work and research out of Postgres as I can, and if there's value (and there is, without a doubt) the company will back me up; but first I have to make sure that at least my core logic is correct insofar as the application goes. That's very hard to do when you already feel out of your depth just getting started, and for somebody who is proficient in neither modern C nor Rust, it certainly feels like that! A realtime system to correctly analyse chess games, their positions, and their variations, with respect to time series and predictive capability, is not only inherently valuable to a business; it's also a dream case study for me to present in the company, attesting to the full extent of what's now possible if you get at least a few half-competent data engineers like me to do the data layer completely in Postgres, almost literate-programming style, if you know what I mean.

Hopefully this makes sense, because I've realised that, taken out of context, the OP would be hard to respond to.

eeeebbbbrrrr commented 1 year ago

These are all great ideas and the reason pgx exists!

I don't foresee us (the core pgx development team) tackling any of these specific problems as pgx is more of a generalized framework for developing Postgres extensions in Rust. And that's the problem we're pretty good at solving.

That said, there's definitely an opportunity for an even larger community and ecosystem around pgx and extensions like the ones you describe here. "all the nice things in the space will probably begin and end with pgx" is a sweet thing for us to hear, and it speaks directly to the idea of a larger community and ecosystem. I won't speak for them, but I feel the pgx project has good relationships with Timescale, PostgresML, and other players in the extension space, and we sincerely hope those relationships continue to grow. I can speak for the ZomboDB guy, though, and I personally know he grows more and more excited about pgx's potential every day.

I'll leave this issue open for a bit -- someone from the community might come around and be inspired or might even reach out to you directly.

Thanks for stopping by and your interest in pgx!

tucnak commented 1 year ago

Hey thanks for getting on it and so quick!

I don't foresee us (the core pgx development team) tackling any of these specific problems

That is understandable, and nobody would (hopefully) expect that from you. What I've struggled with is that not being immediately familiar with the internal APIs of Postgres makes it hard to go from some intuitive idea, like a PageRank extension and how it would be used in terms of the aggregate APIs, or even a custom index access method, to an actual decomposition, and that's something I was hoping to discuss! For example, could I have some internal function somehow leverage the custom index method, all in one extension? (If not, that would invalidate the whole premise of an aggregate-indexed PageRank.) Which of the aforementioned "ideas" would you guess would be the easiest to figure out, based on your intuition of Postgres internals and pgx's own limitations?

Without some concrete tutorial or learning material, it's hard to take one thing and simply focus on it.

Best

workingjubilee commented 1 year ago

Hm. Your request for "Polars support" is... very general?

Other bits are a lot to digest, even though I am somewhat familiar with data-wrangling acceleration libraries and FFTs and even chess programming a bit, for Various Reasons. I tend to think more about what pgx is, the tools for building implementations for these things, than the implementations themselves, however.

We've spent a fair amount of the past year on fundamental improvements: revamping the foundation of pgx to be more solid to build on and to allow more ambitious feature work, while providing a safer and more intuitive Rust interface. A lot of what you describe sounds possible, but it would have to be hacked up on a case-by-case basis, as we don't necessarily offer Rust (never mind sound) bindings to the entire C extension interface of Postgres. All of these sound possible, sure, so it's hard to pick "well, this one would be easiest", because it really depends a lot on exactly what you would be satisfied with, and also on how comfortable you are reading C headers.

workingjubilee commented 8 months ago

I'm going to close this because it doesn't seem very actionable as a specific idea, although I remain open to helping you with anything!