Project maintenance going forward

thenonameguy commented 5 years ago

Hey,

It seems that, with the recent acquihire of the main developers of Distributed Masonry, project development has slowed down. With the advent of better maintenance options for open-source projects from the Clojure community (Clojurists Together/Patreon), it would make sense to apply to those programs to fund development of Onyx.

We just integrated Onyx in a production system and it would be nice to keep up dev efforts, as I think this project is still the de facto Clojure data processing platform with its unique approach.

I would like to hear some thoughts from the devs on what the future realistically holds for this project and what we can do together to keep it going forward.

@lbradstreet @MichaelDrogalis

crimeminister commented 5 years ago

I'll throw in Open Collective as an additional source of potential funding. I am a great fan of Onyx and, while I don't have it running in production, would happily contribute my pittance to help keep it live and growing.

MichaelDrogalis commented 5 years ago

Thanks for raising this, @thenonameguy. We're glad that Onyx has continued to be useful to the community. @lbradstreet and I have two thoughts:

1. Your assessment is right. As much as we wish otherwise, @lbradstreet and I don't have the time to invest in Onyx like we used to. We're really proud of what we've built, but there are only so many hours in the day.
2. Before you consider opening up a source of donations, does anyone have a realistic idea of who could contribute to the project and make use of those funds? Onyx is admittedly a big project, and while it's generally in good shape, anyone who maintains it needs to have a reasonable foundation in distributed systems. We're happy to have others maintain Onyx -- we'd just like to make sure it's in the hands of the right folks.

thenonameguy commented 5 years ago

Thanks for the response! I posted this discussion on the #clojure Slack channel; maybe someone will show interest in becoming a regular maintainer. As an alternative, it came to my mind that we could fund development on a per-feature/bug basis (for example via Bountysource), leaving the evaluation of solutions to your free time while you keep complete ownership of the codebase you created.

I think our company would fund some of the proposals that are in your backlog over time.

the2bears commented 5 years ago

Would love to help out. I just need to figure out a way through the red tape at work.

MichaelDrogalis commented 5 years ago

As much as we'd like to, @lbradstreet and I are booked up on time. That said, we don't feel the need to have complete ownership over the codebase. We're more than happy to have other maintainers take the wheel. The only thing we'd like to avoid is passing it to someone at random (and avoid an event-stream-like situation :) )

neuromantik33 commented 5 years ago

We @ Oscaro use Onyx in production and would also like to see it continue, as we've invested considerable resources into building pipelines and IOs (Google Cloud, which we plan to open source). We can chip in, but I'll admit that some of the codebase needs serious documentation and/or more tests: we (my colleague here) have noticed things that really require @MichaelDrogalis or one of the other original authors to shed some light on them. The lifecycle FSM and some of the trickiness involved can be quite daunting without any annotations. Oh, and Aeron is also up there in complexity and instability; at times we have considered just switching to another transport that has an easier learning curve and is more cloud friendly, like gRPC. Anyhow, glad to see things finally moving around here :)

lbradstreet commented 5 years ago

@neuromantik33, it's pretty fair to describe the lifecycle FSM as daunting. We only had so much time, and we needed to make performance somewhat comparable to Flink, so I'm afraid some of the readability was sacrificed in favor of performance there.

I personally found Aeron pretty reasonable to run reliably; however, it was certainly susceptible to GC issues if run in process, and I could see it being pretty easy for the clients to time out when the peers GC. It may be worth starting out by trying to tune the Aeron client and media drivers in a way that makes the behavior more tolerant of normal streaming workloads. That said, having to run a separate media driver process and peer-group does make it a little harder to set up for the first time.

Ultimately, building a full streaming platform, including plugins, with only a few people is a big job without significant community involvement. I think the important idea is the data driven, flexible DSL. It may be worth investigating whether the DSL aspects could be mapped onto something like Flink, so you can get the benefit of the runtime and community engagement, without the overhead of building all of the aspects of a distributed streaming platform.
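To make the "data driven DSL" idea concrete, here is a minimal sketch in Python. It is not Onyx or Flink code: Onyx jobs are Clojure data, and the task names and the `topological_order` helper below are hypothetical. It only illustrates why a workflow described as plain data (a list of edges) is easy to hand to some other runtime: a small translation layer can walk the graph and emit whatever operators that runtime expects.

```python
from collections import deque

# Hypothetical workflow described purely as data: (upstream, downstream) edges.
workflow = [
    ("read-events", "parse"),
    ("parse", "enrich"),
    ("parse", "count"),
    ("enrich", "write-db"),
    ("count", "write-db"),
]


def topological_order(edges):
    """Order tasks so that each one comes after all of its inputs (Kahn's algorithm)."""
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    # Start from the tasks that have no inputs, i.e. the sources.
    ready = deque(task for task, count in indegree.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in successors[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("workflow contains a cycle")
    return order
```

A translation layer for another runtime could iterate over `topological_order(workflow)` and create one operator per task, wiring each to its upstream tasks. How well lifecycles, windows, and triggers would map is a separate question that this sketch does not address.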

arnaudbos commented 5 years ago

While I certainly like Flink, IMO Onyx shines not only because of its data model but also because of its "just a library" approach. Flink is a framework and although they have a FLIP and a few JIRA tickets opened to take advantage of so-called "container modes" for resource management and auto-scaling, I think Onyx is in a really good position to offer these kinds of features a la carte (i.e. not baked into onyx core), am I right?

That said, I've been in and out of exploring onyx core and plugins for a few weeks and I would be glad to help people with more experience in the distsys field. A "beginner friendly" tag on issues and a guided tour of the code base would be really nice. EDIT: I notice there's a "newbie" label, fair enough.

thenonameguy commented 5 years ago

Unrelated note @neuromantik33: On the Aeron reliability side, it turned out that we were hit by a known issue with setting CPU limits in Kubernetes: https://github.com/kubernetes/kubernetes/issues/67577. After disabling those, the number of exceptions related to timeouts between Onyx and Aeron fell significantly.

jgerman commented 5 years ago

@thenonameguy I don't want to pollute this Issue so I made a new one: https://github.com/onyx-platform/onyx/issues/890

I'm curious what reliability issues you saw; we've been fighting Aeron exceptions for a few weeks now.

MichaelDrogalis commented 5 years ago

@arnaudbos In the same vein as @lbradstreet's suggestion, I always thought it would be interesting to map the Onyx information model onto Kafka Streams, since that one truly is just a library.

arnaudbos commented 5 years ago

@MichaelDrogalis indeed, describing a Kafka Streams topology as an Onyx job would allow us to decomplect (sorry) Processor nodes, especially with lifecycles.

What about peer capabilities though?

Storm has a "tag-aware scheduler" and in Flink, the API provides a way to assign operators to specific slots (I think they're called slots) with specific parallelism.
AFAIK (unless it's somewhere in a KIP), Kafka Streams doesn't have a way to specify a custom Stream Thread "assigner" (don't know which word to use: partitioner, scheduler, they're all overloaded...) and every instance runs all the processors. Onyx has peer and task tags which is really valuable (in the Clojure community at least, as illustrated in the docs by a Datomic license, but also for uneven task loads).
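As a rough illustration of what tag-based placement buys you, here is a minimal Python sketch. It is not Onyx's scheduler, and `eligible_peers` and the peer ids are hypothetical names. It assumes the rule that a task may only be placed on peers carrying every tag the task requires, which is how I understand peer and task tags.

```python
def eligible_peers(required_tags, peers):
    """Return the ids of peers whose tags cover every tag the task requires."""
    needed = set(required_tags)
    return [peer_id for peer_id, tags in peers.items() if needed <= set(tags)]


# Hypothetical cluster: only some peers run on machines with a Datomic license.
peers = {
    "peer-1": {"datomic"},
    "peer-2": set(),
    "peer-3": {"datomic", "gpu"},
}
```

With this data, `eligible_peers(["datomic"], peers)` keeps only `peer-1` and `peer-3`, while a task with no required tags can run on any peer.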

What about savepoints?

Onyx and Flink both use ABS (Asynchronous Barrier Snapshotting) for state checkpointing, and Flink has implemented user-triggered savepoints on top of it. I still have to wrap my head around that concept (and oh boy, dig a lot to understand the implementation details...), but something tells me it would be feasible in Onyx, while in Kafka Streams there's an open ticket somewhere but it is not implemented yet.

Quick question:

Does Onyx currently support what Flink calls "End-to-End Exactly-Once Processing"? I think not. Their two-phase commit abstraction is interesting, but I'm not sure how it could be mapped to Onyx's masterless design, since Flink relies on the JobManager for that purpose.

Thank you

I think many of us, in the Clojure/Onyx "community", know you have a lot to do at Confluent, so thank you for taking the time to discuss here.

Mapping the Onyx information model on top of Flink/Kafka Streams or trying to go forward is a matter of tradeoffs it seems.
People relying on it for production will probably be driven to keep it going, I personally like the learning aspect of it.

MichaelDrogalis commented 5 years ago

Full disclosure before I dig in -- a large part of my current role at Confluent is managing Kafka Streams and KSQL. I do have a mild vested interest when I suggest this idea, though I do think it's a good one anyway. :)

KStreams leans on orchestration tools like K8s to manage which applications are participating in particular flows. I think this is ultimately the right solution. Had K8s been more mature when Onyx was being developed, I probably wouldn't have implemented tags, and instead documented some recipes about how to do this with other tools that are more capable.
KStreams doesn't use savepoints, but it does back up its state to underlying Kafka changelog topics. In practice, this yields about the same result that matters: being able to restore state across applications.
I can't remember where we left off with the implementation, but any end-to-end exactly-once semantics would need to depend on the sources and sinks providing some notion of exactly-once, too. We definitely didn't cover that for all supported plugins.
Thanks so much for saying that and supporting Onyx. This project has been a brilliant piece of my life, largely thanks to everyone who took an interest in it. It would be great to see it continue in any form since I do think there are a lot of good, small ideas that make up the whole project.

solatis commented 5 years ago

To steer the conversation back to the original topic, I think it's important to have at least some high-level maintenance of the project: someone who is able to merge PRs, do releases, et cetera.

We are still heavy users of Onyx, but had to fork from mainline Onyx due to the lack of updates (e.g. Clojure 1.10 support, but also other stuff). I would be happy to volunteer to pick up day-to-day operations for the project, to make sure PRs get reviewed, releases can still be pushed out in a timely manner, etc. As both Michael and Lucas know me, it would also avoid event-stream-like situations.

MichaelDrogalis commented 5 years ago

@solatis I'd be very happy to have you aboard. I can set up full access if @lbradstreet agrees.

lbradstreet commented 5 years ago

@solatis, that sounds great! I can walk you through the CI and release process some time too.

solatis commented 5 years ago

Great, let's discuss the logistics over direct message.

MichaelDrogalis commented 5 years ago

@solatis I have invited you as an owner of the onyx-platform GitHub organization.

arnaudbos commented 5 years ago

@solatis if you want to spell out any guidelines for opening issues, submitting PRs, which communication channel(s) to use, the branching model, etc. that would work best for you, please advise.
I'd like to contribute here and there when I can.

solatis commented 5 years ago

@MichaelDrogalis @lbradstreet I think I still need additional instructions / access before I can push a release to Clojars. I see there's quite a bit of magic in a lot of places to make this happen.

I'm available over Slack if you want to reach out to me directly.

MichaelDrogalis commented 5 years ago

Hey @solatis. I've added you to the Clojars org. You should be able to trigger releases now with the release scripts (they're under script/) in each repo. Can still answer any questions as needed though.

solatis commented 5 years ago

My Clojars username is also @solatis.

What is the normal release process like? I manually updated relevant files and tagged branches, and it seems like a deploy of 0.14.5 to Clojars did succeed, but CircleCI generated a permission / connection failed error => https://circleci.com/gh/onyx-platform/onyx/6868

It could be an unfortunate network glitch, though.

thenonameguy commented 5 years ago

Hey @solatis! 👋
Would you be interested in pushing your internal fork of Onyx with your improvements (clj 1.10 support, etc.)? We are also pretty close to forking the repo for our own needs, it would be nice if we had a more up-to-date base, even if it contains some breakage.

solatis commented 5 years ago

@thenonameguy are there any specific features you are looking for? I've merged most of the things, and we're using Onyx + clojure 1.10 ourselves without any problems at this point.

matanox commented 4 years ago

Experience with similar acquihires shows that the project simply dies over the course of a few years, regardless of any faintly reassuring statements made early on, and a new (differently commercial) similar project from the acquiring company does not always emerge. It might be good to assume something along these lines, but this is just my two cents from coming here to check on the progress of this framework.

crimeminister commented 4 years ago

There seems to be some more energy being invested into wrappers for Beam / Dataflow, which I appreciate, but Onyx is still the best thing I have personally used in this space.

youvere commented 4 years ago

I've worked with both (Beam/Dataflow and Onyx) in Clojure so far. The simplicity of expressing a pipeline in Onyx is impressive.

RBerkheimer commented 4 years ago

Like Mike said last year(?), or whenever he and Lucas left, he believes the information model Onyx built was great, and that's what should persist. For those who use Onyx as it stands: what would be needed to ensure Onyx maintains a growth and security posture? Extension to other languages? Improved plugins? Improved security posture? What is needed for this project to be considered 'active'?

onyx-platform / onyx

Project maintenance going forward #887