vendure-ecommerce / vendure

The commerce platform with customization in its DNA.
https://www.vendure.io
MIT License
5.39k stars 939 forks

Make Vendure serverless proof #1799

Open martijnvdbrug opened 1 year ago

martijnvdbrug commented 1 year ago

Is your feature request related to a problem? Please describe. Most (if not all?) serverless environments don't allow processing outside the request context. Vendure's event handling has some async processes that occur outside the request context, for example the apply-collection-filters job.

Event handling occurs after a response has been sent, because event.publish() is fire and forget (as it should be).

Describe the solution you'd like It would be nice if we could somehow be sure that no background tasks are still running when a response has been returned.

Describe alternatives you've considered One option I see is being able to implement our own remote eventBus strategy. We could do await eventBus.publish() and be sure that the event is published. Event handling could then be done similarly to how worker jobs are handled: in their own async request context, depending on what the configured 'eventbus strategy' implements.
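A rough sketch of what such a configurable strategy could look like. All names here (EventBusStrategy, InMemoryEventBusStrategy) are hypothetical; nothing like this exists in Vendure today:

```typescript
// Hypothetical sketch: EventBusStrategy and InMemoryEventBusStrategy are
// invented names, not part of Vendure's current API.
interface EventBusStrategy {
  publish(event: object): Promise<void>;
  subscribe(handler: (event: object) => Promise<void>): void;
}

// Default in-memory strategy: awaiting publish() guarantees every subscriber
// has finished before the response is sent. A remote strategy could instead
// hand the event to Cloud Tasks, Pub/Sub, etc. inside publish().
class InMemoryEventBusStrategy implements EventBusStrategy {
  private handlers: Array<(event: object) => Promise<void>> = [];

  subscribe(handler: (event: object) => Promise<void>): void {
    this.handlers.push(handler);
  }

  async publish(event: object): Promise<void> {
    // Run all handlers and resolve only once each one is done.
    await Promise.all(this.handlers.map(h => h(event)));
  }
}
```

With this shape, `await eventBus.publish(...)` inside the request handler is enough to know no event work is left dangling when the response goes out.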

A workaround for Google Cloud Run is to enable background processes, which basically disables the serverless behaviour and makes it a normal, always-running container (resulting in a ~3x price increase).

Additional context This problem hasn't occurred for us yet, but if an event handler performs resource-intensive tasks that take longer than a few seconds, serverless environments would just kill the process.

This problem can occur with Google Cloud Run, Google Cloud Functions, AWS Lambda and probably similar products from other providers.

Is this something that needs to be implemented in Vendure, or is this a "won't support" feature? There is no shame in not supporting function/lambda-like environments, but semi-serverless/autoscaling environments like Google Cloud Run are great platforms for saving costs.

hendrik-advantitge commented 1 year ago

It occurs to me that the eventbus and job queue essentially accomplish very similar tasks. Would it be an option to merge the two together?

skid commented 1 year ago

This is, right now, the main thing missing for me in Vendure. We are running on GCR, which is an amazing service, but Vendure is not a terribly good fit for it because it relies on a long-running server. The main things that need to be addressed, IMO, are:

  1. Everything that runs out-of-order with respect to the request should be abstracted away into a task queue. If that's not possible, it should be "synchronous", i.e. the request should not return until it's done. This means that events should be either awaitable or executed via a task queue. This is most likely a breaking change, but IMHO better to make it early than late.
  2. Vendure should significantly reduce the startup time. It takes around 20 seconds to spin up an instance, which is a problem with autoscaling. We are, right now, experiencing some issues with random spikes in response latencies, most likely connected with the fact that new instances started by GCR take too long to become operational. I can't confirm this yet, so grains of salt and all that.

Here is a transcript from a Slack discussion between you, me and @michaelbromley on Feb 25 2022 (thanks to @ivasilov for digging it up)

discodancer Given that the event bus service publish method is not async, does that mean that the event hooks can't be awaited? I just noticed this - it's not an issue at the moment, but it can become a serious one, and very hard to debug, with serverless services like Google Cloud Run, because they can arbitrarily kill the process after it's done servicing a request. This is why, for instance, using setTimeout / setInterval to schedule stuff is a bad idea.

Martijn I think async handling of events does happen in the background, indeed. I have encountered some issues where I generate a PDF file for the order-confirmation email, but these sometimes don't succeed because Google throttles the CPU after the request.

Martijn Dirty hack: I schedule a job to call /admin every minute, which keeps the server alive. This handles most of the background processes, as long as it's nothing CPU-intensive like generating PDFs.

Michael Bromley Yes, that's right - the publish() method is intentionally unaware of any subscribers. Think about it the other way around - a long-running subscriber function that is awaited in the request path will increase the response time for the request. This is also undesirable. Furthermore, following the observer pattern, I think it is more expected to just fire an event and not be concerned with what any observers then do with that event. My idea for long-running tasks triggered by events is that they should be sent to the job queue immediately. So e.g. in the PDF example, all the subscriber would do is to call jobQueue.add(...) . Does GCR kill the server process so fast that even that would fail?
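The pattern Michael describes can be sketched roughly as below. `FakeJobQueue`, `pdfQueue`, and `onOrderConfirmed` are stand-ins for illustration, not Vendure's actual JobQueueService API:

```typescript
// Sketch of "subscriber only enqueues a job": the event handler does no heavy
// work itself. FakeJobQueue is a toy stand-in for a real job queue.
interface Job {
  name: string;
  data: unknown;
}

class FakeJobQueue {
  jobs: Job[] = [];

  // Fast and awaitable: just records the job for a worker to pick up later.
  async add(name: string, data: unknown): Promise<Job> {
    const job = { name, data };
    this.jobs.push(job);
    return job;
  }
}

const pdfQueue = new FakeJobQueue();

// Event subscriber: returns almost immediately, so even an aggressive
// serverless runtime is unlikely to kill it mid-flight. The slow PDF
// generation happens later, on the worker.
async function onOrderConfirmed(orderCode: string): Promise<void> {
  await pdfQueue.add('generate-order-pdf', { orderCode });
}
```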

discodancer The jobQueue.add is awaitable, and generally pushing stuff to the jobQueue is fast, so we're OK there. As for GCR - it's really undetermined when it will kill the process. I've had situations where 10 queries were run, and other situations where only 1 query was run after a response was returned...

discodancer But I agree - if you want to be serverless-proof, you should in principle design all async tasks so that they can be delegated to a job queue.

Martijn I completely agree that the publish should be unaware of subscribers. It's possible for me to work around it now that I know the CPU is throttled after the request. The issue remains that Google can change this at any time, so maybe in the future it will revoke computing power directly after a request has been handled. I think it already allows less time now than a few years ago (just a feeling, I have no way of backing this up :wink: ). The only thing I can think of is having a configurable eventBus strategy. That way, users on Google can implement the event bus using Google Cloud Tasks or Pub/Sub or whatever. This way we can await the eventBus.publish() to publish within the request context, but subscribers will work similarly to how worker jobs are added. Or even: a config that always publishes events to a worker queue.

Michael Bromley The thing with the EventBus is that it needs to be able to publish in-memory objects to subscribers. For example, if we publish an OrderEvent which includes a reference to an Order object, then that object has getters which cannot be serialized. So using some external transport like we do with the job queue would be really tricky.
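The serialization problem is easy to demonstrate: class getters live on the prototype and are not enumerable own properties, so a JSON round trip silently drops them. `Order` here is a toy class, not Vendure's entity:

```typescript
// Toy Order class illustrating why in-memory entities can't simply be sent
// over an external transport: the `total` getter does not serialize.
class Order {
  constructor(public code: string, private lines: number[]) {}

  get total(): number {
    return this.lines.reduce((a, b) => a + b, 0);
  }
}

const order = new Order('T_001', [100, 250]);
const wire = JSON.parse(JSON.stringify(order));

console.log(order.total); // 350
console.log(wire.total);  // undefined: the getter did not survive the round trip
```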

Martijn Hmm, I'm also not sure what to do about it. It's also not something urgent, I feel. I know a lot of microservice environments use an external pub/sub or event bus. I guess they just make sure everything is serializable.

discodancer One other reason why I think events should be awaitable is that you sometimes explicitly want a change from an event to affect the response. For example, I recently opened an issue for an "OrderLine" event to be triggered when an order line is modified. The purpose of this is for me to have a chance to programmatically add items if certain promotion conditions are met (e.g. you have a BOGO promotion, I want to add 1 beer mug each time you buy 2 beers). However, if the event is not awaitable, the request will return the old order state before the event listener adds the beer mug ...

Michael Bromley yeah that makes sense. So the current implementation, which just exposes an rxjs observable stream, is (AFAIK) impossible to make awaitable. I think we'd need to add a new method to EventBus which allows you to subscribe in a blocking way, so you provide an async callback which gets awaited as part of the publish() method. That way, it gives the subscriber code the power to decide whether to block the response. So by default you'd always use .ofType().subscribe() for non-blocking functionality, but in those rarer cases you could use .ofType().handle(async () => ....) or whatever.
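The hypothetical `.handle()` idea above could be sketched like this. None of these names exist in Vendure today; `BlockableEventBus` is invented purely to show the blocking vs. non-blocking split:

```typescript
// Invented sketch of a bus where publish() awaits blocking handlers
// registered via handle(), while plain subscribers stay fire-and-forget.
type Handler<T> = (event: T) => Promise<void> | void;

class BlockableEventBus {
  private blocking: Array<{ type: Function; fn: Handler<any> }> = [];
  private nonBlocking: Array<{ type: Function; fn: Handler<any> }> = [];

  // Rare case: the handler is awaited inside publish(), so it can
  // affect the response (e.g. add the beer mug before the order returns).
  handle<T>(type: new (...args: any[]) => T, fn: Handler<T>): void {
    this.blocking.push({ type, fn });
  }

  // Default case: observer-style; publish() does not wait for it.
  subscribe<T>(type: new (...args: any[]) => T, fn: Handler<T>): void {
    this.nonBlocking.push({ type, fn });
  }

  async publish(event: object): Promise<void> {
    for (const { type, fn } of this.nonBlocking) {
      if (event instanceof type) void fn(event); // fire and forget
    }
    for (const { type, fn } of this.blocking) {
      if (event instanceof type) await fn(event); // awaited in the request path
    }
  }
}
```

The consumer, not the producer, decides whether to block: producers just `await bus.publish(event)` either way.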

discodancer That is one way to do it (I like it more, personally). Another way is to add a "hooks" (or similarly named) service which would be just like events, but blocking. Since you would need to go and add an await to each place you call eventBus.publish, this is not much more work, and it makes everything more explicit. You could even extend the API in such a way that a service that calls an "OrderModified" hook will continue to use the order object returned by the hook. This is just off the top of my head, I haven't really thought it through.

Michael Bromley The advantage of adding a method to the existing EventBus is that the event producer does not need to know or care about whether the subscriber wants to block or not. As long as we await eventBus.publish(...), the consumer gets to decide. "You could even extend the API in such a way that a service that calls an "OrderModified" hook will continue to use the order object returned by the hook" - this idea is giving me flashbacks of my days developing with AngularJS, where you'd use "watchers" to react to and mutate all sorts of things. My gut feeling is that something like this will quickly lead to unmaintainable code, as the flow of logic & mutation quickly becomes unknowable. AngularJS made me very wary of over-reliance on event-based models. The nightmares begin where you end up with loops and cascading events firing, and it becomes almost impossible to debug.

ivasilov commented 1 year ago

Or the slack link for convenience https://vendure-ecommerce.slack.com/archives/CKYMF0ZTJ/p1645746002899449

michaelbromley commented 1 year ago

Thanks @skid & everyone for the feedback.

Everything that runs out-of-order with respect to the request should be abstracted away into a task queue. If that's not possible, it should be "synchronous", i.e. the request should not return until it's done. This means that events should be either awaitable or executed via a task queue. This is most likely a breaking change, but IMHO better to make it early than late.

So the core issue here is that event subscribers might get killed mid-processing because the response has already been sent. This issue is also discussed here:

As I mention there and in the slack conversation above, I am hesitant to completely alter the observer pattern on the EventBus - and also yeah this is actually not at all possible with rxjs as it stands. But maybe there is some clever way to have a global setting like "await all subscribers before sending response", and then we do some kind of ref counting on all events published in that context, and figure out some way of waiting for the subscriber functions to return. Again, I'm not at all sure whether this is technically possible, but it is worth some research.
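The "ref counting" idea could look something like the sketch below. `PendingWorkTracker` is an invented name, not a Vendure API; the real open question (how to know which promises belong to which request context) is not addressed here:

```typescript
// Invented sketch: every piece of subscriber work triggered during a request
// registers its promise, and the request pipeline drains the set before the
// response is allowed to go out.
class PendingWorkTracker {
  private pending = new Set<Promise<unknown>>();

  // Called by the event bus for each subscriber invocation.
  track<T>(work: Promise<T>): Promise<T> {
    this.pending.add(work);
    const done = work.finally(() => this.pending.delete(work));
    done.catch(() => { /* rejections still surface to the caller via `work` */ });
    return work;
  }

  // Called just before the response is sent.
  async drain(): Promise<void> {
    // Subscribers may schedule further work while we wait,
    // so loop until the set is quiescent.
    while (this.pending.size > 0) {
      await Promise.allSettled([...this.pending]);
    }
  }
}
```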

Or the other option, as has been suggested, is to scrap the EventBus altogether and unify all async tasks into the job queue. One major issue with that is the overhead: the default job queue adds a DB row for every system event (of which there are many), and on top of that it has to poll in order to pick up new jobs. That introduces not only network latency to the events system, but also polling latency, as well as increased storage and DB load.

The great thing about the rxjs observable implementation is that it is very lightweight and fast, with no external systems involved.

I'm definitely interested in solving this and having a totally solid design for serverless architectures. I'm just not yet sure what the best option is. If anyone wants to put together a POC of some new approach, I'm very happy to look at suggestions!

Vendure should significantly reduce the startup time. It takes around 20 seconds to spin up an instance, which is a problem with autoscaling. We are, right now, experiencing some issues with random spikes in response latencies, most likely connected with the fact that new instances started by GCR take too long to become operational. I can't confirm this yet, so grains of salt and all that.

I think there is a lot of work that can be done here with regard to things like deferring work until first needed. I believe @hendrik-advantitge & team have done a little research into this area for the same reason, as they are running on GCR too.

skid commented 1 year ago

Here's a few thoughts:

  1. Vendure internals should not depend on RxJS events, so the code needs to be rewritten to use one of the two mechanisms below instead of RxJS. The event system as it is now is not serverless-safe, and being async, it does not allow me to actually await the events. So, for example, if I want to react to an Order event and modify the order before the response goes back to the user - I can't - because by the time the event handler gets executed, the response has already been sent.
  2. Instead, Vendure internals should depend on a synchronous event system in combination with a task queue for async event processing. Basically, we should assume that async stuff always happens off-process and everything that is not off-process finishes before the response is sent.
  3. The RxJS event system should not be changed or removed, because of backwards compatibility; we can deprecate it in the mid-term.
  4. I understand that a database-polling mechanism is a simple solution for a task queue, but it should not be the bottleneck for improving the platform. I also don't think the polling mechanism is the problem in itself, but rather the number of tasks that are generated. I would have my fans spin up when working locally, because on every change Vendure added thousands of events to recalculate the categories (I have a lot of categories). There are two ways to solve this that I can think of: 1) reduce the need for a task queue - for example, reindex products immediately upon saving instead of adding a task; and 2) increase the polling interval (I personally don't see a problem with having the index be updated after 1, 3, or even 10 seconds).
  5. Another thing to consider is to ship Redis by default ...
xrev87 commented 1 year ago

So not necessarily related to a requirement to run this on serverless arch, but for a while now I've been grappling with a problem that I think could be solved here and also in turn solve for the original issue.

That problem is the need for a centralised event bus - right now for me there are two types of events:

Right now, by running things in a farmed environment, the only way to cater for the second type is by using the job queue, but that creates job entries for each event and is therefore undesirable (it's equally undesirable when using search in unbuffered mode).

As well as wanting some events to be consumed by multiple instances of vendure, I also want to be able to subscribe to those events and sink them into other places from independent services.

I like the simplicity of RXJS, I do - but at the same time I often wonder if we could have different strategies for the event bus; so maybe I could plug in Rabbit, Kafka or the like. In so doing - we could solve for the original issue, arguably with some additional deployment complexity, but you'd gain a lot as well.

Where this gets complicated is:

ashitikov commented 1 year ago

It occurs to me that the eventbus and job queue essentially accomplish very similar tasks. Would it be an option to merge the two together?

I had the same thoughts, but as stated above: it'll introduce a lot of jobs in the job queue, and personally I would no longer have a way to "intercept" events and modify data "in-place" inside the transaction.

Events that need to be consumed by all instances of Vendure

Having a common subscriber, like this.eventBus.subscribe(event => ...), may solve this and a lot of other problems. With such a mechanism you could integrate 3rd-party queues or transports and push events into them.
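A catch-all subscriber forwarding to an external transport could look like the sketch below. `Transport`, `forwardAllEvents`, and the `subscribeAll` hook are invented names; the transport stands in for Redis pub/sub, Google Pub/Sub, RabbitMQ, etc.:

```typescript
// Invented sketch: one catch-all handler that forwards every event to an
// external transport so other Vendure instances (or independent services)
// can consume them.
interface Transport {
  send(topic: string, payload: string): Promise<void>;
}

function forwardAllEvents(
  subscribeAll: (handler: (event: object) => Promise<void>) => void,
  transport: Transport,
): void {
  subscribeAll(async event => {
    // Only serializable data survives the trip: getters and entity
    // methods are lost, as discussed earlier in this thread.
    await transport.send(event.constructor.name, JSON.stringify(event));
  });
}
```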

I would like to have a middleware pattern for the EventBus. The RxJS pattern could be kept for backward compatibility.

const composer = new EventBusComposer();
this.eventBus.use(composer);

// Example 1: put a new job only for OrderEvent
composer.use(async (event, next) => {
  if (event instanceof OrderEvent) {
    await this.jobQueue.add({ event });
  }
  await next();
});

// Example 2: prevent next middleware call for specific transition
composer.use(async (event, next) => {
  if (event instanceof OrderStateTransitionEvent && event.state === 'ArrangingPayment') {
    return;
  }
  await next();
});

// Example 3: composers can be mixed
const composer2 = new EventBusComposer();
composer.use(composer2);

// Example 4: sugar
composer.ofType(OrderEvent, event => ...);
composer.catch(error => ...);
composer.use(asyncAfterTransactionCommit); // Pre-defined middleware

// Example 5: backward-compatibility composer which fires after the transaction completes
const composer3 = new EventBusComposer();
composer3.use(asyncAfterTransactionCommit); // Everything below will be called after the transaction completes
composer3.ofType(OrderEvent, event => {
  if (event.customFields.test === 'test') {
    Logger.warn('Test in the custom fields');
  }
});
this.eventBus.use(composer3);

// Example 6: publish an event
// Notice the await here
await this.eventBus.publish(new OrderEvent(...));

The middleware pattern has proven its flexibility, and could be used in many ways in Vendure, including integration with 3rd-party services and handling events in both blocking and non-blocking ways.
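A minimal sketch of how such a composer chain could execute internally, Koa-style (the `run` method and this reduced `use` signature are invented for illustration; the proposal above wires composers into the EventBus via `use`):

```typescript
// Minimal Koa-style middleware chain: each middleware receives the event and
// a next() that invokes the rest of the chain, so it can enrich the event,
// delegate work, or short-circuit by returning without calling next().
type Middleware = (event: object, next: () => Promise<void>) => Promise<void>;

class EventBusComposer {
  private middlewares: Middleware[] = [];

  use(mw: Middleware): this {
    this.middlewares.push(mw);
    return this;
  }

  // run() resolves only when every middleware that called next() has
  // finished, which is what makes `await eventBus.publish(...)` meaningful.
  async run(event: object): Promise<void> {
    const dispatch = async (i: number): Promise<void> => {
      if (i < this.middlewares.length) {
        await this.middlewares[i](event, () => dispatch(i + 1));
      }
    };
    await dispatch(0);
  }
}
```

A middleware that returns without calling next() blocks everything downstream, which is exactly the behaviour Example 2 above relies on.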

Logic-Bits commented 1 year ago

Hi all,

saw the documentation here: https://www.vendure.io/docs/developer-guide/deployment/#serverless--multi-instance-deployments but as far as I understand, only the worker works 100% multi-instanced. The core / server can still only run as a single instance, correct? (Also, some kind of load-balancing service for the requests is missing, I assume.)

Thanks!

michaelbromley commented 1 year ago

@Logic-Bits both the server and worker can be run with multiple instances in parallel. There's an updated docs page with more information about this here: https://www.vendure.io/docs/deployment/horizontal-scaling/

michaelbromley commented 1 year ago

@ashitikov I just re-read your middleware-based proposal and I find it very interesting. Would you be interested in developing a proof-of-concept implementation of this idea?

I'm currently working to wrap up the breaking changes for v2.0, which is why I'm revisiting this. With a proper compatibility composer, it looks like this feature might be possible without breaks.

Logic-Bits commented 1 year ago

Ahh, thank you very much. The docs are clear now!

ashitikov commented 1 year ago

@ashitikov I just re-read your middleware-based proposal and I find it very interesting. Would you be interested in developing a proof-of-concept implementation of this idea?

I'm currently working to wrap up the breaking changes for v2.0, which is why I'm revisiting this. With a proper compatibility composer, it looks like this feature might be possible without breaks.

Yes, I am interested, but honestly I don't have free time for that right now. Possibly in Q2, I think. What possible breaks do you see? I think it's fully compatible with the current version.

michaelbromley commented 1 year ago

@ashitikov OK, in that case I will defer this to a later release after you get time to work on it. If no breaking changes we can get it in a 2.x minor release.

Logic-Bits commented 10 months ago

@Logic-Bits both the server and worker can be run with multiple instances in parallel. There's an updated docs page with more information about this here: https://www.vendure.io/docs/deployment/horizontal-scaling/

Hi @michaelbromley, just a follow-up question. If we scale the server with multiple instances, how would we ensure that DB changes won't be handled / applied multiple times?

michaelbromley commented 10 months ago

@Logic-Bits with multiple instances of the server, any given request will still only arrive at a single instance (typically as chosen by the load balancer). So there should be no danger of the same request triggering multiple updates.

Logic-Bits commented 10 months ago

@michaelbromley I mean on startup, when DB changes are applied, either through synchronize: true or a migration.

michaelbromley commented 10 months ago

@Logic-Bits Ah I understand. There's no built-in handling of this scenario, but it's discussed here:

Perhaps it would be a good feature to build into Vendure's migration wrapper!

Logic-Bits commented 10 months ago

THX! Will check it out over there.