Closed pitkes22 closed 1 year ago
Hey,
thanks for a well described issue. At the first look, it seems like a bug in core. Async webhook should be non-blocking.
The implementation of the app indeed should (and will be) improved (we are working on migration to non-serverless infra where we can utilize queues and long running workers). Nonetheless app should not degrade Saleor performance (except of sync webhooks where this sync by design).
We will look into that
Hey
I tried to reproduce it on Saleor Cloud
Looks like there is no meaningful difference with Search App being enabled or disabled
Saleor itself is async on webhooks: https://github.com/saleor/saleor/blob/main/saleor/plugins/webhook/tasks.py#L187
I assume you are using self-hosted Saleor? You can try free developer cloud sandbox and check if you still have a problem.
Hey @pitkes22 ,
Just to confirm. Do you have celery to handle the asynchronous Saleor's tasks?
Also, make sure that settings.py::CELERY_TASK_ALWAYS_EAGER
is set to False.
From the celery docs:
CELERY_TASK_ALWAYS_EAGER: If this is True, all tasks will be executed locally by blocking until the task returns. apply_async() and Task.delay() will return an EagerResult instance, which emulates the API and behavior of AsyncResult, except the result is already evaluated.
So locally if you have CELERY_TASK_ALWAYS_EAGER
== True, then all async webhooks will be called synchronously.
Hey, thank you for the response. I feel stupid :man_facepalming: You are absolutely correct, I am self-hosting, and since I am in the pre-production environment I assumed that the Celery Worker is something I have to only worry about when I want to achieve better scalability of the platform. Also, when I was looking into the Saleor logs, I saw a lot of logs mentioning Celery Tasks so I assumed that there is something like a built-in instance in the Saleor.
From checking the documentation now, I found it quite unclear what is the behavior when there is no worker. The is a nicely documented flow of the asynchronous event when there is a worker, but no mention of the behavior without the worker and that running without the worker will cause all the "asynchronous" webhooks become synchronous. However, I assume that this is not the intended use of the platform, but anyway, it would be nice to mention this at least in the self-hosting guide.
Of course, setting up the celery worker solved the issue.
But when I think about it, my point about returning the 200
response as soon as possible still holds because time to make a request to Algolia * number of product variant
can easily exceed the webhook timeout. Enabling the celery worker solved the problem of mutation being blocked by the webhook, but the underlying issue is still there.
But when I think about it, my point about returning the 200 response as soon as possible still holds because the time to make a request to Algolia * number of product variants can easily exceed the webhook timeout. Enabling the celery worker solved the problem of mutation being blocked by the webhook, but the underlying issue is still there.
I'm not sure where the issue is now (except docs not mentioning this). Of course, returning immediately 200 will speed this up, but it will only cause worker to spawn a new job faster, by cleaning up the queue faster. But the cost of that will be a risk of not executing Algolia operation (Saleor will not repeat the failed job and you will never know about it).
Long term we are working on changing the stack of apps. We want apps, including Algolia to work around the queue, so Saleor will call e.g. SQS, and receive an immediate 2xx response. Then, a queue dedicated to apps will process events for apps locally. With a queue, we can ensure proper execution. However, we need some time to introduce this new architecture. Until we have it, we need to move this responsibility to Saleor Core itself
I am using the Search app, and after enabling it I noticed performance degradation when editing the products. The most noticeable performance degradation occurred when trying to change the product category.
For example, a simple mutation like this:
Will run for multiple seconds with Search App enabled (Sometimes even not finishing at all and finishing with
504 Gateway Timeout
):After disabling the Search App the response time fell back to the milliseconds range.
First, it occurred when editing the products in the dashboard but then I reproduced it also with a simple mutation above. Even weirder is that it causes a complete halt on all GraphQL requests. While
updateProduct
mutation is executing, it is not possible to execute any other mutation, for example, add product to cart. But when the long-running mutation finishes, at the same moment the other mutation also finishes.Initially, I was expecting that there was something wrong with my deployment and that the delay was between My Saleor Instance and Search App Instance, but after checking it out there was no major delay other than a couple of network hops through reverse proxy.
I checked the Saleor logs and saw that it was reporting that the
product_updated
webhook to the Search App was timing out. So I checked the implementation of theproduct_updated
webhook and figured out that the Search App is responding to the webhook only after updating the changes in Algolia.https://github.com/saleor/apps/blob/7e0755ec9ef515130badeea1770548002bc206e8/apps/search/src/pages/api/webhooks/saleor/product_updated.ts#L32-L40
This means that the roundtrip for the webhook looks like this:
Product Update Mutation
->Product Updated Webhook
->Search App
->Request to Angolia (for each product variant!)
and only after all of this the200
is returned andproductUpdate
mutation finishes.I might have misunderstood the concept of Asynchronous webhooks but after reading the documentation and experimenting with webhooks myself I know:
So I think that the webhook should return
200
immediately after receiving the data and not use the response/return code to indicate the status of the subsequent webhook task. But of course, this complicates things a bit because now the app can't rely on webhook redelivery to retry when an update to Algolia fails. I am not sure how this should be handled from an app design perspective.I also tried to implement the fix myself by just moving the code that returns
200
up a couple of lines and it indeed fixed the issue.This resulted in a much lower response time: