supabase / edge-runtime

A server based on Deno runtime, capable of running JavaScript, TypeScript, and WASM services.
MIT License

InvalidWorkerCreation: Edge functions cannot handle concurrent requests #408

Open nathanaeng opened 6 days ago

nathanaeng commented 6 days ago

# Bug report

## Describe the bug

Making concurrent requests to a Supabase edge function will result in InvalidWorkerCreation errors or 502 errors.

## To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

1. Using the Supabase CLI, create a new function with `supabase functions new test_concurrency`. Here is an example of a function I have (I realize the createClient is not used):

    import "jsr:@supabase/functions-js/edge-runtime.d.ts"
    import { createClient } from 'jsr:@supabase/supabase-js@2';

    console.log("Hello from Functions!")

    Deno.serve(async (req) => {
      const supabaseClient = createClient(
        Deno.env.get('SUPABASE_URL') ?? '',
        Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? '',
      );
      const { name } = await req.json()
      const data = {
        message: `Hello ${name}!`,
      }

      return new Response(
        JSON.stringify(data),
        { headers: { "Content-Type": "application/json" } },
      )
    })


2. Run `supabase functions serve`

3. In a new terminal tab, execute this bash script which sends 200 concurrent requests, replacing `SERVICE_ROLE_KEY` with your service role key:

    #!/bin/bash

    seq 1 200 | xargs -n1 -P0 -I{} curl -L -X POST 'http://localhost:54321/functions/v1/test_concurrency' -H 'Authorization: Bearer SERVICE_ROLE_KEY' --data '{"name":"Example"}'


4. Notice how it will successfully execute the function for the first 100 or so requests, before erroring on the `supabase functions serve` tab:

    InvalidWorkerCreation: worker did not respond in time
        at async UserWorker.create (ext:sb_user_workers/user_workers.js:145:15)
        at async Object.handler (file:///root/index.ts:154:22)
        at async respond (ext:sb_core_main_js/js/http.js:163:14) {
      name: "InvalidWorkerCreation"
    }

with the following error message on the tab that executes the test script:

    {"code":"BOOT_ERROR","message":"Worker failed to boot (please check logs)"}
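For anyone who would rather see a per-status breakdown than raw curl output, the reproduction steps above could be sketched in TypeScript roughly as follows. This is only an illustrative sketch: `fireConcurrently`, `Send`, and `sendToLocalFunction` are hypothetical names that do not appear in the original report, and the URL and `SERVICE_ROLE_KEY` placeholder are copied from the bash script.

```typescript
// Hypothetical helper (not part of the original report): starts `total`
// requests at once and tallies the outcomes by HTTP status code.
type Send = (i: number) => Promise<{ status: number }>;

async function fireConcurrently(
  total: number,
  send: Send,
): Promise<Map<string, number>> {
  const tally = new Map<string, number>();
  const bump = (key: string): void => {
    tally.set(key, (tally.get(key) ?? 0) + 1);
  };

  // All requests are started before any of them is awaited.
  await Promise.all(
    Array.from({ length: total }, (_, i) =>
      send(i).then(
        (res) => bump(String(res.status)),
        () => bump("network_error"), // rejected promise, e.g. connection reset
      )),
  );
  return tally;
}

// Sender that mirrors the curl command from the bash script.
// Replace SERVICE_ROLE_KEY with your service role key.
const sendToLocalFunction: Send = () =>
  fetch("http://localhost:54321/functions/v1/test_concurrency", {
    method: "POST",
    headers: { Authorization: "Bearer SERVICE_ROLE_KEY" },
    body: JSON.stringify({ name: "Example" }),
  });

// Usage: fireConcurrently(200, sendToLocalFunction).then(console.log);
```

The sender is passed in as a parameter so the tallying logic can be exercised with a stub, without a running Supabase stack.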



## Expected behavior

I would expect the edge function to be able to handle concurrent requests to this degree.

## Screenshots

<img width="897" alt="image" src="https://github.com/user-attachments/assets/ec131dda-be11-41d3-8496-b28a1db30f5c">

## System information

- OS: macOS, M3 Max
- Version of supabase-js: 1.192.5, using supabase-edge-runtime-1.58.2 (compatible with Deno v1.45.2)
- Version of Node.js: 18

## Additional context

From my understanding, edge functions can be used to serve API routes, and in a production application it is perfectly reasonable that you would have 200 users hit the same endpoint at the same time. This example uses an edge function with minimal computations. If you add database reads, a text embedding call using `Supabase.ai gte-small`, and a database write, it can handle even fewer concurrent requests (around 40 from my testing). I noticed this issue at first because I wanted to generate text embeddings on seed data consisting of only 40 users (which gets triggered on inserts to a table) but it failed to work for every user.

I'm not entirely sure how edge functions work. Maybe a worker is being reused to handle multiple requests until a CPU limit or similar is hit, resulting in failures. But I thought the idea of edge functions is to scale up with requests, and a mere 200 requests is nothing.

At first I thought that this could be a problem with local Supabase running in Docker, but I also confirmed this occurs on a remote Supabase project (hosted by Supabase), where I get 502 errors after the first 50-100 requests or so.

ethan-dinh commented 6 days ago

I have encountered a similar issue when trying to call an edge function multiple times concurrently. In my case, making a lot of calls resulted in InvalidWorkerCreation errors or 502 errors. It seems that the scaling ability of edge functions might be limited and this significantly impacts performance when concurrent requests spike.

I feel like other serverless functions can handle concurrent requests with ease, yet edge functions can't even handle 50? Is Supabase not equipped to handle more than 50 concurrent requests? It seems as if the edge function is attempting to create a worker for every single request rather than queuing or using some implementation to resolve concurrency on a large scale.
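Until the runtime itself queues requests, the queuing idea can at least be approximated on the caller's side. The following is a minimal sketch, assuming the caller controls how requests are issued (for example a seed script); `mapWithConcurrency` is a hypothetical helper and not part of Supabase or the edge runtime. It does not address the underlying worker-creation failures, it only caps how many requests are in flight at once.

```typescript
// Hypothetical client-side limiter: runs `task` over `items` with at most
// `limit` promises in flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each runner claims the next unclaimed index until none are left.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T, index);
    }
  };

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, () => runner()),
  );
  return results;
}
```

For example, `mapWithConcurrency(users, 10, (u) => callFunction(u))` would keep at most 10 calls in flight, where `users` and `callFunction` stand in for your own data and request code. If any task rejects, the returned promise rejects with that error.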

nyannyacha commented 6 days ago

Hello @nathanaeng and @ethan-dinh

I am not a member of the Supabase team that works on Supabase Edge Functions, but as the edge runtime maintainer, I'm sorry I didn't meet your expectations 😞

Given the function code and bash script you posted in the description, and assuming you're using the default edge runtime policy settings in supabase/cli, I can explain why the edge runtime is showing such low request throughput.

The edge runtime has three main scheduling policies for workers (per_worker, per_request, oneshot). For developers' convenience, supabase/cli defaults to the one that Supabase Edge Functions does not use: the oneshot policy.

Unlike the other policies, the oneshot policy does not reuse workers. It creates a new worker for each request and forwards the request to it, even when requests share the same service path. supabase/cli chose this policy as the default because developers can change the source code at any time, and the next request should reflect the changed code. It is not used in production (including Supabase Edge Functions) because, for the reasons described above, it is highly inefficient.
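As a rough illustration of that difference, here is a toy TypeScript model that counts how many workers each policy would create for the same traffic. It is not the edge runtime's actual implementation: `ToyWorkerPool`, `ToyWorker`, and `dispatch` are hypothetical names, and worker boot cost, timeouts, and teardown are ignored. The per_request policy is left out because its semantics are not described in this thread.

```typescript
type Policy = "oneshot" | "per_worker";

interface ToyWorker {
  id: number;
  servicePath: string;
  handled: number; // number of requests this worker has served
}

// Toy model only: tracks worker creation, nothing else.
class ToyWorkerPool {
  private readonly policy: Policy;
  private readonly workers = new Map<string, ToyWorker>();
  created = 0;

  constructor(policy: Policy) {
    this.policy = policy;
  }

  dispatch(servicePath: string): ToyWorker {
    // per_worker: reuse the worker already bound to this service path.
    // oneshot: never look up an existing worker.
    let worker = this.policy === "per_worker"
      ? this.workers.get(servicePath)
      : undefined;

    if (worker === undefined) {
      worker = { id: ++this.created, servicePath, handled: 0 };
      if (this.policy === "per_worker") {
        this.workers.set(servicePath, worker);
      }
    }
    worker.handled++;
    return worker;
  }
}

const oneshot = new ToyWorkerPool("oneshot");
const perWorker = new ToyWorkerPool("per_worker");
for (let i = 0; i < 200; i++) {
  oneshot.dispatch("test_concurrency");
  perWorker.dispatch("test_concurrency");
}
console.log(oneshot.created, perWorker.created); // prints: 200 1
```

In this model, 200 requests to one service path mean 200 worker creations under oneshot and a single creation under per_worker, which is why a burst of requests hits worker boot limits much sooner with the CLI default.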

If you change the policy, I think you'll probably get a different result.

I was able to reproduce your issue exactly locally on the oneshot policy using your code, but I was also able to confirm that the per_worker policy is not affected by this issue.

Of course, my experience doesn't guarantee that you won't have the same issue with Supabase Edge Functions.

Today, I came across an author on Reddit discussing this same topic, and it seemed that the author was also experiencing these issues with Supabase Edge Functions.

My expectation is that the per_worker policy should handle these cases well, but it looks like it is sometimes unable to forward heavy request traffic to the workers properly and just gives up. (Forgive me, I have very limited visibility into Edge Functions because I am not a member of the Supabase team.)

I have opened PR-382 to better handle this situation, and once this is merged, they will be able to implement more specific request scheduling on top of the per_worker policy, which I believe will mitigate these issues.

I will put this on my watchlist and will let you guys know if there are any updates on this issue in the future.

Have a great day!

nathanaeng commented 6 days ago

Thanks for the detailed response! Yep, I have looked into the per_worker policy and while it might work fine for the simple edge function I provided above, it was failing for a more complex edge function that performs a read, text embedding, and write. I can't recall how many concurrent requests it was able to handle, it might have been a bit more than oneshot but it was still underwhelming unfortunately. Additionally, I was able to replicate this error on my remote DB (Supabase hosted) which makes me think it's not just a local hosting issue. Thanks for helping though!

thurahtetaung commented 4 days ago

Hello @nyannyacha, thanks for your detailed response. As someone who self-hosts edge functions separately (not together with the Supabase Docker Compose setup), where should I change the policies you mentioned? I suspect it is in the main function index.ts, with forceCreate = true or false, but I am not sure, and I am still getting those 502 errors after 30-50 concurrent requests even with forceCreate = false. Can you help me figure out which other configurations in the main function I can tune for better scaling performance? I am running multiple replicas in my K8s deployment, but the replicas still cannot pass the load test because the edge runtime container stops responding to requests and returns 502 with the above error after a few concurrent requests.