pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Return 429 instead of 503 when worker job queue is full #2764

Open alazareva opened 10 months ago

alazareva commented 10 months ago

🚀 The feature

Currently, when using the REST API, the prediction endpoint returns a 503 when the number of concurrent requests exceeds a worker's job queue capacity. It would be great if it returned a 429 instead, so that we know the failure is due to high request load; a sketch of reproducing the current behavior follows.
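For illustration, here is a minimal sketch of how one might observe the current behavior, assuming TorchServe is running locally with a hypothetical model named `my_model` registered and a small `job_queue_size` configured, so that concurrent requests overflow the queue:

```python
# Sketch: observe the status code returned when the worker job queue overflows.
# Assumes a local TorchServe instance with a model named "my_model" registered
# and a small job_queue_size, so concurrent requests exceed queue capacity.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predictions/my_model"  # hypothetical model name

def call(_):
    # Payload format depends on the model's handler; raw bytes shown for illustration.
    return requests.post(URL, data=b"example input", timeout=30).status_code

with ThreadPoolExecutor(max_workers=64) as pool:
    codes = list(pool.map(call, range(200)))

# Today overflow requests surface as 503; this issue proposes returning 429 instead.
print({code: codes.count(code) for code in set(codes)})
```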

Motivation, pitch

We'd like to be able to distinguish errors caused by too many requests from other transient 503s (whether on the server or the service mesh side). Having the server return 429 would allow clients to handle retries differently under high load, as in the sketch below.
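A minimal sketch of the client-side handling this would enable, assuming the server returned 429 on queue overflow; the function name and backoff values are illustrative, not part of any TorchServe API:

```python
# Sketch: retry policy that treats load shedding (429) differently from other
# transient failures (503). Names and backoff values are illustrative only.
import time

import requests

def predict_with_retries(url, payload, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.post(url, data=payload, timeout=30)
        if resp.status_code == 429:
            # Server is overloaded: back off exponentially before retrying.
            time.sleep(2 ** attempt)
        elif resp.status_code == 503:
            # Other transient failure (server or service mesh): short, fixed delay.
            time.sleep(0.5)
        else:
            return resp
    return resp
```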

Alternatives

No response

Additional context

No response

BudhirajaChinmay commented 9 months ago

Hi, I would like to work on this feature!