Closed samos123 closed 8 months ago
I would prefer that we remove this portion of the system test. This is covered under the integration test where we have control over the backend's response time so it is not by nature a race condition.
I think it's important we keep this to ensure scaling back to 0 and then scaling to more than 1 replica works in a real environment with multi threading and python clients creating many openai clients in parallel. It would have prevented an issue and the integration test didn't catch it before.
This will reduce flakiness of the system tests. Ocasionally the system test won't scale to 3 replicas because it's too fast at processing 500 requests.
So instead the deployment is patched to wait for 10 seconds before starting the main container. This causes the queue to be build up large enough and cause it to scale up to 3 replicas.