I'm trying to run exllamav2 in GKE on an A100, but I'm having trouble getting it to warm up properly. I'm currently using the old streaming generator with speculative decoding, with my own modifications, behind my own FastAPI + uvicorn server. I've tried a few different warmup schemes:
I send a prompt of 4096-256 tokens and ask it to generate 256 tokens.
I send a prompt of 1 token and ask it to generate 4000 tokens.
I send prompts whose length increases from 1 to 4001 in steps of 40: 1, 41, 81, and so on.
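For reference, the third warmup scheme could be sketched roughly like this. The `/generate` endpoint name, the payload keys, and the whitespace-repetition trick for approximating prompt length are all illustrative assumptions, not the actual server API:

```python
# Sketch of a warmup sweep: prompt lengths 1, 41, 81, ..., 4001.
# Endpoint and payload shape are assumptions for illustration only.
import json
import urllib.request


def warmup_lengths(start: int = 1, step: int = 40, max_len: int = 4001) -> list[int]:
    """Prompt lengths to cycle through: 1, 41, 81, ..., 4001."""
    return list(range(start, max_len + 1, step))


def warmup(base_url: str) -> None:
    for length in warmup_lengths():
        # "x " repeated is a crude stand-in for a prompt of ~`length` tokens
        payload = json.dumps({"prompt": "x " * length, "max_new_tokens": 8}).encode()
        req = urllib.request.Request(
            f"{base_url}/generate",  # hypothetical endpoint name
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120).read()
```

The idea is to touch every cache-page size the generator might allocate, so no allocation happens on the first real request.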
The problem: on GKE, after this warmup, when I submit a prompt of length 1000 asking for 128 tokens, the first request takes 3-4 seconds and all subsequent ones about 1.5. Then I submit a prompt of length 2000 and the situation repeats: the first generation takes about 7 seconds, subsequent ones about 2. When I run the same thing via docker compose, everything works fine and the first requests after warmup are not slow.
Do you have any suggestions as to what could be causing this?
I'm sorry, I don't really know much about GKE. From what you describe I'd have to assume there's some kind of lazy initialization going on on Google's end?