I'm trying to run exllamav2 in GKE on an A100, but I'm having trouble getting it to warm up properly. I'm currently using the old streaming generator with speculative decoding, with my own modifications, behind my own FastAPI + uvicorn server. I've tried a few different warmup schemes:
I send a prompt of 4096-256 tokens and ask it to generate 256 tokens.
I send a prompt of 1 token and ask it to generate 4000 tokens.
I send prompts whose length increases from 1 to 4001 in steps of 40: 1, 41, 81, and so on.
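For reference, the third warmup scheme could be sketched roughly like this. The `/generate` endpoint name, the payload keys, and the whitespace-repetition trick for approximating prompt length are all illustrative assumptions, not the actual server API:

```python
# Sketch of a warmup sweep: prompt lengths 1, 41, 81, ..., 4001.
# Endpoint and payload shape are assumptions for illustration only.
import json
import urllib.request


def warmup_lengths(start: int = 1, step: int = 40, max_len: int = 4001) -> list[int]:
    """Prompt lengths to cycle through: 1, 41, 81, ..., 4001."""
    return list(range(start, max_len + 1, step))


def warmup(base_url: str) -> None:
    for length in warmup_lengths():
        # "x " repeated is a crude stand-in for a prompt of ~`length` tokens
        payload = json.dumps({"prompt": "x " * length, "max_new_tokens": 8}).encode()
        req = urllib.request.Request(
            f"{base_url}/generate",  # hypothetical endpoint name
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120).read()
```

The idea is to touch every cache-page size the generator might allocate, so no allocation happens on the first real request.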
The problem: on GKE, after this warmup, when I submit a prompt of length 1000 asking for 128 tokens, the first request takes 3-4 seconds and all subsequent ones about 1.5. Then I submit a prompt of length 2000 and the situation repeats: the first generation takes about 7 seconds, subsequent ones about 2. When I run the same thing via docker compose, everything works fine and the first requests after warmup are not slow.
Do you have any suggestions as to what could be causing this?
I'm sorry, I don't really know much about GKE. From what you describe I'd have to assume there's some kind of lazy initialization going on on Google's end?