turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

GKE A100 slow inference #541

Open vt404v2 opened 2 months ago

vt404v2 commented 2 months ago

I'm trying to run exllamav2 on an A100 in GKE, but I'm having trouble getting it to warm up. I'm currently using the old stream generator with speculative decoding, with my own modifications, behind my own FastAPI + uvicorn server. I've tried a few different warmup options, roughly along the lines of the sketch below.
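Simplified sketch of the warmup path (the model paths, sampler settings, and draft-model kwargs are placeholders for my actual setup, which is on the older ExLlamaV2StreamingGenerator API):

```python
# Rough warmup sketch; paths and settings are placeholders, not my real config.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy = True)
    model.load_autosplit(cache)
    return config, model, cache

config, model, cache = load("/models/main")          # placeholder path
_, draft_model, draft_cache = load("/models/draft")  # placeholder path
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model = draft_model,          # kwarg names per the version I'm on
    draft_cache = draft_cache,
    num_speculative_tokens = 5,
)
generator.set_stop_conditions([tokenizer.eos_token_id])

# Warmup: the built-in warmup pass plus one short throwaway generation,
# so the speculative path (and thus the draft model) also runs once.
generator.warmup()
settings = ExLlamaV2Sampler.Settings()
input_ids = tokenizer.encode("warmup")
generator.begin_stream(input_ids, settings)
for _ in range(8):
    chunk, eos, _ = generator.stream()
    if eos:
        break
```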

Do you have any suggestions as to what this problem could be related to?

turboderp commented 2 months ago

I'm sorry, I don't really know much about GKE. From what you describe, I'd have to assume there's some kind of lazy initialization going on on Google's end?
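One way to check that (just a sketch, nothing exllamav2-specific): time a trivial CUDA op right at process start, before loading any model. If the first touch takes many seconds on the GKE node but not on a local machine, the delay is in driver/context setup on the node rather than in the generator:

```python
# Hypothetical probe: measure how long the very first CUDA touch takes.
import time
import torch

t0 = time.time()
torch.cuda.init()                                  # force CUDA context creation
x = torch.randn(4096, 4096, device = "cuda")
torch.cuda.synchronize()
print(f"first CUDA touch: {time.time() - t0:.2f}s")

t0 = time.time()
y = x @ x                                          # a real kernel, now warmed up
torch.cuda.synchronize()
print(f"matmul after init: {time.time() - t0:.2f}s")
```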