avs20 closed this issue 5 months ago
Hi @avs20!
Your engine build is failing because the context sizes are too large to fit in the VRAM of the GPU configuration you are using. You should set the sizes of the input and output contexts very carefully to match your intended workload, because you pay a hefty penalty in VRAM (and hence in cost of inference) to support such long contexts.
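For concreteness, here's a minimal sketch of where those limits typically enter the engine build (the flag names and paths are illustrative and have changed across TensorRT-LLM releases -- newer versions replace `--max_output_len` with `--max_seq_len` -- so check the version pinned in your image):

```python
import subprocess

# Context limits the engine is built to support. Every token of headroom here
# reserves KV-cache VRAM at runtime, so size them to your real workload.
MAX_INPUT_LEN = 2048
MAX_OUTPUT_LEN = 1024
MAX_BATCH_SIZE = 8  # concurrent sequences; multiplies KV-cache usage as well

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "/root/model/ckpt",   # illustrative paths
        "--output_dir", "/root/model/engine",
        f"--max_input_len={MAX_INPUT_LEN}",
        f"--max_output_len={MAX_OUTPUT_LEN}",
        f"--max_batch_size={MAX_BATCH_SIZE}",
    ],
    check=True,
)
```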
If you actually need 15k total input/output context, you're going to need to distribute across multiple GPUs -- my guess is that should fit in 8x80GB A100/H100 for an 8GB model with small enough batch sizes, but you'd have to do the math to be sure. Consider quantizing the weights and KV cache as well.
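To make "do the math" concrete, a rough back-of-envelope KV-cache estimate looks like the sketch below; the dimensions are placeholders for a Llama-7B/8B-class model, so plug in the values from your model's config:

```python
# Assumed KV-cache sizing formula: 2 (K and V) x layers x kv_heads x head_dim
# x sequence length x batch size x bytes per element.
num_layers = 32
num_kv_heads = 32      # fewer if the model uses grouped-query attention
head_dim = 128
bytes_per_elem = 2     # fp16/bf16; 1 with an int8/fp8-quantized KV cache

seq_len = 15_000       # total input + output context
batch_size = 1

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per request")
# ~7.3 GiB per request at these settings, on top of the weights, activations,
# and TensorRT workspace -- and it scales linearly with batch size.
```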
If you want to follow up further, hit me up in the Modal slack (https://modal.com/slack).
Hi, I tried running the TensorRT-LLM code on my local machine. The changes I made were increasing the max tokens for input and output to 7500 and reading the prompts from a txt file.
I also commented out the web API code since I was going for batch inference.
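A minimal sketch of what that kind of batch setup can look like on Modal (placeholder names only, not the actual modified example; older versions of the example use `modal.Stub` instead of `modal.App`):

```python
import modal

app = modal.App("trtllm-batch-sketch")

# Stand-in for the engine-backed generation function in the real example;
# it just echoes so the batch plumbing is runnable on its own.
@app.function()
def generate(prompt: str) -> str:
    return f"echo: {prompt}"

@app.local_entrypoint()
def main(prompts_path: str = "prompts.txt"):
    # One prompt per line in the txt file.
    with open(prompts_path) as f:
        prompts = [line.strip() for line in f if line.strip()]

    # .map() fans the prompts out to the remote function -- batch inference
    # instead of serving a web API.
    for output in generate.map(prompts):
        print(output)
```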
Here is the image ID:
`Building image im-E7GCowSnqMBcKNtQ89JEoo`
Here is my entrypoint function:
Here is the traceback:
Am I doing anything wrong?