Hi, thanks for publishing this example.
With Mixtral + TGI, is it actually required to fit the full model in VRAM? Or is it possible to rely on 100GB+ of system memory and less GPU capacity?
`ml.g5.48xlarge` instances are quite expensive, so I’m looking for options to reduce deployment costs.
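
From what I can tell, TGI loads all weights into GPU memory and doesn't offload to system RAM during serving, so quantization looks like the main lever for a smaller instance. Is something along these lines viable? This is only a sketch of what I have in mind; the container version, quantized checkpoint, and instance type are my guesses, not taken from your example:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Sketch: deploy a 4-bit GPTQ-quantized Mixtral with the TGI container on a
# smaller instance. Checkpoint, container version, and instance type below
# are assumptions, not values from the published example.
role = sagemaker.get_execution_role()

llm_image = get_huggingface_llm_image_uri("huggingface", version="1.3.3")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        # Hypothetical pre-quantized checkpoint (4-bit weights instead of fp16)
        "HF_MODEL_ID": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
        "SM_NUM_GPUS": "4",           # shard across the instance's 4 GPUs
        "HF_MODEL_QUANTIZE": "gptq",  # tell TGI to load GPTQ weights
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    # ml.g5.12xlarge: 4x A10G (96 GB VRAM total) vs. 8x A10G on ml.g5.48xlarge
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,
)
```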