andreapairon opened 1 year ago
I'm using Triton's Python backend to load models from Hugging Face Hub. The on-disk size of the model repository (just config.pbtxt and model.py) is around 12K, so MODEL_MULTIPLIER effectively has to stand in for the average model size, which can vary from 5G to 25G!
This mismatch skews model placement decisions. We really need a better way to estimate, or explicitly set, the model size.
It would be nice to have a new parameter in the InferenceService CRD that lets the user specify the model size in bytes, avoiding the MODEL_MULTIPLIER factor used to estimate it.

Is your feature request related to a problem? If so, please describe.
The heuristic used to calculate the model size (model size on disk * MODEL_MULTIPLIER) is not always accurate: the amount of memory a model actually uses on the GPU can be larger, which can lead to OOM errors. Because of this, the total number of models that can stay loaded on the GPU is not estimated correctly. We have already faced this issue using Triton as the serving runtime.
Describe your proposed solution
A new parameter in the InferenceService CRD that allows the user to specify the model size, avoiding the MODEL_MULTIPLIER factor used to estimate it. A sketch of how this could look is shown below.
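For illustration only, a minimal sketch of what such a field might look like on a v1beta1 InferenceService. The modelSize field is hypothetical (it does not exist in the current CRD), and the model format, name, and storageUri values are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-hf-model
spec:
  predictor:
    model:
      modelFormat:
        name: python            # placeholder: Triton Python backend model
      # Repository holds only config.pbtxt and model.py (~12K on disk);
      # the real weights are pulled from Hugging Face Hub at load time.
      storageUri: s3://models/example-hf-model
      # Hypothetical field proposed by this issue: explicit size in bytes,
      # used for placement instead of (size on disk * MODEL_MULTIPLIER).
      modelSize: "15000000000"  # e.g. ~15G actual in-memory footprint
```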