Open dsgibbons opened 5 months ago
For some additional context, I'm using the Python backend in Triton. An example model that triggers unloading has custom dependencies via conda pack and has a file tree like so:
├── 1/
│ ├── model.py
│ └── model.pkl (approx 100MiB)
├── config.pbtxt
└── conda_env.tar.gz (approx 3GiB)
I'm not sure whether using models in this way messes with how ModelMesh computes usage.
Hi @dsgibbons, please see my reply here: https://github.com/kserve/modelmesh/issues/82#issuecomment-1582028690. It might help as documentation of how ModelMesh decides to load and unload models.
Maybe the `DEFAULT_MODELSIZE` property can help you, especially if most of your models are of the same size.
`DEFAULT_MODELSIZE` is used to estimate the model size when nothing is known about the model type before loading it.
According to the code documentation:
// conservative "default" model size,
// such that "most" models are smaller than this
Since most of our models have the same size, setting it to the correct value eliminated the WARN log you are seeing and helped modelmesh make better model allocation decisions.
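To make the effect of the default size estimate concrete, here is a toy sketch in Python. It is not ModelMesh's actual algorithm: the `ToyCache` class, its LRU-style eviction, and the unit-less sizes are all assumptions made for illustration. It only shows why a default that is smaller than the real model size leads to "under-prediction" events after loading, while a default that matches the real size does not.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCache:
    """Hypothetical size-accounting cache; NOT ModelMesh's real implementation."""
    capacity: int          # total units of memory available
    default_size: int      # size assumed for a model before it is loaded
    sizes: dict = field(default_factory=dict)  # model id -> accounted size, oldest first

    def used(self) -> int:
        return sum(self.sizes.values())

    def load(self, model_id: str, actual_size: int) -> list:
        events = []
        # Reserve the predicted (default) size before loading.
        self.sizes[model_id] = self.default_size
        events += self._evict_until_fits(keep=model_id)
        # After loading, the true size is known and replaces the estimate.
        if actual_size > self.default_size:
            events.append(f"under-predicted {model_id}")
        self.sizes[model_id] = actual_size
        events += self._evict_until_fits(keep=model_id)
        return events

    def _evict_until_fits(self, keep: str) -> list:
        events = []
        while self.used() > self.capacity:
            # Evict the oldest model other than the one being loaded;
            # if none is left, the new model itself does not fit.
            victim = next((m for m in self.sizes if m != keep), keep)
            del self.sizes[victim]
            events.append(f"evicted {victim}")
        return events


def run(default_size: int) -> list:
    """Load three models of true size 4 into a cache of capacity 10."""
    cache = ToyCache(capacity=10, default_size=default_size)
    return [cache.load(name, actual_size=4) for name in ("a", "b", "c")]
```

In this toy model, `run(2)` reports an under-prediction for every load and only evicts after the third model has already over-allocated the cache, whereas `run(4)` evicts up front and never over-allocates.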
Thank you for linking your reply @GolanLevy. I'd still love to see some formal documentation for this, as it seems like critical information that shouldn't require trawling through the issue tracker. I'll see how I go this week. I hope I'll eventually understand ModelMesh well enough to submit a PR to address this issue.
When loading some models, I receive the WARN log:
Memory over-allocation due to under-prediction of model size...
(which stems from here) followed by the INFO log:
Eviction triggered for model ...
(I couldn't find exactly where this comes from). This unloading happens despite it being the only model on a large machine with 64GB RAM, 40GB VRAM and all of the k8s resource limits being set to max.
I've tried to piece together how to avoid this from various GitHub issues (e.g., this one) but would really appreciate some clear documentation around how unloading is triggered in ModelMesh. Even variables such as `MODELSIZE_MULTIPLIER` as referenced by this reply aren't properly documented, and I can't find where they are used in either the `modelmesh` or the `modelmesh-serving` source code.
Could the documentation please be updated to formally describe how models are prioritized and subsequently unloaded, with more discussion around the various configurations that we can alter on a per runtime/per isvc basis? I'm happy to contribute by helping to update the documentation, but I don't fully understand the underlying design decisions.