Open lopuhin opened 3 weeks ago
Hi everyone. As a user of the trtllm backend,
I noticed that a model.py was added to the main branch. Are you going to replace this C++ backend with a Python backend and move the scheduling logic into the trtllm runtime library? Is that why the Python SDK was included in the image beforehand?
I can understand that: it is more flexible and extensible (e.g. to support serving enc-dec models or other model architectures, or to add/adjust wrapper/IO features). But will this C++ backend continue to be maintained? The Python backend is not as robust; I used to serve LLMs with the vLLM backend and kept running into problems. There might also be a performance drop from using Python.
Releasing a smaller image for the C++ backend and a fully packaged image for the Python backend would be a good choice.
Regards.
That is not determined yet. If we decide to deprecate the C++ backend, we will make sure the Python backend has the same performance as the C++ backend.
System Info
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags
Who can help?
No response
Reproduction
Observe the Docker image sizes for trtllm tags on https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags:
nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 is 18.46 GB
nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 is 8.38 GB
The issue still persists in nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3, which is 18.48 GB.
Expected behavior
Docker image size remains around 8 GB as in previous releases
Actual behavior
Docker image size increased to more than 18 GB in 24.04 and remains high in later releases.
Additional notes
Docker image size is important when autoscaling is used, since pulling larger images takes more time.
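To make the autoscaling impact concrete, here is a rough back-of-the-envelope estimate of download time for the two image sizes reported above. The 1 Gbit/s (~125 MB/s) bandwidth figure is an assumption for illustration; real pull times also include decompression and layer extraction, so actual numbers will be higher.

```python
def pull_time_seconds(image_gb: float, bandwidth_mb_s: float = 125.0) -> float:
    """Approximate download time in seconds for an image of the given size.

    bandwidth_mb_s is an assumed sustained download rate (~1 Gbit/s here);
    it is illustrative, not a measured value.
    """
    return image_gb * 1000 / bandwidth_mb_s

# Sizes taken from the NGC catalog tags listed in the reproduction steps.
for tag, size_gb in [("24.03-trtllm-python-py3", 8.38),
                     ("24.04-trtllm-python-py3", 18.46)]:
    print(f"{tag}: {size_gb} GB -> ~{pull_time_seconds(size_gb):.0f} s")
```

Under this assumption, the size jump roughly doubles the time a newly scaled-up node spends pulling the image before it can serve traffic (~67 s vs. ~148 s), before even accounting for extraction.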