
[Question]: llama-index-embeddings-huggingface too big? #16199

bubl-ai commented 2 days ago

Question

I am working inside a Docker container with llama_index. I noticed that my image was significantly larger than expected, so today I took some time to investigate why.
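To reproduce the measurement, something like this inside the container shows which packages dominate (the site-packages path depends on your Python version and base image):

# list the largest installed packages inside the container
du -sh /usr/local/lib/python3.*/site-packages/* | sort -rh | head -20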

My analysis shows that llama-index-embeddings-huggingface is taking up ~12 GB, so I am wondering if that is intended. What could I do to install a slimmer version of it?

dosubot[bot] commented 2 days ago

Yes, it is intended for llama-index-embeddings-huggingface to take up a significant amount of space, potentially around ~12GB, due to the size of the models it uses.

To install a slimmer version inside a Docker container, you can opt for a smaller model. For example, you can use the jinaai/jina-embeddings-v2-small-en model instead of a larger one. Here is how you can specify the smaller model in your setup:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# smaller model
model_name = "jinaai/jina-embeddings-v2-small-en"

# download the model locally
embed_model = HuggingFaceEmbedding(
    model_name=model_name, trust_remote_code=True
)

This should help reduce the overall size of the installation [1].


bubl-ai commented 2 days ago

@dosu that was not my question. My problem is not with loading a model, it is with installing the package itself using pip install.

dosubot[bot] commented 2 days ago

To reduce the size of the llama-index-embeddings-huggingface package when installing it via pip, you can use quantized embedding models optimized for Intel processors. This approach leverages the Optimum-Intel library to load smaller and faster quantized models with minimal accuracy loss.

First, install the necessary dependencies:

pip install optimum[exporters] optimum-intel neural-compressor intel_extension_for_pytorch

Then, you can load and use the quantized models with the IntelEmbedding class:

%pip install llama-index-embeddings-huggingface-optimum-intel

from llama_index.embeddings.huggingface_optimum_intel import IntelEmbedding

# load a quantized (int8) embedding model
embed_model = IntelEmbedding("Intel/bge-small-en-v1.5-rag-int8-static")

embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))  # embedding dimension
print(embeddings[:5])   # first five values

Additionally, for a more selective installation to reduce the overall package size, you can install only the necessary components:

pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface

This approach helps in reducing the package size and improving performance [1][2].


logan-markewich commented 2 days ago

Probably due to the torch dependencies? No way around that

I would use a docker image that has torch installed as the base. Or use multi-stage builds. Or limit yourself to one RUN in your Dockerfile (all common strategies to reduce size).
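For example, a minimal multi-stage sketch (the base image tag and the final CMD are just placeholders for your setup):

# build stage: install everything into a virtualenv in a single RUN
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir llama-index-core llama-index-embeddings-huggingface

# runtime stage: copy only the installed environment, leaving build layers behind
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python", "app.py"]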

bubl-ai commented 2 days ago

@logan-markewich I already tried most of those approaches. The size just explodes anytime I include llama-index-embeddings-huggingface in the equation.

I don't mind the build taking longer... what matters to me is the final size. I am pushing that image to a remote registry and using it there, and the size makes it difficult to work with.

logan-markewich commented 1 day ago

I can't really suggest more than what I've mentioned above -- the only reason the image would be large is installing torch (and its associated CUDA dependencies), which you can mitigate by using a PyTorch+CUDA base image or by using multi-stage builds
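If you only need CPU inference, one more option worth trying (a general PyTorch trick, not specific to llama-index) is to install the CPU-only torch wheel first, so pip never pulls in the CUDA libraries:

# CPU-only torch wheel from PyTorch's CPU index, skipping the CUDA builds
pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu
pip install --no-cache-dir llama-index-embeddings-huggingface

The CPU wheel is a fraction of the size of the CUDA build, which is usually where most of the bulk comes from.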