run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: OptimumEmbedding(BaseEmbedding) cannot switch between CUDA and CPU #15908

Open rushai-dev opened 2 weeks ago

rushai-dev commented 2 weeks ago

Bug Description

OptimumEmbedding._embed never applies the configured device, so the embedding model cannot be switched between CUDA and CPU. See:

https://github.com/run-llama/llama_index/blob/162f5a0523f5a4de33f8cc056ec2b24713d2ee9e/llama-index-integrations/embeddings/llama-index-embeddings-huggingface-optimum/llama_index/embeddings/huggingface_optimum/base.py#L143C9-L143C52

The line

model_output = self._model(**encoded_input)

should be changed to something like

model_output = self._model(**encoded_input).to(self._device)

Version

llama-index-embeddings-huggingface-optimum==0.2.0

Steps to Reproduce

model_output = self._model(**encoded_input).to(self._device)
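
A minimal sketch of how this code path can be exercised, based on the documented OptimumEmbedding usage (the model name and export folder below are illustrative, and it assumes a machine where the device resolves to CUDA):

from llama_index.embeddings.huggingface_optimum import OptimumEmbedding

# One-time export of a Hugging Face model to ONNX (folder name is illustrative)
OptimumEmbedding.create_and_save_optimum_model(
    "BAAI/bge-small-en-v1.5", "./bge_onnx"
)

embed_model = OptimumEmbedding(folder_name="./bge_onnx")

# _embed tokenizes the text but leaves the tensors on the CPU; if the model
# expects CUDA tensors, the forward pass fails with a device-mismatch error.
embedding = embed_model.get_text_embedding("Hello World!")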

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 weeks ago

Hey @rushai-dev! I'm here to assist you with any bugs, questions, or contributions you have. Let's tackle this together while we wait for a human maintainer.

To ensure the forward pass runs on the specified device (e.g., GPU) in the OptimumEmbedding(BaseEmbedding) class, you need to modify the _embed method so that the tokenized inputs are moved to that device. Here is the updated method:

def _embed(self, sentences: List[str]) -> List[List[float]]:
    """Embed sentences."""
    encoded_input = self._tokenizer(
        sentences,
        padding=True,
        max_length=self.max_length,
        truncation=True,
        return_tensors="pt",
    )

    # Move the tokenized inputs to the same device as the model
    encoded_input = {
        key: value.to(self._device) for key, value in encoded_input.items()
    }

    model_output = self._model(**encoded_input)

    if self.pooling == "cls":
        embeddings = self._cls_pooling(model_output)
    else:
        embeddings = self._mean_pooling(
            model_output, encoded_input["attention_mask"]
        )

    if self.normalize:
        import torch

        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

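    # Tensor.tolist() implicitly copies a CUDA tensor back to host memory.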
    return embeddings.tolist()

In this modification, the tensors in the encoded_input dictionary are moved to the specified device with .to(self._device) before the forward pass, so the model and its inputs are on the same device and the model output is computed there as well [1][2].
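
As an aside, the tokenizer call returns a transformers BatchEncoding, which supports .to(device) directly, so the per-tensor dict comprehension could be collapsed into a single call (a sketch of the same fix, not a behavior change):

encoded_input = self._tokenizer(
    sentences,
    padding=True,
    max_length=self.max_length,
    truncation=True,
    return_tensors="pt",
).to(self._device)  # moves input_ids, attention_mask, etc. in one step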

To continue talking to Dosu, mention @dosu.