huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

[cpu][python backend] crash in python backend #344

Closed zhuhaozhe closed 2 months ago

zhuhaozhe commented 2 months ago

System Info

text-embeddings-inference==v1.5.0, python==3.9, running on a CPU device

Reproduction

Follow https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#local-install to build, but with the "-F python" feature flag, and start the service with:

model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
./target/release/text-embeddings-router --model-id $model --revision $revision --port 8080 --dtype float32

Expected behavior

    cpu_results = embedding.view(-1).tolist()
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
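
The failure reproduces in isolation: slicing out the first token of a contiguous (batch, seq_len, hidden) output yields a non-contiguous tensor that .view(-1) rejects. A minimal standalone sketch (not TEI code; shapes taken from the logs below):

    import torch

    # Minimal reproduction of the crash, outside of TEI:
    output = torch.randn(32, 512, 1024)   # (batch, seq_len, hidden_size)
    embedding = output[:, 0]              # (32, 1024), non-contiguous slice
    try:
        embedding.view(-1)                # raises the RuntimeError above
    except RuntimeError as e:
        print(e)
    print(embedding.reshape(-1).shape)    # torch.Size([32768]): reshape copies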

More explanations

I am trying to use the python backend and I hit this error because of these two lines: https://github.com/huggingface/text-embeddings-inference/blob/661a77ffba48f92fccda8c7b7302f6a973570016/backends/python/server/text_embeddings_server/models/default_model.py#L44-L45

I inserted some logs to see the shapes:

        logger.info(f"Python backend: input_ids {kwargs['input_ids'].shape}")
        output = self.model(**kwargs)
        logger.info(f"Python backend: output {output[0].shape}")
        embedding = output[0][:, 0]
        logger.info(f"Python backend: embedding {embedding.shape}")
        cpu_results = embedding.view(-1).tolist()
2024-07-15T08:32:45.772527Z  INFO text_embeddings_router: router/src/lib.rs:257: Warming up model
2024-07-15T08:32:45.778553Z  INFO python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:37: Python backend: input_ids torch.Size([32, 512])
2024-07-15T08:32:50.829333Z  INFO python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:37: Python backend: output torch.Size([32, 512, 1024])
2024-07-15T08:32:50.829401Z  INFO python-backend: text_embeddings_backend_python::logging: backends/python/src/logging.rs:37: Python backend: embedding torch.Size([32, 1024])

I am a newbie to text embeddings. From my understanding, the batch size here is 32 and seq_len is 512. What is the expected output embedding shape? The model output is torch.Size([32, 512, 1024]), and this slice keeps only the first embedding of each sequence. Is that expected? We have 512 tokens per sequence but we only want the first token's embedding?

embedding = output[0][:, 0]

If it is expected, we may just change

cpu_results = embedding.view(-1).tolist()

to use reshape, as suggested in the error message.
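
For instance, a sketch of the one-line fix, keeping the pooling as-is:

    embedding = output[0][:, 0]                   # (batch, hidden_size)
    cpu_results = embedding.reshape(-1).tolist()  # copies if non-contiguous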

If it is not expected, should we use embedding = output[0] instead of embedding = output[0][:, 0], and also correct the following accordingly?

        return [
            Embedding(
                values=cpu_results[i * self.hidden_size : (i + 1) * self.hidden_size]
            )
            for i in range(len(batch))
        ]
zhuhaozhe commented 2 months ago

With this example usage

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'

The input_ids shape is [1, 7], which means batch_size=1, seq_len=7, and hidden_size=1024. Do we expect an output embedding of shape [1024] or [7168] (7 * 1024)?
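
For concreteness, a shape walk-through for this request, assuming the first-token slice is intended (which the reply below confirms):

    import torch

    # Hypothetical shape walk-through for the curl request above:
    hidden = torch.randn(1, 7, 1024)    # output[0]: (batch, seq_len, hidden)
    pooled = hidden[:, 0]               # (1, 1024): one vector per input
    print(pooled.reshape(-1).shape)     # torch.Size([1024]), not 7168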

OlivierDehaene commented 2 months ago

> We have 512 tokens per sequence but we only want the first token's embedding?

Yes, that's called class pooling.

It should be a reshape instead of a view, yes. reshape falls back to copying the data into contiguous memory when a plain view is not possible.
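
For reference, an illustrative contrast between class pooling and mean pooling (helper names are made up, not TEI's API):

    import torch

    # Illustrative pooling helpers for (batch, seq_len, hidden) outputs:
    def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
        # keep only the first ([CLS]) token of each sequence
        return hidden[:, 0]

    def mean_pool(hidden: torch.Tensor) -> torch.Tensor:
        # average over all tokens instead (real code would mask padding)
        return hidden.mean(dim=1)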

zhuhaozhe commented 2 months ago

@OlivierDehaene, thanks!