huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Nulls instead of vector for Alibaba-NLP/gte-multilingual-base on T4 GPU #439


superchar commented 4 days ago

System Info

Model: Alibaba-NLP/gte-multilingual-base
Image: text-embeddings-inference:turing-1.5
Azure VM: Standard_NC4as_T4_v3
GPU: NVIDIA Tesla T4
AKS version: 1.28.14
OS: Ubuntu 22.04
Command:

command: ["text-embeddings-router"]
args:
  [
    "--model-id", "Alibaba-NLP/gte-multilingual-base",
    "--port", "8080",
    "--max-client-batch-size", "2000",
    "--payload-limit", "200000000",
    "--max-batch-tokens", "260000",
    "--revision", "refs/pr/7",
    "--auto-truncate"
  ]

Reproduction

When executing the following request the first time:

POST /v1/embeddings
{
 "input":  "test",
 "model": "Alibaba-NLP/gte-multilingual-base"
}

The response is the following:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                -0.055719655,
                0.06356562,
                -0.030253513
                ......................
            ],
            "index": 0
        }
    ],
    "model": "Alibaba-NLP/gte-multilingual-base",
    "usage": {
        "prompt_tokens": 3,
        "total_tokens": 3
    }
}

However, when I repeat the same request a second time, I get:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [
                null,
                null,
                null
                ......................            
             ],
            "index": 0
        }
    ],
    "model": "Alibaba-NLP/gte-multilingual-base",
    "usage": {
        "prompt_tokens": 3,
        "total_tokens": 3
    }
}
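To make this easy to check, here is a small client sketch (the helper names are mine; it assumes the router is reachable at localhost:8080, matching the deployment above) that sends the same request and flags null components in the returned embedding:

```python
import json
import urllib.request

# Hypothetical endpoint; assumes the router runs locally on port 8080,
# matching the deployment above.
TEI_URL = "http://localhost:8080/v1/embeddings"

def has_nulls(embedding):
    """True if any component of the returned embedding is null/None."""
    return any(v is None for v in embedding)

def embed(text, url=TEI_URL):
    # Same payload as the reproduction request above.
    payload = json.dumps({
        "input": text,
        "model": "Alibaba-NLP/gte-multilingual-base",
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["data"][0]["embedding"]

# Usage against a live deployment:
#   for attempt in (1, 2):
#       print(attempt, has_nulls(embed("test")))
```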

I tried setting USE_FLASH_ATTENTION=False, but it seems this environment variable is ignored for GTE models. I understand that Turing support is marked as experimental, but is there any way to run this model on a T4, with or without Flash Attention v1?
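For what it's worth, one plausible (unconfirmed) cause of the nulls: under float16, intermediate values can overflow to inf and then propagate to NaN, and NaN is typically serialized as null in JSON responses. A toy stdlib sketch of the overflow mechanism (not TEI's actual code path):

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16_overflow(x):
    """Toy model of float16 casting: magnitudes beyond FP16_MAX overflow
    to inf. (Real fp16 also loses precision; only overflow matters here.)"""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x

big = to_fp16_overflow(1e5)   # overflows to inf under fp16
bad = big - big               # inf - inf is NaN
print(math.isinf(big), math.isnan(bad))  # → True True
```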

Expected behavior

The endpoint should return the embedding vector on every request instead of nulls.

kozistr commented 3 days ago

@superchar hi. I guess you can disable flash attention by setting the dtype to float32 instead of float16 in general. However, AFAIK there's currently only a Flash GTE implementation, which doesn't support CPU or GPUs w/o flash attn. Maybe we could implement one.
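For reference, the router exposes a --dtype flag, so forcing float32 in the deployment above would look roughly like this (untested on this T4 setup):

```yaml
command: ["text-embeddings-router"]
args:
  [
    "--model-id", "Alibaba-NLP/gte-multilingual-base",
    "--port", "8080",
    "--max-client-batch-size", "2000",
    "--payload-limit", "200000000",
    "--max-batch-tokens", "260000",
    "--revision", "refs/pr/7",
    "--dtype", "float32",
    "--auto-truncate"
  ]
```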