OpenAIEmbeddings causes CUDA bug

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Working code

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

models = client.models.list()
model = 'BAAI/bge-en-icl'

responses = client.embeddings.create(
    input=[
        "Hello my name is",
        "The best thing about vLLM is that it supports many different models",
        "annual wellness",
        "What is an Annual Wellness Visit? An Annual Wellness Visit (ANNUAL WELLNESS VISIT) is a yearly appointment with your healthcare provider focused on preventive care."
    ],
    model=model,
)
for data in responses.data:
    # print(data.embedding)  # list of float of len 4096
    print(len(data.embedding))

Non-working code (triggers the vLLM index-select assertion for some inputs)

from langchain_openai import OpenAIEmbeddings

# LangChain wrapper pointed at the same local vLLM server.
embeddings = OpenAIEmbeddings(
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="token-abc123",
    model='BAAI/bge-en-icl',
    openai_api_type="openai",
    chunk_size=1,
)
# text = "what is an annual anual visit"
text = "annual wellness"  # this input triggers the error
query_result = embeddings.embed_query(text)
print(len(query_result))

Error Message and Stack Trace (if applicable)

../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR 10-10 22:51:09 engine.py:157] RuntimeError('CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`')
ERROR 10-10 22:51:09 engine.py:157] Traceback (most recent call last):
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-10 22:51:09 engine.py:157]     self.run_engine_loop()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-10 22:51:09 engine.py:157]     request_outputs = self.engine_step()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-10 22:51:09 engine.py:157]     raise e
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-10 22:51:09 engine.py:157]     return self.engine.step()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 10-10 22:51:09 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     output = self.model_runner.execute_model(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 22:51:09 engine.py:157]     return func(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/embedding_model_runner.py", line 115, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama_embedding.py", line 41, in forward
ERROR 10-10 22:51:09 engine.py:157]     return self.model.forward(input_ids, positions, kv_caches,
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 10-10 22:51:09 engine.py:157]     hidden_states, residual = layer(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 10-10 22:51:09 engine.py:157]     hidden_states = self.self_attn(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 178, in forward
ERROR 10-10 22:51:09 engine.py:157]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 367, in forward
ERROR 10-10 22:51:09 engine.py:157]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 135, in apply
ERROR 10-10 22:51:09 engine.py:157]     return F.linear(x, layer.weight, bias)
ERROR 10-10 22:51:09 engine.py:157] RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
CRITICAL 10-10 22:51:09 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     127.0.0.1:33778 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
...
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
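
The repeated `srcIndex < srcSelectDimSize` assertion means an index-select kernel received an index at or beyond the size of the dimension it reads from. One possible cause, which this report does not confirm, is that the request carried token IDs larger than the served model's vocabulary. A minimal sketch of that bounds check in plain Python follows; the vocabulary size and the sample IDs are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence


def out_of_range_ids(token_ids: Sequence[int], vocab_size: int) -> List[int]:
    """Return the IDs that fall outside a lookup table with vocab_size rows."""
    return [t for t in token_ids if t < 0 or t >= vocab_size]


# Hypothetical values for illustration; not taken from the report.
VOCAB_SIZE = 32000
sample_ids = [1, 2559, 31999, 32000, 98765]

bad = out_of_range_ids(sample_ids, VOCAB_SIZE)
print(bad)  # IDs that a table of VOCAB_SIZE rows cannot serve
```

If a check like this flags any ID for the input that fails, the problem is likely in how the request was tokenized rather than in the GPU libraries named later in the trace.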

Description

I am testing embedding creation against a vLLM endpoint through the LangChain embedding wrapper. The non-working code, which uses OpenAIEmbeddings from langchain_openai, triggers a CUDA error on the vLLM side. I believe the bug is in LangChain's OpenAIEmbeddings because the equivalent code using the OpenAI client works against the same server, while the LangChain version fails. In addition, neither quantization nor parallelization is enabled on the vLLM side.

To reproduce the error:

1. Install vLLM and its required packages, then run vllm serve BAAI/bge-en-icl.
2. Run the two scripts above.

The working code runs fine on any text input. The non-working code fails on some token sequences; for example, it fails for the input text "annual wellness".
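
One difference between the two scripts that may be worth ruling out: by default `OpenAIEmbeddings` tokenizes the text on the client side and can send token IDs instead of raw strings, whereas the `openai` client script sends strings. The sketch below assumes that setting `check_embedding_ctx_length=False` makes the wrapper send the raw text, matching the working script. It is a diagnostic step, not a confirmed fix, and the helper names `embeddings_kwargs` and `embed_query_as_text` are hypothetical.

```python
from typing import Any, Dict, List


def embeddings_kwargs(base_url: str, api_key: str, model: str) -> Dict[str, Any]:
    """Build constructor arguments that skip client-side tokenization."""
    return {
        "openai_api_base": base_url,
        "openai_api_key": api_key,
        "model": model,
        # Assumption: with this flag off, the wrapper sends raw strings.
        "check_embedding_ctx_length": False,
    }


# Same endpoint, key, and model as the scripts in this report.
KWARGS = embeddings_kwargs(
    "http://localhost:8000/v1", "token-abc123", "BAAI/bge-en-icl"
)


def embed_query_as_text(text: str) -> List[float]:
    """Embed one query; needs langchain_openai and a running vLLM server."""
    from langchain_openai import OpenAIEmbeddings  # imported lazily on purpose

    return OpenAIEmbeddings(**KWARGS).embed_query(text)
```

With the vLLM server running, calling `embed_query_as_text("annual wellness")` and comparing the result with the working script would show whether client-side tokenization is the relevant difference.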

System Info

System Information

OS: Linux
OS Version: #129~20.04.1-Ubuntu SMP Wed Aug 7 13:07:13 UTC 2024
Python Version: 3.9.20 (main, Oct 3 2024, 07:27:41) [GCC 11.2.0]

Package Information

langchain_core: 0.3.10
langchain: 0.3.3
langchain_community: 0.2.7
langsmith: 0.1.130
langchain_experimental: 0.0.62
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.8
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.7.4
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tiktoken: 0.7.0
typing-extensions: 4.12.2
