run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Nvidia NIM Reranker HTTPError: 400 Client Error #15273

Closed vecorro closed 1 month ago

vecorro commented 2 months ago

Bug Description

I'm trying to use a reranker NIM to replace a HuggingFace model. The NIM service works properly from curl. Example:

$ more test_re-ranker.sh
curl -X "POST" \
  "http://10.237.119.242:8010/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
"model": "nvidia/nv-rerankqa-mistral-4b-v3",
"query": {"text": "which way should i go?"},
"passages": [
{"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
{"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
{"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
{"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
],
"truncate": "END"
}'

$ bash test_re-ranker.sh
{"rankings":[{"index":0,"logit":0.7646484375},{"index":3,"logit":-1.1044921875},{"index":2,"logit":-2.71875},{"index":1,"logit":-5.09765625}]}

When I call the same endpoint through LlamaIndex's reranker, I get HTTPError: 400 Client Error: Bad Request for url: http://10.237.119.242:8010/v1/ranking
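To isolate the problem, the working curl body can be replayed from Python and the 400 response body inspected directly (requests' raise_for_status() discards the error body, which usually names the offending field or limit). A small sketch; build_ranking_payload and post_ranking are illustrative helpers, not part of LlamaIndex:

```python
import json
import urllib.error
import urllib.request

def build_ranking_payload(model, query, passages, truncate="END"):
    """Mirror the JSON body of the working curl request above."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
        "truncate": truncate,
    }

def post_ranking(url, payload):
    """POST the payload and return (status, body) even on a 4xx response,
    so the server's explanation of the 400 is not lost."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as e:  # a 400 lands here; e.read() has the body
        return e.code, e.read().decode()

payload = build_ranking_payload(
    "nvidia/nv-rerankqa-mistral-4b-v3",
    "which way should i go?",
    ["two roads diverged in a yellow wood, and sorry i could not travel both..."],
)
# Network call against the NIM from the report (commented out here):
# status, body = post_ranking("http://10.237.119.242:8010/v1/ranking", payload)
```

Comparing the body this builds against what the library sends is often enough to spot the missing or extra field.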

Version

llama-index 0.10.58 / llama-index-postprocessor-nvidia-rerank 0.1.5

Steps to Reproduce

db_cfg = config.postgresql
connection_string = (f"postgresql://{db_cfg.user}:"
                     f"{db_cfg.password}@{db_cfg.db_host}:{db_cfg.port}/{db_cfg.default_db}")
url = make_url(connection_string)

# Open the connection to the Vector Store
vector_store = PGVectorStore.from_params(
    database=url.database,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=db_cfg.tables.std_rag,
    embed_dim=EMBEDDING_SIZE, # embedding model dimension
    cache_ok=True,
    hybrid_search=True,
)

# Initialize the index object
index = VectorStoreIndex.from_vector_store(vector_store)

# Initialize the re-ranker of retrieved chunks
rerank_cfg = config.ml_models.re_ranker
re_ranker = NVIDIARerank(
        model=rerank_cfg.model,
        base_url=rerank_cfg.api_base,
        top_n=rerank_cfg.rerank_top_n,
)

# Set the index as query engine
h_search = ast.literal_eval(db_cfg.pgvector.pgvector_kwargs)
query_engine = index.as_query_engine(
    similarity_top_k=db_cfg.pgvector.sim_top_k,
    node_postprocessors=[re_ranker],
    vector_store_kwargs=h_search,
    response_mode=config.llama_index.std_rag.response_mode
)

test_set_cfg = config.data_files.test_sets
active_set = eval(f"test_set_cfg.{test_set_cfg.active_set}.current")
test_set_dir = test_set_cfg.base_path + active_set
test_set = os.listdir(test_set_dir)[0]
test_set = test_set_dir + test_set

test_set_df = pd.read_csv(
        filepath_or_buffer=test_set,
        usecols=['query', 'reference_answer'],
        nrows=test_set_cfg.sample_size
)

# Run inference on the Q/A pairs and keep the responses
# for future comparison vs.other RAG approaches.
responses = generate_responses_dict(
    query_engine=query_engine,
    test_set_df=test_set_df
)

Relevant Logs/Tracebacks

HTTPError                                 Traceback (most recent call last)
File <timed exec>:16

File ~/DataspellProjects/VMware-generative-ai-reference-architecture/Starter-Packs/Improved_RAG/05-RAG_Variants/01-Simple_Retrieval/../../utils/helpers.py:112, in generate_responses_dict(query_engine, test_set_df)
    101 def generate_responses_dict(query_engine, test_set_df):
    102     """
    103     Returns a dictionary containing the responses generated by 'query_engine'
    104     to the queries provided by the test_set_df dataframe. The dictionary also
   (...)
    110              evaluation frameworks like DeepEval
    111     """
--> 112     responses = [query_engine.query(q)
    113                  for q in tqdm(test_set_df['query'].to_list())]
    114     answers = []
    115     contexts = []

File ~/DataspellProjects/VMware-generative-ai-reference-architecture/Starter-Packs/Improved_RAG/05-RAG_Variants/01-Simple_Retrieval/../../utils/helpers.py:112, in <listcomp>(.0)
    101 def generate_responses_dict(query_engine, test_set_df):
    102     """
    103     Returns a dictionary containing the responses generated by 'query_engine'
    104     to the queries provided by the test_set_df dataframe. The dictionary also
   (...)
    110              evaluation frameworks like DeepEval
    111     """
--> 112     responses = [query_engine.query(q)
    113                  for q in tqdm(test_set_df['query'].to_list())]
    114     answers = []
    115     contexts = []

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/base/base_query_engine.py:52, in BaseQueryEngine.query(self, str_or_query_bundle)
     50     if isinstance(str_or_query_bundle, str):
     51         str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 52     query_result = self._query(str_or_query_bundle)
     53 dispatcher.event(
     54     QueryEndEvent(query=str_or_query_bundle, response=query_result)
     55 )
     56 return query_result

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py:189, in RetrieverQueryEngine._query(self, query_bundle)
    185 """Answer a query."""
    186 with self.callback_manager.event(
    187     CBEventType.QUERY, payload={EventPayload.QUERY_STR: query_bundle.query_str}
    188 ) as query_event:
--> 189     nodes = self.retrieve(query_bundle)
    190     response = self._response_synthesizer.synthesize(
    191         query=query_bundle,
    192         nodes=nodes,
    193     )
    194     query_event.on_end(payload={EventPayload.RESPONSE: response})

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py:145, in RetrieverQueryEngine.retrieve(self, query_bundle)
    143 def retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    144     nodes = self._retriever.retrieve(query_bundle)
--> 145     return self._apply_node_postprocessors(nodes, query_bundle=query_bundle)

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py:138, in RetrieverQueryEngine._apply_node_postprocessors(self, nodes, query_bundle)
    134 def _apply_node_postprocessors(
    135     self, nodes: List[NodeWithScore], query_bundle: QueryBundle
    136 ) -> List[NodeWithScore]:
    137     for node_postprocessor in self._node_postprocessors:
--> 138         nodes = node_postprocessor.postprocess_nodes(
    139             nodes, query_bundle=query_bundle
    140         )
    141     return nodes

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/postprocessor/types.py:56, in BaseNodePostprocessor.postprocess_nodes(self, nodes, query_bundle, query_str)
     54 else:
     55     pass
---> 56 return self._postprocess_nodes(nodes, query_bundle)

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:230, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    226 self.span_enter(
    227     id_=id_, bound_args=bound_args, instance=instance, parent_id=parent_id
    228 )
    229 try:
--> 230     result = func(*args, **kwargs)
    231 except BaseException as e:
    232     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/llama_index/postprocessor/nvidia_rerank/base.py:234, in NVIDIARerank._postprocess_nodes(self, nodes, query_bundle)
    232     url += "/ranking"
    233 response = session.post(url, headers=_headers, json=payloads)
--> 234 response.raise_for_status()
    235 # expected response format:
    236 # {
    237 #     "rankings": [
   (...)
    243 #     ]
    244 # }
    245 assert (
    246     "rankings" in response.json()
    247 ), "Response does not contain expected 'rankings' key"

File ~/miniconda3/envs/impv_rag/lib/python3.11/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 400 Client Error: Bad Request for url: http://10.237.119.242:8010/v1/ranking
dosubot[bot] commented 2 months ago

To resolve the HTTPError: 400 Client Error: Bad Request for url: http://10.237.119.242:8010/v1/ranking issue when using the Nvidia NIM Reranker in your Python code, ensure that the URL and payload format match the expected format for the on-premises NIM service.

Here is a checklist and example code to help you troubleshoot and resolve the issue:

  1. Check the URL Format: Ensure the URL is correctly formatted and ends with /v1/ranking.
  2. Payload Structure: Verify that the payload structure matches the expected format.
  3. Headers: Ensure the headers are correctly set, especially the Authorization header if required.

Here is an example of how you can structure your request:

import requests

class NVIDIARerank:
    def __init__(self, base_url, api_key=None):
        self._base_url = base_url
        self._api_key = api_key

    def rerank(self, query_str, nodes):
        session = requests.Session()
        headers = {
            "Authorization": f"Bearer {self._api_key}",
            "Accept": "application/json",
        }
        payload = {
            "model": "nv-rerank-qa-mistral-4b:1",
            "query": {"text": query_str},
            "passages": [{"text": node} for node in nodes],
        }
        url = f"{self._base_url}/ranking"
        response = session.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()

# Example usage
base_url = "http://10.237.119.242:8010/v1"
api_key = "your_api_key"  # If required
reranker = NVIDIARerank(base_url, api_key)
query_str = "which way should i go?"
nodes = [
    "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;",
    "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,",
    "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back.",
    "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."
]

try:
    result = reranker.rerank(query_str, nodes)
    print(result)
except requests.HTTPError as e:
    print(f"HTTP error occurred: {e}")

If the service works correctly with a curl command, compare the curl request with the Python request to ensure they are identical in terms of URL, headers, and payload.

Additionally, you can refer to the NVIDIA NIMs documentation for more details on setting up and troubleshooting local microservice instances. This documentation provides comprehensive guidance on working with NVIDIA NIMs, both hosted and local instances [1][2].

To continue talking to Dosu, mention @dosu.

vecorro commented 2 months ago

The documentation dosubot is recommending is wrong; for instance, this can't be right:

from llama_index.llms.nvidia import NVIDIA

# connect to a rerank NIM running at localhost:1976
reranker = NVIDIARerank(base_url="http://localhost:1976/v1")

The LlamaIndex documentation about the NIM classes needs revision, and perhaps the implementation of that module also requires corrections.

dosubot[bot] commented 2 months ago

To address the issues with the Nvidia NIM Reranker and ensure the documentation and implementation are correct, let's focus on the key points:

Correct Usage of NVIDIARerank Class

  1. Initialization: Ensure you initialize the NVIDIARerank class with the correct parameters, including the API key and base URL if needed.

    from llama_index.postprocessor.nvidia_rerank import NVIDIARerank
    
    reranker = NVIDIARerank(
       model="nv-rerank-qa-mistral-4b:1",
       nvidia_api_key="your_api_key",
       base_url="http://10.237.119.242:8010/v1",
       top_n=5,
       max_batch_size=64
    )
  2. Payload Structure: The payload structure expected by the Nvidia NIM Reranker API should be as follows:

    {
       "model": "nv-rerank-qa-mistral-4b:1",
       "query": {
           "text": "which way should i go?"
       },
       "passages": [
           {
               "text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"
           },
           {
               "text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"
           },
           {
               "text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."
           },
           {
               "text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."
           }
       ]
    }
  3. Example Code: Here is an example of how to use the NVIDIARerank class correctly:

    import requests
    
    class NVIDIARerank:
       def __init__(self, base_url, api_key=None):
           self._base_url = base_url
           self._api_key = api_key
    
       def rerank(self, query_str, nodes):
           session = requests.Session()
           headers = {
               "Authorization": f"Bearer {self._api_key}",
               "Accept": "application/json",
           }
           payload = {
               "model": "nv-rerank-qa-mistral-4b:1",
               "query": {"text": query_str},
               "passages": [{"text": node} for node in nodes],
           }
           url = f"{self._base_url}/ranking"
           response = session.post(url, headers=headers, json=payload)
           response.raise_for_status()
           return response.json()
    
    # Example usage
    base_url = "http://10.237.119.242:8010/v1"
    api_key = "your_api_key"  # If required
    reranker = NVIDIARerank(base_url, api_key)
    query_str = "which way should i go?"
    nodes = [
       "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;",
       "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,",
       "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back.",
       "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."
    ]
    
    try:
       result = reranker.rerank(query_str, nodes)
       print(result)
    except requests.HTTPError as e:
       print(f"HTTP error occurred: {e}")

Documentation and Implementation

The NVIDIARerank class should be well-documented, including detailed descriptions of its parameters and methods. The implementation should include validation and error handling to ensure correct usage.

Key Points to Check:

  1. API Key: Ensure that the API key is correctly set and passed in the headers.
  2. Payload Format: Verify that the payload structure matches the API's expected format.
  3. URL: Ensure the URL is correct and matches the API endpoint.

By following these guidelines and ensuring the correct setup, you should be able to resolve the HTTPError: 400 Client Error: Bad Request issue and use the NVIDIARerank class effectively [1][2][3].

It looks like we need your specialized knowledge to address the documentation concerns and the implementation details, @logan-markewich.

To continue talking to Dosu, mention @dosu.

vecorro commented 2 months ago

There must be something wrong with either the NIM client implementation or the documentation. I have double-checked my code and everything looks the way @dosu indicates:

re_ranker = NVIDIARerank(
        model="nvidia/nv-rerankqa-mistral-4b-v3",  # rerank_cfg.model
        base_url="http://10.237.119.242:8010/v1",  # rerank_cfg.api_base
        top_n=4,  # rerank_cfg.rerank_top_n
)

and I still get the same error.

dosubot[bot] commented 2 months ago

The HTTPError: 400 Client Error: Bad Request issue could be due to several factors. Here are some specific points to check in the NVIDIARerank implementation and usage:

  1. API Key: Ensure the API key is correctly set and included in the headers. If the API key is missing or incorrect, it can cause a 400 error.

  2. URL Path: Verify that the URL path is correct. For on-premises NIM, the path should be /v1/ranking.

  3. Payload Structure: Ensure the payload is correctly structured. The payload should include the model, query, and passages.

  4. Headers: Ensure the headers include the correct authorization token.

Here is a checklist and example code to help you troubleshoot:

Checklist

  1. API Key:

    self._api_key = get_from_param_or_env(
       "api_key",
       nvidia_api_key or api_key,
       "NVIDIA_API_KEY",
       "NO_API_KEY_PROVIDED",
    )
  2. URL Path:

    url = self._base_url
    if self._is_hosted:
       url += "/retrieval/nvidia/reranking"
    else:
       url += "/ranking"
  3. Payload Structure:

    payloads = {
       "model": self.model,
       "query": {"text": query_bundle.query_str},
       "passages": [
           {"text": n.get_content(metadata_mode=MetadataMode.EMBED)}
           for n in batch
       ],
    }
  4. Headers:

    _headers = {
       "Authorization": f"Bearer {self._api_key}",
       "Accept": "application/json",
    }
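The endpoint selection in item 2 above can be exercised in isolation; a minimal sketch of that branch as a standalone function (the hosted base URL below is illustrative):

```python
def build_ranking_url(base_url, is_hosted):
    """Reproduce the hosted vs. on-premises endpoint selection
    discussed in this thread (illustrative, not the library itself)."""
    url = base_url
    if is_hosted:
        if url.endswith("/v1"):
            url += "/retrieval/nvidia/reranking"
    else:
        url += "/ranking"
    return url

print(build_ranking_url("http://10.237.119.242:8010/v1", is_hosted=False))
# http://10.237.119.242:8010/v1/ranking
```

For the on-premises NIM in this report, the resulting URL matches the one curl hits successfully, which suggests the path itself is not the cause of the 400.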

Example Code

Here is an example of how to use the NVIDIARerank class correctly:

import requests

class NVIDIARerank:
    def __init__(self, model, base_url, nvidia_api_key=None, top_n=4):
        self.model = model
        self._base_url = base_url
        self._api_key = nvidia_api_key
        self.top_n = top_n

    def rerank(self, query_str, nodes):
        session = requests.Session()
        headers = {
            "Authorization": f"Bearer {self._api_key}",
            "Accept": "application/json",
        }
        payload = {
            "model": self.model,
            "query": {"text": query_str},
            "passages": [{"text": node} for node in nodes],
        }
        url = f"{self._base_url}/ranking"
        response = session.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()

# Example usage
base_url = "http://10.237.119.242:8010/v1"
api_key = "your_api_key"  # If required
reranker = NVIDIARerank(
    model="nvidia/nv-rerankqa-mistral-4b-v3",
    base_url=base_url,
    nvidia_api_key=api_key,
    top_n=4
)
query_str = "which way should i go?"
nodes = [
    "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;",
    "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,",
    "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back.",
    "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."
]

try:
    result = reranker.rerank(query_str, nodes)
    print(result)
except requests.HTTPError as e:
    print(f"HTTP error occurred: {e}")

If the issue persists, double-check the payload and headers against the working curl command to ensure they are identical [1].
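One quick way to make that comparison is to diff the two JSON bodies programmatically; a small illustrative helper (not part of any library):

```python
def diff_payloads(a, b, path=""):
    """Return a list of key paths where two JSON-like dicts differ."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        if key not in a:
            diffs.append(f"{here}: only in second payload")
        elif key not in b:
            diffs.append(f"{here}: only in first payload")
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs.extend(diff_payloads(a[key], b[key], here))
        elif a[key] != b[key]:
            diffs.append(f"{here}: {a[key]!r} != {b[key]!r}")
    return diffs

# Body sent by the working curl script vs. one a client might build:
curl_body = {"model": "nvidia/nv-rerankqa-mistral-4b-v3",
             "query": {"text": "which way should i go?"},
             "truncate": "END"}
client_body = {"model": "nvidia/nv-rerankqa-mistral-4b-v3",
               "query": {"text": "which way should i go?"}}
print(diff_payloads(curl_body, client_body))
# ['truncate: only in first payload']
```

Any non-empty diff points at a field worth investigating first.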

To continue talking to Dosu, mention @dosu.

vecorro commented 2 months ago

Hi, @logan-markewich. I reproduced the following tutorial from Nvidia: https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/playbook.html. My local NIM setup works properly when using the LangChain re-ranker, so there must be something wrong with LlamaIndex's NVIDIARerank (from llama_index.postprocessor.nvidia_rerank).

Thanks

Enrique

vecorro commented 2 months ago

Hi,

FYI @logan-markewich and @dosu. I watched the traces from the re-ranker NIM and this is the error:

2024-08-15T18:21:40Z ERROR: root - Uncaught InferenceServerException: [StatusCode.INTERNAL] in ensemble 'nvidia_nv_rerankqa_mistral_4b_v3', Input length 1024 exceeds maximum allowed token size 512; Request: <starlette.requests.Request object at 0x7f646014d180>

The solution would be to let the NVIDIARerank initialization pass the truncate flag to the re-ranker, just as the NVIDIAEmbedding module already allows.

I tried this fix in NVIDIARerank's base.py module. I made a couple of additions (commented as vecorro):

1) Added an extra header as indicated in Nvidia's documentation:

_headers = {
    "Authorization": f"Bearer {self._api_key}",
    "Accept": "application/json",
    "Content-Type": "application/json",  # vecorro
}

2) Added the truncate value at the end of the payload:

payloads = {
                    "model": self.model,
                    "query": {"text": query_bundle.query_str},
                    "passages": [
                        {"text": n.get_content(metadata_mode=MetadataMode.EMBED)}
                        for n in batch
                    ],
                    "truncate": "END", # vecorro
                }

Given the small token limit supported by most Nvidia models shipped with embedding and re-ranking NIMs (512 tokens), I think it would be good for LlamaIndex's integration modules to define {"truncate": "END"} as the default so things work correctly.

I'm part of the VMware/Broadcom team working with customers to implement our Private AI solutions with Nvidia, and it would help those customers fond of LlamaIndex get this fix implemented. Thanks
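Until a fix like the one above ships, a rough client-side guard can cap passage length before reranking. A sketch; truncate_passages is an illustrative helper, and whitespace words are only a crude, conservative proxy for the model tokens the 512 limit actually counts:

```python
def truncate_passages(passages, max_words=300):
    """Client-side workaround: cap passage length before sending to the
    reranker NIM. NOTE: the server enforces its limit in model tokens, so a
    whitespace-word cap is only an approximation; the proper fix is the
    server-side truncate flag discussed above."""
    capped = []
    for text in passages:
        words = text.split()
        capped.append(" ".join(words[:max_words]))
    return capped

long_passage = ("word " * 1000).strip()
short = truncate_passages([long_passage], max_words=10)[0]
print(short)  # 'word' repeated 10 times
```

Short passages pass through unchanged, so the guard is safe to apply to every batch.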

dosubot[bot] commented 2 months ago

To address the issue with the NVIDIARerank class and ensure it handles the truncate flag correctly, you can update the class to include the truncate parameter and pass it to the re-ranker. Here is how you can modify the NVIDIARerank class:

  1. Add the truncate field to the NVIDIARerank class.
  2. Update the methods to pass the truncate flag to the re-ranker.

Here is the updated code for the NVIDIARerank class:

class NVIDIARerank(BaseNodePostprocessor):
    """NVIDIA's API Catalog Reranker Connector."""

    class Config:
        validate_assignment = True

    model: Optional[str] = Field(
        default=DEFAULT_MODEL,
        description="The NVIDIA API Catalog reranker to use.",
    )
    top_n: Optional[int] = Field(
        default=5,
        ge=0,
        description="The number of nodes to return.",
    )
    max_batch_size: Optional[int] = Field(
        default=64,
        ge=1,
        description="The maximum batch size supported by the inference server.",
    )
    truncate: Literal["NONE", "START", "END"] = Field(
        default="NONE",
        description=(
            "Truncate input text if it exceeds the model's maximum token length. "
            "Default is 'NONE', which raises an error if an input is too long."
        ),
    )
    _api_key: str = PrivateAttr("NO_API_KEY_PROVIDED")  # TODO: should be SecretStr
    _mode: str = PrivateAttr("nvidia")
    _is_hosted: bool = PrivateAttr(True)
    _base_url: str = PrivateAttr(BASE_URL)

    def _set_api_key(self, nvidia_api_key: str = None, api_key: str = None) -> None:
        self._api_key = get_from_param_or_env(
            "api_key",
            nvidia_api_key or api_key,
            "NVIDIA_API_KEY",
            "NO_API_KEY_PROVIDED",
        )

    def __init__(
        self,
        model: str = DEFAULT_MODEL,
        nvidia_api_key: Optional[str] = None,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None,
        **kwargs: Any,
    ):
        """
        Initialize a NVIDIARerank instance.

        This class provides access to a NVIDIA NIM for reranking. By default, it connects to a hosted NIM, but can be configured to connect to an on-premises NIM using the `base_url` parameter. An API key is required for hosted NIM.

        Args:
            model (str): The model to use for reranking.
            nvidia_api_key (str, optional): The NVIDIA API key. Defaults to None.
            api_key (str, optional): The API key. Defaults to None.
            base_url (str, optional): The base URL of the on-premises NIM. Defaults to None.
            **kwargs: Additional keyword arguments.
        """
        super().__init__(model=model, **kwargs)

        if base_url is None or base_url in MODEL_ENDPOINT_MAP.values():
            base_url = MODEL_ENDPOINT_MAP.get(model, BASE_URL)
        else:
            base_url = self._validate_url(base_url)

        self._api_key = get_from_param_or_env(
            "api_key",
            nvidia_api_key or api_key,
            "NVIDIA_API_KEY",
            "NO_API_KEY_PROVIDED",
        )

        self._base_url = base_url  # store the validated URL before using it
        self._is_hosted = self._base_url in KNOWN_URLS

        if self._is_hosted and self._api_key == "NO_API_KEY_PROVIDED":
            warnings.warn(
                "An API key is required for hosted NIM. This will become an error in 0.2.0."
            )

    def _validate_url(self, base_url):
        """
        Base URL Validation.
        ValueError : url which do not have valid scheme and netloc.
        Warning : v1/rankings routes.
        ValueError : Any other routes other than above.
        """
        expected_format = "Expected format is 'http://host:port'."
        result = urlparse(base_url)
        if not (result.scheme and result.netloc):
            raise ValueError(
                f"Invalid base_url, Expected format is 'http://host:port': {base_url}"
            )
        if result.path:
            normalized_path = result.path.strip("/")
            if normalized_path == "v1":
                pass
            elif normalized_path == "v1/rankings":
                warnings.warn(f"{expected_format} Rest is Ignored.")
            else:
                raise ValueError(f"Base URL path is not recognized. {expected_format}")
        return urlunparse((result.scheme, result.netloc, "v1", "", "", ""))

    @property
    def available_models(self) -> List[Model]:
        """Get available models."""
        # all available models are in the map
        ids = MODEL_ENDPOINT_MAP.keys()
        return [Model(id=id) for id in ids]

    def rerank(self, query: str, documents: List[str]) -> List[float]:
        """Re-rank documents based on the query."""
        session = requests.Session()
        _headers = {
            "Authorization": f"Bearer {self._api_key}",
            "Accept": "application/json",
            'Content-Type': 'application/json',  # vecorro
        }
        payloads = {
            "model": self.model,
            "query": {"text": query},
            "passages": [{"text": doc} for doc in documents],
            "truncate": self.truncate,  # vecorro
        }
        url = self._base_url
        if self._is_hosted:
            if url.endswith("/v1"):
                url += "/retrieval/nvidia/reranking"
        else:
            url += "/ranking"
        response = session.post(url, headers=_headers, json=payloads)
        response.raise_for_status()
        return [result["logit"] for result in response.json()["rankings"]]

    async def arerank(self, query: str, documents: List[str]) -> List[float]:
        """Asynchronously re-rank documents based on the query."""
        async with aiohttp.ClientSession() as session:
            _headers = {
                "Authorization": f"Bearer {self._api_key}",
                "Accept": "application/json",
                'Content-Type': 'application/json',  # vecorro
            }
            payloads = {
                "model": self.model,
                "query": {"text": query},
                "passages": [{"text": doc} for doc in documents],
                "truncate": self.truncate,  # vecorro
            }
            url = self._base_url
            if self._is_hosted:
                if url.endswith("/v1"):
                    url += "/retrieval/nvidia/reranking"
            else:
                url += "/ranking"
            async with session.post(url, headers=_headers, json=payloads) as response:
                response.raise_for_status()
                data = await response.json()
                return [result["logit"] for result in data["rankings"]]

This code adds the truncate field to the NVIDIARerank class and updates the rerank and arerank methods to pass the truncate flag to the re-ranker. This ensures that the input length exceeding the maximum allowed token size is handled appropriately, similar to the NVIDIAEmbedding module [1][2].

A maintainer has already been looped in to address the documentation and implementation concerns.

To continue talking to Dosu, mention @dosu.

vecorro commented 1 month ago

The bug was fixed in llama-index-postprocessor-nvidia-rerank==0.3.1