run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: `QueryPipeline` only using the `query_str` of the query transform output `QueryBundle` #14044

Open theta-lin opened 3 weeks ago

theta-lin commented 3 weeks ago

Bug Description

When a query transform is used as a component of a QueryPipeline, it effectively does nothing, because the output of such a component is simply the original query_str.

Specifically, I was using HyDEQueryTransform as part of a QueryPipeline. According to https://github.com/run-llama/llama_index/blob/e4ff32cdedd687c361ec084f0a05859b27318708/llama-index-core/llama_index/core/indices/query/query_transform/base.py#L152-L163, its output should be a QueryBundle with custom embedding strings.

According to https://docs.llamaindex.ai/en/latest/module_guides/querying/pipeline/module_usage/#query-transforms, the output of a query transform in a query pipeline is indeed query_str, but this design drops the custom embedding strings attached to the output QueryBundle.
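To make the loss concrete, here is a minimal sketch that does not use LlamaIndex itself. `BundleSketch`, `hyde_like_transform`, and `component_output` are hypothetical stand-ins written for illustration, and the LLM call is replaced by a fixed string. It assumes that a bundle without custom embedding strings falls back to embedding its query_str.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BundleSketch:
    """Hypothetical stand-in for QueryBundle (not the library class)."""

    query_str: str
    custom_embedding_strs: Optional[List[str]] = None

    @property
    def embedding_strs(self) -> List[str]:
        # Assumption: with no custom strings attached, the query string is embedded.
        if self.custom_embedding_strs is None:
            return [self.query_str]
        return self.custom_embedding_strs


def hyde_like_transform(query_str: str) -> BundleSketch:
    # Stand-in for the LLM call; a real HyDE transform would generate this text.
    hypothetical_doc = f"A hypothetical answer to: {query_str}"
    return BundleSketch(query_str=query_str, custom_embedding_strs=[hypothetical_doc])


def component_output(bundle: BundleSketch) -> dict:
    # Mirrors the behavior reported in this issue: only query_str is forwarded.
    return {"query_str": bundle.query_str}


transformed = hyde_like_transform("the input query string")
forwarded = BundleSketch(**component_output(transformed))

print(transformed.embedding_strs)  # ['A hypothetical answer to: the input query string']
print(forwarded.embedding_strs)    # ['the input query string']
```

In this sketch the downstream step ends up embedding the original query, as if the transform had never run.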

Version

0.10.38

Steps to Reproduce

To reproduce with HyDEQueryTransform, run the following script with an LLM configured:

from llama_index.core.query_pipeline import QueryPipeline, InputComponent
from llama_index.core.indices.query.query_transform import HyDEQueryTransform

pipeline = QueryPipeline(verbose=True)
pipeline.add_modules(
    {
        "input": InputComponent(),
        # you need to pass in an LLM or configure one in the global settings
        "hyde": HyDEQueryTransform(include_original=False),
    }
)
pipeline.add_link("input", "hyde")

output = pipeline.run(input="the input query string")
print(output)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 weeks ago

The QueryPipeline only uses the query_str of the query transform output QueryBundle because the QueryTransformComponent is designed to return only the query_str from the transformed query bundle. This is evident in the _run_component method of the QueryTransformComponent class, which explicitly returns a dictionary containing only the query_str:

def _run_component(self, **kwargs: Any) -> Any:
    """Run component."""
    output = self._query_transform.run(
        kwargs["query_str"],
        metadata=kwargs["metadata"],
    )
    return {"query_str": output.query_str}

This method does not include the custom_embedding_strs or any other fields from the QueryBundle, effectively ignoring them. This design choice ensures that only the transformed query string is passed along the pipeline, simplifying the handling of query transformations [1][2].

Additionally, in the context of using HyDEQueryTransform, the transformed query string is used for embedding lookup and summarization. The HyDEQueryTransform generates a hypothetical document to improve the embedding quality, but the pipeline still primarily relies on the query_str for the final output. This approach helps in maintaining consistency and simplicity in the query processing pipeline [3].


ootkin commented 3 weeks ago

@dosu can you suggest a way to use HyDE as a query pipeline step?

theta-lin commented 3 weeks ago

@ootkin Currently, as a workaround, I wrap my retriever in a TransformRetriever and then use that transform retriever in the pipeline.

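from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.retrievers import TransformRetriever

# `retriever` is the retriever you already have; it gets wrapped here
retriever = TransformRetriever(
    retriever=retriever,
    query_transform=HyDEQueryTransform(),
)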

If you want to apply HyDE in front of a query engine rather than a retriever, you can try TransformQueryEngine instead.
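My understanding of why the wrapper approach works is that the transform's whole output bundle is handed to the inner retriever, instead of being reduced to query_str at a component boundary. Below is a rough sketch of that pattern in plain Python. All class and function names are hypothetical stand-ins, not the LlamaIndex implementation, and the LLM call is faked with a fixed string.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BundleSketch:
    """Hypothetical stand-in for QueryBundle (not the library class)."""

    query_str: str
    custom_embedding_strs: Optional[List[str]] = None


class RecordingRetriever:
    """Hypothetical inner retriever that records the bundle it receives."""

    def __init__(self) -> None:
        self.seen: List[BundleSketch] = []

    def retrieve(self, bundle: BundleSketch) -> List[str]:
        self.seen.append(bundle)
        return []  # no real index in this sketch


class TransformWrapperSketch:
    """Applies a transform, then passes the whole bundle to the inner retriever."""

    def __init__(
        self,
        retriever: RecordingRetriever,
        transform: Callable[[BundleSketch], BundleSketch],
    ) -> None:
        self._retriever = retriever
        self._transform = transform

    def retrieve(self, query_str: str) -> List[str]:
        bundle = self._transform(BundleSketch(query_str=query_str))
        return self._retriever.retrieve(bundle)


def hyde_like(bundle: BundleSketch) -> BundleSketch:
    # Stand-in for the LLM call; a real HyDE transform would generate this text.
    doc = f"A hypothetical answer to: {bundle.query_str}"
    return BundleSketch(query_str=bundle.query_str, custom_embedding_strs=[doc])


inner = RecordingRetriever()
wrapped = TransformWrapperSketch(inner, hyde_like)
wrapped.retrieve("the input query string")

# The inner retriever still sees the custom embedding strings.
print(inner.seen[0].custom_embedding_strs)
```

Because the transform runs inside the wrapper, the pipeline only has to pass a plain query string into it, which is the one thing the pipeline link does preserve.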