microsoft / promptflow

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.

https://microsoft.github.io/promptflow/

MIT License

[BUG] 50 documents returned rather than `top_k` for `VectorDB` Tool using AI Search Connection #1843

Closed: msha1026 closed this issue 4 months ago

msha1026 commented 4 months ago

Describe the bug

In AzureML prompt flow studio, I am using the VectorDB lookup tool connected to AI Search, with the top_k input set to 1. However, after running the flow, the output of the VectorDB lookup node contains 50 documents. I have tried various values for top_k but always receive 50 documents back.

How To Reproduce the bug

Steps to reproduce the behavior, and how frequently the bug occurs:

Create a chat flow with the following structure:

id: bring_your_own_data_chat_qna
name: Copilot Chat
inputs:
  chat_history:
    type: list
    default:
    - role: system
      content: You are an AI assistant that helps people find information.
    - role: user
      content: hi there this is a test
    - role: assistant
      content: Hello! How can I assist you today?
    is_chat_input: false
  chat_input:
    type: string
    default: What is the capital of Germany?
    is_chat_input: false
outputs: {}
nodes:
- name: embed_the_question
  type: python
  source:
    type: package
    tool: promptflow.tools.embedding.embedding
  inputs:
    connection: openai_connection
    deployment_name: text-embedding-ada-002
    input: ${inputs.chat_input}
  use_variants: false
- name: azure_ai_search_lookup
  type: python
  source:
    type: package
    tool: promptflow_vectordb.tool.vector_db_lookup.VectorDBLookup.search
  inputs:
    connection: ai_search_connection
    index_name: chunked
    text_field: text
    vector_field: embedding
    search_params:
      search: What is the capital of Germany?
    search_filters:
      filter: ""
    vector: ${embed_the_question.output}
    top_k: 1
  use_variants: false
node_variants: {}
environment:
  python_requirements_txt: requirements.txt

Run the flow and review the output of the azure_ai_search_lookup node.
Notice that 50 documents are returned instead of 1 (top_k).

Expected behavior

Only top_k documents are returned from the azure_ai_search_lookup node.

Screenshots

Small snippet of the array of documents returned

Running Information(please complete the following information):

Azure ML prompt flow using automatic runtime

Additional context N/A
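
Until the flow is moved to a tool that honors top_k, one stopgap is to trim the lookup output in a downstream Python node. The sketch below assumes the node output is a list already ordered by relevance (the thread does not show the full output shape), and `take_top_k` is a hypothetical helper name, not part of any promptflow package. Fifty is also the default number of results Azure AI Search returns when a query carries no `top` value, so the setting may simply not have been forwarded; the thread does not confirm the cause.

```python
from typing import Any, List


def take_top_k(documents: List[Any], top_k: int) -> List[Any]:
    """Return at most top_k items from the front of a ranked result list."""
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    # Slicing is safe even when the list is shorter than top_k.
    return documents[:top_k]


if __name__ == "__main__":
    # Stand-in for the 50 documents the lookup node returned.
    fake_results = [{"id": i} for i in range(50)]
    print(len(take_top_k(fake_results, 1)))
```

This only hides the symptom: the search service still returns the larger result set, so it does not reduce latency or payload size.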

D-W- commented 4 months ago

Hi @dans-msft @Adarsh-Ramanathan, please take a look at this issue. @msha1026, we currently only track prompt flow SDK/CLI issues here. For portal UI bugs, please create an OCV in the portal here:

Adarsh-Ramanathan commented 4 months ago

@msha1026, Vector DB Lookup is on the deprecation path, could you upgrade your flow to its replacement - the preview Index Lookup tool (not Vector Index Lookup, which is also on the same deprecation path), and let us know if the issue persists?

https://learn.microsoft.com/en-us/azure/ai-studio/how-to/prompt-flow-tools/index-lookup-tool

msha1026 commented 4 months ago

Thank you for the speedy replies!

I have tried the preview Index Lookup tool in AzureML studio, and that did resolve the issue. Thank you!

@Adarsh-Ramanathan Is this preview tool also supported in VSCode? If so, what packages need to be installed in order to also run this in the VSCode prompt flow extension?

For reference, this is my local dev environment:

VSCode prompt flow extension: 1.9.2
promptflow: 1.4.1
promptflow-tools: 1.1.0
promptflow-vectordb: 0.2.3

Adarsh-Ramanathan commented 4 months ago

@msha1026, you might need to install pymongo as well - this should be an extra, but at present, is a required dependency. This is a known bug that will be fixed with the upcoming release. If you run pf tool list in your env, you should see a ModuleNotFoundError in the output complaining that pymongo wasn't found.
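
A quick way to confirm this before running pf tool list is to ask Python whether the module can be located at all. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and it only handles top-level module names such as pymongo.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pymongo is the module named in this thread; add others as needed.
    for name in missing_modules(["pymongo"]):
        print(f"missing: {name}")
```

Run it with the same interpreter the VSCode prompt flow extension uses, otherwise the result describes a different environment.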

msha1026 commented 4 months ago

@Adarsh-Ramanathan is there a timeline on when the Vector DB Lookup tool will be officially deprecated? Is there also a timeline on the upcoming release for the dependency fix?

Also, thanks for the heads up on the missing pymongo. I was also missing azureml-rag[cognitive-search]. In case anyone has the same issues as I did, this is the package list I had to install to get the tool working in VSCode:

promptflow[azure]
promptflow-tools
promptflow-vectordb
azureml-rag[cognitive-search]
pymongo

and my pip freeze resulted in the following:

aiohttp==3.9.1
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.2.0
async-timeout==4.0.3
attrs==23.2.0
azure-ai-ml==1.12.1
azure-common==1.1.28
azure-core==1.29.7
azure-identity==1.15.0
azure-mgmt-core==1.4.0
azure-search-documents==11.4.0b8
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
azure-storage-file-share==12.15.0
azureml-dataprep==5.1.3
azureml-dataprep-native==41.0.0
azureml-dataprep-rslex==2.22.2
azureml-fsspec==1.3.0
azureml-rag==0.2.24.1
blinker==1.7.0
cachetools==5.3.2
cattrs==23.2.3
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
cryptography==41.0.7
dataclasses-json==0.6.3
distro==1.9.0
dnspython==2.5.0
docutils==0.20.1
exceptiongroup==1.2.0
faiss-cpu==1.7.4
filelock==3.13.1
filetype==1.2.0
Flask==3.0.1
frozenlist==1.4.1
fsspec==2023.12.2
gitdb==4.0.11
GitPython==3.1.41
google-api-core==2.15.0
google-auth==2.27.0
google-search-results==2.4.1
googleapis-common-protos==1.62.0
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.2
httpx==0.26.0
idna==3.6
importlib-metadata==7.0.1
isodate==0.6.1
itsdangerous==2.1.2
jaraco.classes==3.3.0
Jinja2==3.1.3
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
keyring==24.3.0
langchain==0.0.348
langchain-core==0.0.13
langsmith==0.0.83
MarkupSafe==2.1.4
marshmallow==3.20.2
mmh3==4.1.0
more-itertools==10.2.0
msal==1.26.0
msal-extensions==1.1.0
msrest==0.7.1
multidict==6.0.4
mypy-extensions==1.0.0
numpy==1.26.3
oauthlib==3.2.2
openai==1.10.0
opencensus==0.11.4
opencensus-context==0.1.3
opencensus-ext-azure==1.1.13
packaging==23.2
pandas==2.2.0
pillow==10.2.0
platformdirs==4.1.0
portalocker==2.8.2
promptflow==1.4.1
promptflow-tools==1.1.0
promptflow_vectordb==0.2.3
protobuf==4.25.2
psutil==5.9.8
pyarrow==14.0.2
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.3
pydantic_core==2.14.6
pydash==7.0.5
PyJWT==2.8.0
pymongo==4.6.1
python-dateutil==2.8.2
python-dotenv==1.0.1
pytz==2023.3.post1
pywin32==306
pywin32-ctypes==0.2.2
PyYAML==6.0.1
referencing==0.32.1
regex==2023.12.25
requests==2.31.0
requests-cache==1.1.1
requests-oauthlib==1.3.1
rpds-py==0.17.1
rsa==4.9
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.8
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
SQLAlchemy==2.0.25
strictyaml==1.7.3
tabulate==0.9.0
tenacity==8.2.3
tiktoken==0.5.2
tqdm==4.66.1
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2023.4
url-normalize==1.4.3
urllib3==2.1.0
waitress==2.1.2
Werkzeug==3.0.1
yarl==1.9.4
zipp==3.17.0

Adarsh-Ramanathan commented 4 months ago

@msha1026, the tools will be marked as deprecated with the next package release, although I don't have an exact timeline on decommissioning them altogether. The next release should be out this week, and that'll include the fix for the pymongo dependency.

As for the other dependencies, they're bundled with the azure extra for promptflow-vectordb; if you install that extra, you should get all of them as transitive requirements.
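
To see what an installed package declares under a given extra, the standard library's importlib.metadata can list its requirement strings. The sketch below is an approximation: it matches the extra by a simple substring check on the marker text rather than parsing markers properly, and both function names are hypothetical.

```python
from importlib import metadata
from typing import List, Optional


def filter_by_extra(requirements: Optional[List[str]], extra: str) -> List[str]:
    """Keep only requirement strings whose marker mentions the given extra."""
    needles = (f'extra == "{extra}"', f"extra == '{extra}'")
    return [r for r in (requirements or []) if any(n in r for n in needles)]


def requirements_for_extra(dist_name: str, extra: str) -> List[str]:
    """List what an installed distribution declares under one extra."""
    try:
        declared = metadata.requires(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    return filter_by_extra(declared, extra)


if __name__ == "__main__":
    for req in requirements_for_extra("promptflow-vectordb", "azure"):
        print(req)
```

An empty result means either the package is not installed or it declares nothing under that extra; the sketch does not distinguish the two.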

msha1026 commented 4 months ago

Awesome, thank you for the details, @Adarsh-Ramanathan. And yes, you are right: the azure extra on promptflow-vectordb does install the transitive requirements. I will close out this issue, since the preview Index Lookup tool works for us both locally and in the studio.

jayendranarumugam commented 4 months ago

Hi @msha1026, when I tried the new preview Index Lookup tool, I got the following error:

promptflow._utils.tool_utils.DynamicListError: Unable to display list of items due to 'Error when calling function promptflow_vectordb.tool.common_index_lookup_utils.list_available_query_types: tool_ui_callback.<locals>.wrapped() missing 3 required positional arguments: 'subscription_id', 'resource_group_name', and 'workspace_name''. Please contact the tool author/support team for troubleshooting assistance.

I already have a .azureml/config.json with the below content.

{
  "subscription_id": "xxxxx",
  "resource_group": "rg-test",
  "workspace_name": "demo123"
}

I tried specifying the inputs as follows:

    mlindex_content: azureml://subscriptions/xxx/resourcegroups/rg-test/providers/Microsoft.MachineLearningServices/demo123/jaymachinelearning/data/lime-yuca-33pkdyvm05/versions/1
    query_type: Hybrid
    top_k: 3
    queries: ${embed_the_question.output}

I still get the same error in the UI, but when I debugged the flow I got the following:

2024-02-01 09:55:43 +0000   51435 execution.flow     INFO     Node modify_query_with_history completes.
2024-02-01 09:55:43 +0000   51435 execution.flow     INFO     Executing node embed_the_question. node run id: 8861a96f-9c35-463d-934e-c89a1eeae5b0_embed_the_question_0
2024-02-01 09:55:43 +0000   51435 execution.flow     INFO     Node embed_the_question completes.
2024-02-01 09:55:43 +0000   51435 execution.flow     INFO     Executing node search_question_from_indexed_docs. node run id: 8861a96f-9c35-463d-934e-c89a1eeae5b0_search_question_from_indexed_docs_0
2024-02-01 09:55:43 +0000   51435 execution          ERROR    Node search_question_from_indexed_docs in line 0 failed. Exception: Execution failure in 'search_question_from_indexed_docs': (AttributeError) 'str' object has no attribute 'get'.
Traceback (most recent call last):
  File "/workspaces/demo/.venv/lib/python3.11/site-packages/promptflow/_core/flow_execution_context.py", line 194, in _invoke_tool_with_timer
    return f(**kwargs)
           ^^^^^^^^^^^
  File "/workspaces/demo/.venv/lib/python3.11/site-packages/promptflow/_core/tracer.py", line 220, in wrapped
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/demo/.venv/lib/python3.11/site-packages/promptflow_vectordb/tool/common_index_lookup.py", line 59, in search
    index = MLIndex(mlindex_config=mlindex_config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/demo/.venv/lib/python3.11/site-packages/azureml/rag/mlindex.py", line 111, in __init__
    self.index_config = mlindex_config.get("index", {})
                        ^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'get'
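
The traceback shows MLIndex calling .get on a plain string, so the azureml:// URI given as mlindex_content apparently reached it unparsed rather than as a mapping with an index section. A minimal sketch of a guard that would surface this earlier with a clearer message; `ensure_mapping` is a hypothetical helper, not part of promptflow-vectordb or azureml-rag.

```python
from collections.abc import Mapping
from typing import Any


def ensure_mapping(mlindex_config: Any) -> Mapping:
    """Fail with a readable message when the config is not dict-like."""
    if not isinstance(mlindex_config, Mapping):
        raise TypeError(
            "mlindex_config must be a mapping with keys such as 'index' and "
            f"'embeddings', got {type(mlindex_config).__name__}"
        )
    return mlindex_config


if __name__ == "__main__":
    # A bare URI string reproduces the failure mode from the traceback.
    try:
        ensure_mapping("azureml://<uri>")
    except TypeError as err:
        print(err)
    # A parsed config passes through and supports .get, as MLIndex expects.
    print(ensure_mapping({"index": {"kind": "acs"}}).get("index", {}))
```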

msha1026 commented 4 months ago

@jayendranarumugam Our flow.dag.yaml looks different from yours. This is how ours looks:

mlindex_content: >
  embeddings:
    api_base: https://<azure-openai-name>.api.cognitive.microsoft.com/
    api_type: azure
    api_version: 2023-07-01-preview
    batch_size: '1'
    connection:
      id: /subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.MachineLearningServices/workspaces/<azureml-wksp-name>/connections/<azure_openai_connection_name>
    connection_type: workspace_connection
    deployment: text-embedding-ada-002
    dimension: 1536
    kind: open_ai
    model: text-embedding-ada-002
    schema_version: '2'
  index:
    api_version: 2023-07-01-preview
    connection:
      id: /subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.MachineLearningServices/workspaces/<azureml-wksp-name>/connections/<azure-search-connection-name>
    connection_type: workspace_connection
    endpoint: https://<azure-ai-search-name>.search.windows.net
    engine: azure-sdk
    field_mapping:
      content: <name_of_field_in_search_for_document_contents>
      embedding: <name_of_embedding_field_in_search>
      metadata: <name_of_field_in_search_for_document_id>
    index: <index_name>
    kind: acs
    semantic_configuration_name: null
queries: ${modify_query_with_history.output} #This is the unembedded query. This node should handle embedding the query for you
query_type: Hybrid (vector + keyword)
top_k: 3

pgr-lopes commented 1 month ago

Did you ever manage to get this working? I am getting the exact same error when using an existing Prompt Flow in my VSCode environment. My YAML definition looks exactly like the one posted in the response, so I'm not sure where else the problem might be.

Adarsh-Ramanathan commented 1 month ago

@pgr-lopes, you'll need to configure a default subscription/resourcegroup/workspace in your shell before attempting to configure index lookup:

az login
az account set --subscription <subscription_id>
az configure --defaults group=<resource_group_name> workspace=<workspace_name>

@DaweiCai FYI - we should be fetching these values from the config.json, instead of expecting users to configure the same information in two different ways.
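
Until that is fixed, the duplication can be reduced by generating the az commands from the existing config.json. This is a sketch under the assumption that the file uses the three keys shown earlier in this thread (subscription_id, resource_group, workspace_name); `az_default_commands` is a hypothetical helper, and it only prints the commands rather than running them.

```python
import json
import shlex
from pathlib import Path
from typing import List


def az_default_commands(config_path: Path) -> List[str]:
    """Build the az commands from this thread using values in config.json."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Quote each value so unusual characters cannot break the shell command.
    subscription = shlex.quote(config["subscription_id"])
    group = shlex.quote(config["resource_group"])
    workspace = shlex.quote(config["workspace_name"])
    return [
        f"az account set --subscription {subscription}",
        f"az configure --defaults group={group} workspace={workspace}",
    ]


if __name__ == "__main__":
    path = Path(".azureml/config.json")
    if path.exists():
        print("\n".join(az_default_commands(path)))
```

Run az login first, then paste the printed commands into the same shell.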

@pgr-lopes, this issue is closed; can you please open a new one for us to track?

pgr-lopes commented 1 month ago

@Adarsh-Ramanathan thank you, that makes sense. That's what was confusing me, since I was already specifying those values in the config.json file and the OpenAI connections were working just fine.

I'll try that out using the command line shell and I'll open a new issue for tracking, thanks!