microsoft / promptflow

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
MIT License

[BUG] Promptflow ParserError/IndexError #2046

Closed: Kevinlee49 closed this issue 6 months ago

Kevinlee49 commented 8 months ago

I'm using hybrid search and semantic search in the same flow and got the following errors. Do you happen to know what the solution is?

This is the output from the semantic search node.

"Execution failure in 'semantic_search': (IndexError) string index out of range"

This is the output from the hybrid search node.

"Execution failure in 'hybrid_search': (ParserError) while parsing a block mapping\n in \"\", line 2, column 3:\n api_base: ... \n ^ (line: 2)\nexpected , but found ''\n in \"\", line 9, column 15:\n deployment: text-embedding-ada-002\n ^ (line: 9)"

brynn-code commented 8 months ago

Hi, please provide the tool name if you are using one of our built-in tools; semantic_search and hybrid_search appear to be the names of your flow nodes. (BTW, GitHub issues mostly target the open-source version of promptflow. If you are using promptflow inside an Azure Machine Learning workspace, submitting feedback through OCV (as below) is highly recommended.) image

You can find the built-in tool list under 'More tools': image

Kevinlee49 commented 8 months ago

@brynn-code Hello, thanks for the reply. Oh yes, I didn't specify the tool. These two nodes are index lookup tools. As you recommended, I left feedback in OCV.

brynn-code commented 8 months ago

@dans-msft and @Adarsh-Ramanathan, could you please help with the index lookup tool issue? Thanks!

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49, can you share a minimal flow yaml that results in a repro so we can look into this?

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Here it is:

environment:
  python_requirements_txt: requirements.txt
inputs:
  question:
    type: string
    is_chat_input: false
    default: blah blah blah
outputs:
  output:
    type: string
    reference: ${answer_the_question_with_context.output}
    evaluation_only: false
    is_chat_output: true
nodes:
- name: answer_the_question_with_context
  type: llm
  source:
    type: code
    path: answer_the_question_with_context.jinja2
  inputs:
    deployment_name: gpt-35-turbo-16k-v0613
    temperature: 0.7
    top_p: 1
    max_tokens: 1000
    response_format:
      type: text
    presence_penalty: 0
    frequency_penalty: 0
    final_prompt: ${final_prompt.output}
  provider: AzureOpenAI
  connection: test-connection
  api: chat
  module: promptflow.tools.aoai
  aggregation: false
  use_variants: false
- name: input_classify_and_rephrase
  type: prompt
  source:
    type: code
    path: input_classify_and_rephrase.jinja2
  inputs:
    question: ${inputs.question}
  use_variants: false
- name: semantic_search
  type: python
  source:
    type: package
    tool: promptflow_vectordb.tool.common_index_lookup.search
  inputs:
    mlindex_content: >
      embeddings:
        api_base: https://
        api_type: azure
        api_version: 2023-07-01-preview
        batch_size: '16'
        connection:
          id: /subscriptions/
        connection_type: workspace_connection
        deployment: text-embedding-ada-002
        dimension: 1536
        file_format_version: '2'
        kind: open_ai
        model: text-embedding-ada-002
        schema_version: '2'
      index:
        api_version: 2023-07-01-preview
        connection:
          id: /subscriptions/
        connection_type: workspace_connection
        endpoint: https://
        engine: azure-sdk
        field_mapping:
          content: content
          embedding: contentVector
          filename: filepath
          metadata: meta_json_string
          title: title
          url: url
        index: pf-rpindex
        kind: acs
        semantic_configuration_name: azureml-default
    queries: ${python_query_analyze_and_rephrase.output}
    query_type: Semantic
    top_k: 3
  use_variants: false
- name: hybrid_search
  type: python
  source:
    type: package
    tool: promptflow_vectordb.tool.common_index_lookup.search
  inputs:
    mlindex_content: >
      embeddings:
        api_base: https://
        api_type: azure
        api_version: 2023-07-01-preview
        batch_size: '16'
        connection:
          id: /subscriptions/
        connection_type: workspace_connection
        deployment: text-embedding-ada-002
        dimension: 1536
        file_format_version: '2'
        kind: open_ai
        model: text-embedding-ada-002
        schema_version: '2'
      index:
        api_version: 2023-07-01-preview
        connection:
          id: /subscriptions/
        connection_type: workspace_connection
        endpoint: https://
        engine: azure-sdk
        field_mapping:
          content: content
          embedding: contentVector
          filename: filepath
          metadata: meta_json_string
          title: title
          url: url
        index: pf-rpindex
        kind: acs
        semantic_configuration_name: azureml-default
    queries: ${input_classify_and_rephrase.output}
    query_type: Hybrid (vector + keyword)
    top_k: 3
  use_variants: false
- name: generate_context
  type: python
  source:
    type: code
    path: generate_context.py
  inputs:
    hybrid_search_output: ${hybrid_search.output}
    semantic_search_output: ${semantic_search.output}
  use_variants: false
- name: final_prompt
  type: prompt
  source:
    type: code
    path: final_prompt.jinja2
  inputs:
    context: ${generate_context.output}
    question: ${inputs.question}
  use_variants: false
- name: python_query_analyze_and_rephrase
  type: python
  source:
    type: code
    path: python_query_analyze_and_rephrase.py
  inputs:
    question: ${inputs.question}
  use_variants: false

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Hello, have you found a solution for this issue?

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49 , not yet. I'll post an update once I've investigated.

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49 , I'm assuming you have actual URLs etc in the various mlindex fields (like api_base and similar), and you've just redacted them before posting on here. I'm having a hard time reproing this issue; can you capture the outputs of the python_query_analyze_and_rephrase and input_classify_and_rephrase steps when executing your flow?

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Of course, I have my actual URLs etc. in my fields. The only thing is, I realized that only the hybrid search part is problematic. It keeps saying that I need vector fields.

The output of python_query_analyze_and_rephrase has the form [topic_keyword] [question], e.g. weather [how's the weather today?]. The output of input_classify_and_rephrase is just the prompt, because it is a jinja2 file.

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Or do you think it is because I am using index lookup twice in one flow? The semantic_search and hybrid_search nodes differ only in their queries (one comes from the Python node's output, the other from the jinja2 node's output) and in their query type, which is Semantic for one and Hybrid (vector + keyword) for the other.

Run failed: Execution failure in 'hybrid_search': (HttpResponseError) (InvalidRequestParameter) At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Parameter name: vector.fields
Code: InvalidRequestParameter
Message: At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Parameter name: vector.fields
Exception Details: (InvalidVectorQuery) At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Code: InvalidVectorQuery
Message: At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.

hybrid_search : Execution failure in 'hybrid_search': (HttpResponseError) (InvalidRequestParameter) At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Parameter name: vector.fields
Code: InvalidRequestParameter
Message: At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Parameter name: vector.fields
Exception Details: (InvalidVectorQuery) At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.
Code: InvalidVectorQuery
Message: At least one vector field needs to be selected explicitly using the 'vector.fields' parameter.

Adarsh-Ramanathan commented 8 months ago

Can you share a run's example output for input_classify_and_rephrase? I'm not able to repro this issue; our best bet is to eliminate variables, isolate the problem down to the actual lookup nodes, and run them with a configuration that's as close as possible to the one you're running, inputs and all.

Can you also provide info about the runtime version you're using, and whether you've installed/updated/overridden any packages? A pip freeze dump would be useful.

To answer your question: no, using multiple instances of index lookup is definitely supported. The reason hybrid is failing and semantic is not is that ACS doesn't need a vector input for semantic search, while hybrid does. Your MLIndex looks like it's configured correctly, so we should have been able to produce a vector to send to ACS; this is what we need to get to the bottom of.
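
The semantic/hybrid distinction can be illustrated with a minimal sketch. This is not the index lookup tool's actual code: `build_search_request`, `fake_embed`, and the request keys are hypothetical names made up for illustration. Only the query type strings and the `contentVector` field name come from the flow YAML in this thread.

```python
from typing import Callable, List, Optional

# Hypothetical model of the behavior described above; NOT the
# promptflow_vectordb implementation. Request keys are illustrative only.
VECTOR_QUERY_TYPES = {"Hybrid (vector + keyword)"}


def build_search_request(
    query: str,
    query_type: str,
    embed: Optional[Callable[[str], List[float]]],
    vector_field: str = "contentVector",
) -> dict:
    """Build an illustrative search request for a single query."""
    request = {"search": query, "query_type": query_type}
    if query_type in VECTOR_QUERY_TYPES:
        # Hybrid search needs an embedding of the query and a target vector field.
        if embed is None:
            raise ValueError(f"{query_type!r} needs an embedding function")
        request["vector"] = embed(query)
        request["vector_fields"] = vector_field
    return request


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding call (assumption for this sketch).
    return [float(len(text)), 0.0, 1.0]


semantic = build_search_request("how's the weather today?", "Semantic", None)
hybrid = build_search_request(
    "how's the weather today?", "Hybrid (vector + keyword)", fake_embed
)
```

In this sketch the semantic request carries only text, so it can succeed even if the embedding step is broken, while the hybrid request cannot be built without a vector. That matches the symptom where only hybrid_search reported the 'vector.fields' error.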

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan In VS Code, the errors look like this: image image image

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan

Can you share a run's example output for input_classify_and_rephrase? ->

assistant: ~~~ system: ~~~~ conversation: ~~~ user: ~~~

Can you also provide info about the runtime version you're using, and if you've installed/updated/overridden any packages? -> How can I show this?

Adarsh-Ramanathan commented 8 months ago

assistant: ~~~
system: ~~~~
conversation: ~~~
user: ~~~

Is this a string? A list of strings? An object?

Can you also provide info about the runtime version you're using, and if you've installed/updated/overridden any packages? -> how can I show this?

You could add a Python step with these contents to your flow and grab its stdout:

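from promptflow import tool

@tool
def my_python_tool(input1: str) -> str:
    # pip's internal freeze helper yields one "name==version" line per package.
    from pip._internal.operations import freeze
    pkgs = list(freeze.freeze())
    for pkg in pkgs:
        print(pkg)  # shows up in the node's stdout
    return "\n".join(pkgs)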
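
As an alternative, here is a sketch that avoids pip's internal modules by using the standard library's `importlib.metadata` (Python 3.8+). The helper name `list_installed_packages` is made up for this example, and the output is only approximately what pip freeze prints (for instance, editable installs are not marked as such).

```python
from importlib import metadata
from typing import List


def list_installed_packages() -> List[str]:
    """Return sorted 'name==version' strings for installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a Name field
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


packages = list_installed_packages()
for line in packages:
    print(line)
```

The body of `list_installed_packages` could be dropped into a flow's Python tool in place of the pip-based freeze call.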

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan

Is this a string? A list of strings? An object? -> It's a string.

Can you also provide info about the runtime version you're using, and if you've installed/updated/overridden any packages? -> runtimeversion.txt

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49 , I'm still unable to reproduce your error.

Here's the minimal flow I built off of your example: image

Flow yaml:

inputs:
  question:
    type: string
    default: blah blah blah
    is_chat_input: false
outputs:
  output:
    type: string
    reference: ${generate_context.output}
    evaluation_only: false
    is_chat_output: true
nodes:
- name: input_classify_and_rephrase
  type: prompt
  source:
    type: code
    path: input_classify_and_rephrase.jinja2
  inputs:
    question: ${inputs.question}
  use_variants: false
- name: hybrid_search
  type: python
  source:
    type: package
    tool: promptflow_vectordb.tool.common_index_lookup.search
  inputs:
    mlindex_content: >
      embeddings:
        api_base: ****
        api_type: azure
        api_version: 2023-07-01-preview
        batch_size: '16'
        connection:
          id: ****
        connection_type: workspace_connection
        deployment: text-embedding-ada-002
        dimension: 1536
        file_format_version: '2'
        kind: open_ai
        model: text-embedding-ada-002
        schema_version: '2'
      index:
        api_version: 2023-07-01-preview
        connection:
          id: ****
        connection_type: workspace_connection
        endpoint: ****
        engine: azure-sdk
        field_mapping:
          content: content
          embedding: contentVector
          filename: filepath
          metadata: meta_json_string
          title: title
          url: url
        index: ****
        kind: acs
        semantic_configuration_name: azureml-default
    queries: ${input_classify_and_rephrase.output}
    query_type: Hybrid (vector + keyword)
    top_k: 3
  use_variants: false
- name: generate_context
  type: python
  source:
    type: code
    path: generate_context.py
  inputs:
    search_result: ${hybrid_search.output}
  use_variants: false
node_variants: {}
environment:
  python_requirements_txt: requirements.txt

requirements.txt:

promptflow_vectordb[azure]

generate_context.py:

from typing import List
from promptflow import tool
import json
from pip._internal.operations import freeze

@tool
def generate_prompt_context(search_result: List[dict]) -> str:
    return json.dumps(list(freeze.freeze()))

input_classify_and_rephrase.jinja2:

system: You are a helpful bot that finds answers to questions.
user: {{ question }}
assistant:

I'm running this with an automatic runtime in westus2. Can you try running this flow and see if your issue still persists?

Your issue in VS Code is unrelated. IIRC, you need to configure a number of Azure defaults beforehand to get things to play nice: https://microsoft.github.io/promptflow/how-to-guides/develop-a-tool/create-dynamic-list-tool-input.html#faqs

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan I tested it already. It works when I use hybrid search alone, but when I use two index lookup nodes together, it doesn't. Can you add one more node for semantic search and test it again?

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Interestingly, when the error occurs, if I run each node individually, everything works through to the end of the flow. But if I click the run button for the whole flow, it reports errors in the index lookup nodes. The two search nodes worked together several times (about 4 times in a row), but after the 4th run they stopped working again. So it works sometimes, but most attempts produce errors.

If you don't mind, can we set up a quick call or meeting? I'd like to show you this error directly.

Adarsh-Ramanathan commented 8 months ago

Alright, I finally managed to get a repro going! The key was to have more than one lookup node, and to run the flow several times, since it doesn't repro deterministically!
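
Because the failure is intermittent, a single run proves little. Below is a small sketch of a harness that runs something repeatedly and tallies the outcomes. `tally_runs` and `flaky_flow` are hypothetical names; `flaky_flow` is a deterministic stand-in for submitting the whole flow, not a real promptflow call.

```python
from collections import Counter
from typing import Callable


def tally_runs(run_once: Callable[[], object], attempts: int = 10) -> Counter:
    """Call run_once repeatedly and count outcomes by exception type."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            run_once()
        except Exception as exc:  # broad on purpose: we are cataloguing failures
            outcomes[type(exc).__name__] += 1
        else:
            outcomes["ok"] += 1
    return outcomes


# Stand-in for a full flow run: fails on every third call so the
# example behaves the same way each time it is executed.
calls = {"n": 0}


def flaky_flow() -> str:
    calls["n"] += 1
    if calls["n"] % 3 == 0:
        raise IndexError("string index out of range")
    return "answer"


result = tally_runs(flaky_flow, attempts=9)
print(result)
```

In a real investigation, `run_once` would submit the flow with fixed inputs, so that the only thing varying between attempts is the run itself.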

I'll investigate further and post updates as I have them.

If you have additional info you want to share over a call, then sure, feel free to set up some time.

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan That's correct. I left feedback requests a few weeks ago on the Azure ML studio promptflow page, but I never got a response. You're the fastest person to reply to my question.

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49, just posting an update. I spent some time investigating yesterday, and I think I understand why this bug occurs; we'll start working on a patch soon. Unfortunately, the real issue is a couple of links down the dependency chain, but I think we can work around it in the tools package.

If you'd like to test a release candidate to help verify (when we have one, that is), please reach out.

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Sure, thank you! I look forward to your message and the update!

Adarsh-Ramanathan commented 8 months ago

@Kevinlee49, I have a candidate runtime image for you to test. I've run it several times through the flow I used to repro, and as far as I can tell, the issue is fixed. Would you be willing to test on your flow to confirm?

You can pull the image from adramapfdev.azurecr.io/promptflow-runtime:20240314.

You'll need to a) pull this image and re-push it to your workspace ACR, b) create a custom environment in your workspace with the image from your workspace ACR and an empty conda file, and c) update your CI runtime, choose the custom environment option, and pick the environment you created in (b).
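
Step (a) can be sketched as follows. The script only builds and prints the CLI commands rather than executing them. `mirror_commands` is a hypothetical helper and `myworkspaceacr` is a placeholder registry name; substitute your workspace's ACR. It assumes Docker and the Azure CLI are installed, and that you already have access to the source image.

```python
import shlex
from typing import List

# Image tag taken from the comment above.
SOURCE_IMAGE = "adramapfdev.azurecr.io/promptflow-runtime:20240314"


def mirror_commands(source_image: str, target_registry: str) -> List[List[str]]:
    """Return the CLI commands that copy source_image into target_registry."""
    # Keep the repository name and tag; swap only the registry host.
    repo_and_tag = source_image.split("/", 1)[1]
    target_image = f"{target_registry}.azurecr.io/{repo_and_tag}"
    return [
        ["docker", "pull", source_image],
        ["docker", "tag", source_image, target_image],
        ["az", "acr", "login", "--name", target_registry],
        ["docker", "push", target_image],
    ]


# "myworkspaceacr" is a placeholder, not a real registry.
commands = mirror_commands(SOURCE_IMAGE, "myworkspaceacr")
for cmd in commands:
    print(shlex.join(cmd))
```

Steps (b) and (c) are done in the workspace itself, so they are not covered by this sketch.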

ChenJieting commented 8 months ago

Issue #2026 pertains to the same question as this issue.

Kevinlee49 commented 8 months ago

@Adarsh-Ramanathan Thank you, I will try it today!

@Adarsh-Ramanathan I tried to create a new runtime following your instructions, but I got this error. Do you know how to handle this?

Runtime pf-test2-runtime create failed. FlowRuntime pf-test2-runtime in compute instance mylee-compute2 is not ready: runtime starting timeout. Please try to create a new compute instance to hold runtime

github-actions[bot] commented 7 months ago

Hi, we're sending this friendly reminder because we haven't heard back from you in 30 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 7 days of this comment, the issue will be automatically closed. Thank you!