run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Query Engine seems to be using wrong syntax. #10324

Closed GauravatGrowhut closed 3 months ago

GauravatGrowhut commented 7 months ago

Bug Description

Dependencies:

llama-index: 0.9.39
jsonpath-ng: 1.6.1

Code:

import json

from llama_index.indices.service_context import ServiceContext
from llama_index.indices.struct_store import JSONQueryEngine
from llama_index.llms import OpenAI

# Load the JSON data to query and the schema that describes it.
with open("test.json") as course_info:
    courses = json.load(course_info)

with open("course_schema.json") as schema:
    course_schema = json.load(schema)

llm = OpenAI(model="gpt-3.5-turbo-1106")
service_context = ServiceContext.from_defaults(llm=llm)

# The engine asks the LLM for a JSONPath expression and evaluates it on `courses`.
nl_query_engine = JSONQueryEngine(
    json_value=courses,
    json_schema=course_schema,
    service_context=service_context,
)

# `query` is the prompt string (posted further down in this thread).
nl_response = nl_query_engine.query(query)

Version

0.9.39

Steps to Reproduce

Run the code above with the following JSON files:

test.json

{
  "courses": [
    {
      "courseName": "Data Science",
      "courseDescription": "An advanced course covering machine learning, statistics, and data visualization.",
      "coursePrerequisites": ["Bachelor's in Computer Science", "Statistics knowledge"],
      "courseTags": ["Data Science", "Machine Learning", "Statistics"],
      "fees": "$20,000",
      "location": "University of XYZ, New York"
    },
    {
      "courseName": "Cybersecurity",
      "courseDescription": "Focus on advanced techniques in securing digital platforms against cyber threats.",
      "coursePrerequisites": ["Bachelor's in Computer Science", "Knowledge of Networking"],
      "courseTags": ["Cybersecurity", "Network Security", "Digital Platforms"],
      "fees": "$18,000",
      "location": "ABC University, California"
    },
    {
      "courseName": "Artificial Intelligence",
      "courseDescription": "A comprehensive course on AI, machine learning, and deep learning.",
      "coursePrerequisites": ["Bachelor's in Computer Science", "Knowledge of Algorithms"],
      "courseTags": ["Artificial Intelligence", "Machine Learning", "Deep Learning"],
      "fees": "$25,000",
      "location": "PQR University, Massachusetts"
    }
  ]
}

course_schema.json

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "courses": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "courseName": {
            "type": "string"
          },
          "courseDescription": {
            "type": "string"
          },
          "coursePrerequisites": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "courseTags": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "fees": {
            "type": "string"
          },
          "location": {
            "type": "string"
          }
        },
        "required": ["courseName", "courseDescription", "coursePrerequisites", "courseTags", "fees", "location"]
      }
    }
  },
  "required": ["courses"]
}

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
JsonPathParserError                       Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/llama_index/indices/struct_store/json_query.py in default_output_processor(llm_output, json_value)
     55         try:
---> 56             datum: List[DatumInContext] = parse(expression).find(json_value)
     57             if datum:

10 frames
/usr/local/lib/python3.10/dist-packages/jsonpath_ng/ext/parser.py in parse(path, debug)
    171 def parse(path, debug=False):
--> 172     return ExtentedJsonPathParser(debug=debug).parse(path)

/usr/local/lib/python3.10/dist-packages/jsonpath_ng/parser.py in parse(self, string, lexer)
     43         lexer = lexer or self.lexer_class()
---> 44         return self.parse_token_stream(lexer.tokenize(string))
     45 

/usr/local/lib/python3.10/dist-packages/jsonpath_ng/parser.py in parse_token_stream(self, token_iterator, start_symbol)
     66 
---> 67         return new_parser.parse(lexer = IteratorToTokenStream(token_iterator))
     68 

/usr/local/lib/python3.10/dist-packages/ply/yacc.py in parse(self, input, lexer, debug, tracking, tokenfunc)
    332         else:
--> 333             return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
    334 

/usr/local/lib/python3.10/dist-packages/ply/yacc.py in parseopt_notrack(self, input, lexer, debug, tracking, tokenfunc)
   1200                         self.state = state
-> 1201                         tok = call_errorfunc(self.errorfunc, errtoken, self)
   1202                         if self.errorok:

/usr/local/lib/python3.10/dist-packages/ply/yacc.py in call_errorfunc(errorfunc, token, parser)
    191     _restart = parser.restart
--> 192     r = errorfunc(token)
    193     try:

/usr/local/lib/python3.10/dist-packages/jsonpath_ng/parser.py in p_error(self, t)
     80     def p_error(self, t):
---> 81         raise JsonPathParserError('Parse error at %s:%s near token %s (%s)'
     82                                   % (t.lineno, t.col, t.value, t.type))

JsonPathParserError: Parse error at 1:33 near token ( (()

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-19-f66194ef1045> in <cell line: 1>()
----> 1 nl_response = nl_query_engine.query(
      2     query,
      3 )
      4 raw_response = raw_query_engine.query(
      5     query,

/usr/local/lib/python3.10/dist-packages/llama_index/core/base_query_engine.py in query(self, str_or_query_bundle)
     38             if isinstance(str_or_query_bundle, str):
     39                 str_or_query_bundle = QueryBundle(str_or_query_bundle)
---> 40             return self._query(str_or_query_bundle)
     41 
     42     async def aquery(self, str_or_query_bundle: QueryType) -> RESPONSE_TYPE:

/usr/local/lib/python3.10/dist-packages/llama_index/indices/struct_store/json_query.py in _query(self, query_bundle)
    148             )
    149 
--> 150         json_path_output = self._output_processor(
    151             json_path_response_str,
    152             self._json_value,

/usr/local/lib/python3.10/dist-packages/llama_index/indices/struct_store/json_query.py in default_output_processor(llm_output, json_value)
     61                 results[key] = datum[0].value
     62         except Exception as exc:
---> 63             raise ValueError(f"Invalid JSON Path: {expression}") from exc
     64 
     65     return results

ValueError: Invalid JSON Path: $.courses[?(@.courseTags.includes('Computer Science') || @.coursePrerequisites.includes('Bachelor of Technology'))]

dosubot[bot] commented 7 months ago

🤖

Hello @GauravatGrowhut! 👋

I'm Dosu, a friendly bot designed to help you solve bugs, answer questions, and guide you towards becoming a proficient contributor to our projects. I'm here to assist you while we wait for a human maintainer to join us. If there's anything you need, don't hesitate to ask. Let's make your coding journey smoother together! 😄

The error you're encountering is due to the use of the includes method in your JSONPath query. The includes method is not a valid JSONPath operator. JSONPath supports a limited set of operators, which are primarily comparison operators like ==, !=, <, >, etc.

The query that's causing the issue seems to be $.courses[?(@.courseTags.includes('Computer Science') || @.coursePrerequisites.includes('Bachelor of Technology'))]. This query is trying to use the includes method to check if a certain string is included in the courseTags or coursePrerequisites arrays.

To fix this issue, you can use the in operator instead of includes. Here's how you can modify your query:

$.courses[?(@.courseTags in ['Computer Science'] || @.coursePrerequisites in ['Bachelor of Technology'])]

This query will return all courses where 'Computer Science' is in courseTags or 'Bachelor of Technology' is in coursePrerequisites.

Please replace the includes method with the in operator in your query and try again. If you continue to face issues, please provide the exact query you're using so I can assist you better.
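As a cross-check that does not depend on any JSONPath dialect, the membership test that the failing expression was attempting can be written in plain Python. This is only a sketch: `courses_matching` is a hypothetical helper (not part of LlamaIndex or jsonpath-ng), and the data is a trimmed copy of test.json from the report. It is also not confirmed in this thread that jsonpath-ng's filter grammar accepts `in` or `||`, so the rewritten expression above should be tested before relying on it.

```python
from typing import Any, Dict, Iterable, List

# Trimmed copy of test.json from the report (only the fields used here).
DATA: Dict[str, Any] = {
    "courses": [
        {"courseName": "Data Science",
         "coursePrerequisites": ["Bachelor's in Computer Science", "Statistics knowledge"],
         "courseTags": ["Data Science", "Machine Learning", "Statistics"]},
        {"courseName": "Cybersecurity",
         "coursePrerequisites": ["Bachelor's in Computer Science", "Knowledge of Networking"],
         "courseTags": ["Cybersecurity", "Network Security", "Digital Platforms"]},
        {"courseName": "Artificial Intelligence",
         "coursePrerequisites": ["Bachelor's in Computer Science", "Knowledge of Algorithms"],
         "courseTags": ["Artificial Intelligence", "Machine Learning", "Deep Learning"]},
    ]
}


def courses_matching(
    data: Dict[str, Any],
    tags: Iterable[str] = (),
    prerequisites: Iterable[str] = (),
) -> List[Dict[str, Any]]:
    """Return courses that have any of `tags` OR any of `prerequisites` (exact match)."""
    wanted_tags = set(tags)
    wanted_prereqs = set(prerequisites)
    return [
        course
        for course in data["courses"]
        if wanted_tags & set(course["courseTags"])
        or wanted_prereqs & set(course["coursePrerequisites"])
    ]


names = [c["courseName"] for c in courses_matching(DATA, tags=["Machine Learning"])]
print(names)  # ['Data Science', 'Artificial Intelligence']
```

With the values from the failing expression ('Computer Science' as a tag, 'Bachelor of Technology' as a prerequisite) this returns an empty list, because no entry in the sample data matches either string exactly.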

For more information on JSONPath syntax and operators, you can refer to the JSONPath GitHub page.

Let me know if this helps! If you have any other questions or run into any other issues, feel free to ask.

-Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

nerdai commented 7 months ago

Thanks @GauravatGrowhut for filing this bug. Can confirm that I'm able to reproduce the bug based on your provided code. Someone will look into this shortly!

nerdai commented 7 months ago

@GauravatGrowhut can you actually provide the specific query that wasn't working for you?

The query I tested that doesn't work involved filtering on an array type. The JSONPath expression that gave it problems was:

$.courses[?(@.courseTags.indexOf('Machine Learning') != -1)].courseName

GauravatGrowhut commented 7 months ago

Hi @nerdai, thank you for the reply. Here is how the query is built:

instructions = "Provide a list of five courses and all their details along with relevancy score. \
Filter the courses based on the student's interests. Then sort by students fees, and finally their qualifications. \
Rank these courses by their relevance, assigning each a score between 0.0 and 1.0. \
Present this information in a clear and understandable format."

with open("student.json") as student_info:
  student = json.load(student_info)

query = f"```\nStudent_Info:```{student}```\nInstruction: {instructions}"
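One detail worth noting: interpolating `student` directly into the f-string embeds Python's `repr` of the dict (single-quoted keys and strings), not JSON. If the model is meant to see JSON, `json.dumps` produces it. The sketch below uses a hypothetical `build_query` helper and a trimmed student record; whether this changes the JSONPath the LLM generates is not established in this thread.

```python
import json
from typing import Any, Dict


def build_query(student: Dict[str, Any], instructions: str) -> str:
    """Embed the student record as real JSON, followed by the instructions."""
    student_json = json.dumps(student, indent=2)
    return f"Student_Info:\n{student_json}\n\nInstruction: {instructions}"


# Trimmed stand-in for student.json.
student = {"UserID": "abcd-1234", "Qualifications": {"Interests": ["Cybersecurity"]}}
query = build_query(student, "Filter the courses based on the student's interests.")
print(query)
```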

student.json

{
  "UserID": "abcd-1234",
  "Demographic": {
    "Name": "John Doe",
    "Age": 22,
    "Address": {
      "city": "Nagpur",
      "state": "Maharashtra",
      "country": "India"
    },
    "UID": "IN-XYZ123"
  },
  "Qualifications": {
    "Work Experience": [
      {
        "Company": "XYZ",
        "Role": "Software Developer",
        "Duration": "1 year",
        "Description": "Worked on developing and maintaining web applications."
      }
    ],
    "Academics": [
      {
        "degree": "10th Standard",
        "board": "CBSE",
        "percentage": 85,
        "year_of_passing": 2017
      },
      {
        "degree": "12th Standard",
        "board": "CBSE",
        "percentage": 90,
        "year_of_passing": 2019
      },
      {
        "degree": "Bachelor of Technology",
        "major": "Computer Science",
        "university": "XYZ University",
        "GPA": 8.5,
        "year_of_passing": 2023
      }
    ],
    "Certifications": [{"Name": "AWS Cybersecurity Expert"}],
    "Interests":["Cybersecurity"],
    "Max_Fees": "$20,000",
    "Research Papers": [
      {
        "Title": "Machine Learning in Healthcare",
        "Published In": "XYZ Journal",
        "Year": 2023
      }
    ]
  }
}

GauravatGrowhut commented 7 months ago

It is also worth mentioning that the code given in the docs works for me, even after I replace the example dictionaries and queries with mine.

https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/query_engine/json_query_engine.ipynb

nerdai commented 7 months ago

It is also worth mentioning that the code given in the docs works for me, even after I replace the example dictionaries and queries with mine.

https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/query_engine/json_query_engine.ipynb

Oh, you mean that your query and your data work when you replace the example JSONs/queries with your own? So you only hit the bug when you aren't running in the notebook?

GauravatGrowhut commented 7 months ago

Partially true. The example notebook works with my data and a much shorter query ("Provide a List of Relevant Data Science Courses"). The same setup doesn't work in my other notebook with the complex query.

ArvinDevel commented 6 months ago

I think the failure happens because the JSONPath expression generated by the LLM is not guaranteed to be valid, so we can't fully rely on it to give the expected result.
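One way to act on this is to treat the generated expression as untrusted and catch the failure instead of letting it propagate. The sketch below is stdlib-only and assumes a processor callable with the same `(llm_output, json_value)` shape as `default_output_processor` in the traceback above. `with_fallback` and `strict_processor` are hypothetical names used for illustration, and wiring such a wrapper into `JSONQueryEngine` is not shown because it has not been verified in this thread.

```python
from typing import Any, Callable, Dict

Processor = Callable[[str, Any], Dict[str, Any]]


def with_fallback(processor: Processor) -> Processor:
    """Wrap `processor` so an invalid expression yields {} instead of raising."""

    def wrapped(llm_output: str, json_value: Any) -> Dict[str, Any]:
        try:
            return processor(llm_output, json_value)
        except ValueError as exc:
            # In this issue the failure surfaces as ValueError("Invalid JSON Path: ...").
            print(f"Skipping unusable expression: {exc}")
            return {}

    return wrapped


def strict_processor(llm_output: str, json_value: Any) -> Dict[str, Any]:
    """Stand-in for a real processor: rejects JavaScript-style includes() calls."""
    if ".includes(" in llm_output:
        raise ValueError(f"Invalid JSON Path: {llm_output}")
    return {llm_output: json_value}


safe = with_fallback(strict_processor)
bad = safe("$.courses[?(@.courseTags.includes('Computer Science'))]", {"courses": []})
good = safe("$.courses[*].courseName", {"courses": []})
print(bad)   # {}
print(good)  # {'$.courses[*].courseName': {'courses': []}}
```

Returning an empty result is only one possible policy; retrying the LLM call with the parser error included in the prompt would be another.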

bmaciag commented 6 months ago

Hi all, I have the same issue but this time with the official docs from here: https://docs.llamaindex.ai/en/latest/examples/query_engine/json_query_engine.html

When I run it in Colab, I end up with the error: ValueError: Invalid JSON Path: JSONPath: $.comments[?(@.username=='jerry')].content
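The message shows the text `JSONPath: ` inside the expression itself, which suggests (as an assumption, not something confirmed in this thread) that the LLM echoed a label in front of the path. A minimal, stdlib-only sketch of stripping such a label before the expression is parsed; `strip_jsonpath_label` is a hypothetical helper, not a LlamaIndex function:

```python
import re

# Matches an optional leading "JSONPath:" label (any casing) plus surrounding whitespace.
_LABEL = re.compile(r"^\s*jsonpath\s*:\s*", re.IGNORECASE)


def strip_jsonpath_label(llm_output: str) -> str:
    """Remove a leading 'JSONPath:' label and wrapping backticks from LLM output."""
    cleaned = _LABEL.sub("", llm_output.strip().strip("`"))
    return cleaned.strip()


cleaned = strip_jsonpath_label("JSONPath: $.comments[?(@.username=='jerry')].content")
print(cleaned)  # $.comments[?(@.username=='jerry')].content
```

Expressions that carry no label pass through unchanged, so the helper could be applied unconditionally ahead of whatever parser is in use.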