mangiucugna / json_repair

A python module to repair invalid JSON, commonly used to parse the output of LLMs
https://pypi.org/project/json-repair/
MIT License

[Bug]: Fails to accurately capture value with missing opening quote if a comma comes before closing quote #53

Closed ahhardin closed 4 months ago

ahhardin commented 4 months ago

Version of the library

0.23.1

Describe the bug

When parsing broken JSON that looks like this:

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Syncing with Company",
    "Description": The conversation focused on how this company's release management system integrates with ours, providing a streamlined workflow for documentation approval, unlike Jim.",
    "What the Prospect said": "John was interested in understanding how the release flow works and how it can be used to approve documentation and drawings directly in the product.",
    "Seller Response": "Gene explained that the configuration allows the release flow to start from the other product and push information to ours, enabling a wider team to approve documentation without needing direct access to our product.",
    "Quote": "Okay. So configuration is done right now."
  },
  {
    "Snippet Summary Id": 2,
    "Overview": "Assigning Part Numbers",
    "Description": "The discussion covered the capability of this product to assign part numbers to CAD data, a feature that might differentiate Our product from theirs.",
    "What the Prospect said": "Eve was looking at the part table and seemed curious about how part numbers could be assigned and mapped to categories in our product.",
    "Seller Response": "Gene demonstrated how part numbers could be assigned to CAD data through our product and mapped to various categories, showcasing the product's flexibility.",
    "Quote": "One of the options is that you can ask the product to assign per numbers to your CAD data."
  }
]

The missing opening quote after "Description": is repaired, but instead of closing the string at the existing closing quote, the package inserts a new closing quote at the first comma it finds, resulting in this:

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Syncing with Company",
    "Description": "The conversation focused on how this company's release management system integrates with ours",
    "Jim.": "What the Prospect said\": \"John was interested in understanding how the release flow works and how it can be used to approve documentation and drawings directly in the product.",
    "Seller Response": "Gene explained that the configuration allows the release flow to start from the other product and push information to ours, enabling a wider team to approve documentation without needing direct access to our product.",
    "Quote": "Okay. So configuration is done right now."
  },
  {
    "Snippet Summary Id": 2,
    "Overview": "Assigning Part Numbers",
    "Description": "The discussion covered the capability of this product to assign part numbers to CAD data, a feature that might differentiate Our product from theirs.",
    "What the Prospect said": "Eve was looking at the part table and seemed curious about how part numbers could be assigned and mapped to categories in our product.",
    "Seller Response": "Gene demonstrated how part numbers could be assigned to CAD data through our product and mapped to various categories, showcasing the product's flexibility.",
    "Quote": "One of the options is that you can ask the product to assign per numbers to your CAD data."
  }
]
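
To make the difference concrete, here is a toy sketch of the two strategies for ending a value whose opening quote is missing. This is not json_repair's actual code; the function names `close_at_first_comma` and `close_at_existing_quote` are hypothetical, and the sample value is a shortened version of the "Description" line.

```python
import json


def close_at_first_comma(raw: str) -> str:
    """Naive strategy: treat the first comma as the end of the unquoted value."""
    return raw.split(",", 1)[0].strip()


def close_at_existing_quote(raw: str) -> str:
    """Prefer a closing quote that is already present at the end of the value."""
    text = raw.strip()
    if text.endswith(","):  # drop the separator between key-value pairs
        text = text[:-1].rstrip()
    if text.endswith('"'):  # an existing closing quote marks the real end
        return text[:-1]
    return close_at_first_comma(raw)  # no closing quote: fall back to the naive guess


# Everything after `"Description": ` on the broken line (shortened, hypothetical sample).
raw_value = 'The system integrates with ours, providing a streamlined workflow, unlike Jim.",'

print(json.dumps(close_at_first_comma(raw_value)))     # truncated at the first comma
print(json.dumps(close_at_existing_quote(raw_value)))  # keeps the whole sentence
```

The first function reproduces the truncation reported here; the second keeps the full sentence, which is the expected behavior. A real repair routine has to handle many more cases (escaped quotes, nested structures, values spanning lines), so treat this only as an illustration of the idea.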

How to reproduce

from json_repair import repair_json

string = """
[
  {
    "Snippet Summary Id": 1,
    "Overview": "Syncing with Company",
    "Description": The conversation focused on how this company's release management system integrates with ours, providing a streamlined workflow for documentation approval, unlike Jim.",
    "What the Prospect said": "John was interested in understanding how the release flow works and how it can be used to approve documentation and drawings directly in the product.",
    "Seller Response": "Gene explained that the configuration allows the release flow to start from the other product and push information to ours, enabling a wider team to approve documentation without needing direct access to our product.",
    "Quote": "Okay. So configuration is done right now."
  },
  {
    "Snippet Summary Id": 2,
    "Overview": "Assigning Part Numbers",
    "Description": "The discussion covered the capability of this product to assign part numbers to CAD data, a feature that might differentiate Our product from theirs.",
    "What the Prospect said": "Eve was looking at the part table and seemed curious about how part numbers could be assigned and mapped to categories in our product.",
    "Seller Response": "Gene demonstrated how part numbers could be assigned to CAD data through our product and mapped to various categories, showcasing the product's flexibility.",
    "Quote": "One of the options is that you can ask the product to assign per numbers to your CAD data."
  }
]
"""
repair_json(string, return_objects=True)

Expected behavior

I'd expect this:

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Syncing with Company",
    "Description": "The conversation focused on how this company's release management system integrates with ours, providing a streamlined workflow for documentation approval, unlike Jim.",
    "What the Prospect said": "John was interested in understanding how the release flow works and how it can be used to approve documentation and drawings directly in the product.",
    "Seller Response": "Gene explained that the configuration allows the release flow to start from the other product and push information to ours, enabling a wider team to approve documentation without needing direct access to our product.",
    "Quote": "Okay. So configuration is done right now."
  },
  {
    "Snippet Summary Id": 2,
    "Overview": "Assigning Part Numbers",
    "Description": "The discussion covered the capability of this product to assign part numbers to CAD data, a feature that might differentiate Our product from theirs.",
    "What the Prospect said": "Eve was looking at the part table and seemed curious about how part numbers could be assigned and mapped to categories in our product.",
    "Seller Response": "Gene demonstrated how part numbers could be assigned to CAD data through our product and mapped to various categories, showcasing the product's flexibility.",
    "Quote": "One of the options is that you can ask the product to assign per numbers to your CAD data."
  }
]

Overall this is an awesome tool!! It's handled everything else I've thrown at it perfectly.

mangiucugna commented 4 months ago

Thanks for reporting the issue. Give 0.24.0 a spin and let me know if it works.

ahhardin commented 4 months ago

Oh wow! First of all, you are an absolute legend for this! It works perfectly for that case. Now that this issue is fixed, I found one more 😅. I can put it here, but let me know if you'd prefer a new issue and I'll open one (I don't want to be greedy; I'm already ecstatic that you fixed the first one):

I also get errors when a value has no opening or closing quote but does contain commas (though, sadly, there are none between the key-value pairs):

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Transition from company 1 to company 2",
    "Description": The conversation touches on a customer who moved from company 1 to company 2 and encountered some challenges with the employment of record services. This highlights a particular area of scrutiny from the prospect's perspective.
    "What the Prospect said": Marie mentions a customer who transitioned from company 1 to company 2. This customer praised company 2 but raised an issue concerning the employment of record services, specifically related to contractual obligations and compensation adjustments.
    "Seller Response": Alex acknowledges the comment and prepares to proceed with the meeting by introducing more participants into the room.
    "Quote": "They came over from company 1... she dropped a flag on me."
  }
]

parses into:

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Transition from company 1 to company 2",
    "Description": "The conversation touches on a customer who moved from company 1 to company 2 and encountered some challenges with the employment of record services. This highlights a particular area of scrutiny from the prospect's perspective.\n    \"What the Prospect said",
    "Seller Response": "Alex acknowledges the comment and prepares to proceed with the meeting by introducing more participants into the room.\n    \"Quote\": \"They came over from company 1... she dropped a flag on me."
  }
]

but I want:

[
  {
    "Snippet Summary Id": 1,
    "Overview": "Transition from company 1 to company 2",
    "Description": "The conversation touches on a customer who moved from company 1 to company 2 and encountered some challenges with the employment of record services. This highlights a particular area of scrutiny from the prospect's perspective.",
    "What the Prospect said": "Marie mentions a customer who transitioned from company 1 to company 2. This customer praised company 2 but raised an issue concerning the employment of record services, specifically related to contractual obligations and compensation adjustments.",
    "Seller Response": "Alex acknowledges the comment and prepares to proceed with the meeting by introducing more participants into the room.",
    "Quote": "They came over from company 1... she dropped a flag on me."
  }
]

There are also no commas between the key-value pairs here, so the whole thing is kind of a mess.
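
For input shaped like this, one possible workaround is a line-based fallback. The sketch below is not how json_repair handles it: the helper `parse_loose_object` is hypothetical, and it assumes a single flat object with exactly one key-value pair per line and properly quoted keys. The sample input is shortened.

```python
import json
import re

# Matches a line of the form `"key": rest-of-line`.
PAIR = re.compile(r'^\s*"(?P<key>[^"]+)"\s*:\s*(?P<value>.*?)\s*$')


def parse_loose_object(lines):
    """Parse one flat object line by line, tolerating unquoted values and missing commas."""
    result = {}
    for line in lines:
        match = PAIR.match(line)
        if match is None:
            continue  # braces, brackets, and blank lines
        value = match.group("value").rstrip(",").rstrip()
        try:
            # Numbers and properly quoted strings are already valid JSON.
            result[match.group("key")] = json.loads(value)
        except json.JSONDecodeError:
            # Otherwise take the rest of the line as the string, dropping stray quotes.
            result[match.group("key")] = value.strip('"')
    return result


broken = """[
  {
    "Snippet Summary Id": 1,
    "Overview": "Transition from company 1 to company 2",
    "Description": The customer moved from company 1 to company 2, and hit some challenges.
    "Seller Response": Alex acknowledges the comment.
    "Quote": "They came over from company 1... she dropped a flag on me."
  }
]"""

parsed = parse_loose_object(broken.splitlines())
print(json.dumps(parsed, indent=2))
```

Because the value is taken to run to the end of its line, commas inside it are harmless, and the missing separators between pairs no longer matter. The trade-off is that values spanning multiple lines, nested objects, or several objects in one array would need extra handling.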

Thanks again for the 0.24.0 fix, you're incredible!!

mangiucugna commented 4 months ago

No worries about opening a new issue. I have released 0.25.0, which should fix this case more generally.

ahhardin commented 4 months ago

Works perfectly 🏆 thank you so much!!