Structured Output Parser Always Gives Error

vikasr111 commented 10 months ago

System Info

langchain==0.0.330 openai==0.28.1 python==3.9.17

Who can help?

@hwchase17 @agola11

Information

[X] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[X] Output Parsers
[ ] Document Loaders
[ ] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

I have written a simple structured output parser, which I am using to extract useful data from document text. Here's my code:

import os
import logging
from dotenv import load_dotenv
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

load_dotenv()

response_schemas = [
    ResponseSchema(
        name="document_type", description="Type of document, typically found on top"
    ),
    ResponseSchema(name="shipper", description="Shipper name found in the data"),
    ResponseSchema(name="consignee", description="Consignee name found in the data"),
    ResponseSchema(name="point_of_origin", description="Point of origin in the data"),
    ResponseSchema(
        name="customer_order_number", description="Customer order number in the data"
    ),
    ResponseSchema(
        name="order_number", description="Order number mentioned in the data"
    ),
    ResponseSchema(
        name="bill_of_lading",
        description="Bill of lading number(B/L number) found in the data",
    ),
    ResponseSchema(
        name="carrier_name", description="Carrier name mentioned in the data"
    ),
    ResponseSchema(
        name="required_ship_date", description="Required ship date in date format"
    ),
    ResponseSchema(
        name="shipped_date", description="Shipped date, typically separated by /"
    ),
    ResponseSchema(
        name="transportation_mode", description="Transportation mode such as truck etc."
    ),
    ResponseSchema(
        name="vehicle_number", description="Vehicle number found in the data"
    ),
    ResponseSchema(name="routing_info", description="Routing info found in the data"),
    ResponseSchema(
        name="invoice_to_buyer", description="Invoice to buyer data found in the data"
    ),
    ResponseSchema(
        name="consignee_number", description="Consignee number mentioned in the data"
    ),
    ResponseSchema(
        name="net_weight",
        description="Net weight found in the data, typically found on the second page. It's a number succeeded by weight symbol such as kg/lb/1b/15/16 and ends with NT (Net weight).",
    ),
    ResponseSchema(name="ticket_number", description="Ticket number found in the data"),
    ResponseSchema(name="outbound_date", description="Outbound date found in the data"),
]

system_prompt = """
Following is the data extracted from a document through OCR wrapped inside <ocr_data> delimiters. It may be unstructured and unorganized, and you'll help me extract key information from this data. The data can be nuanced, and a field and its respective value may be at different positions. The presented data can span multiple pages, separated by (------). Analyze the OCR data below and give me the values of the given fields. If you can't find the values in the OCR data, simply return 'N/A'.
"""

class BolAgent:
    def __init__(self):
        self.openai_api_key = os.getenv("OPENAI_API_KEY")
        self.llm = OpenAI(
            openai_api_key=self.openai_api_key, temperature=0.1, max_tokens=1000
        )
        self.chat_model = ChatOpenAI(
            model="gpt-3.5-turbo-16k",
            openai_api_key=self.openai_api_key,
            temperature=0,
        )
        self.response_schemas = response_schemas
        self.system_prompt = system_prompt

    def extract_paramerts(
        self,
        ocr_data,
    ):
        output_parser = StructuredOutputParser.from_response_schemas(
            self.response_schemas
        )
        input_data = f"<ocr_data>\n{ocr_data}\n</ocr_data>"

        format_instructions = output_parser.get_format_instructions()

        prompt = ChatPromptTemplate(
            messages=[
                HumanMessagePromptTemplate.from_template(
                    "{system_prompt}\n\n{format_instructions}\n\n{input_data}"
                )
            ],
            input_variables=["system_prompt", "input_data"],
            partial_variables={"format_instructions": format_instructions},
        )

        llm_input = prompt.format_prompt(
            system_prompt=system_prompt, input_data=input_data
        )
        logging.info(f"LLM Input: {llm_input}")

        output = self.chat_model(llm_input.to_messages())
        logging.info(f"LLM Output: {output}")
        result = output_parser.parse(output.content)
        return result

When I run this code on any input data, the output parser raises an error most of the time. Here's a sample input:

\n------------------\nSTRAIGHT BILL OF LADING - SHORT FORM\nTEST\nCHEMTRADE\nFICHE D\'EXPEDITION - FORMULE REGULIERE\nS4D\nSHIPPER/EXPEDITEUR\nChemtrade West Limited Partnership\nTIME IN/ARRIVEE\nGROSS/BRUT\nCONSIGNEE/DESTINATAIRE\nSASK POWER\n TARE\nSHIP TO/EXPEDIEZ A\nCORY \nCOGENERATION STATION\nTIME OUT/DEPART\n8 KM W OF SASKATOON HWY 7\nNET\nVANSCOY SOK 1VO SK CA\nPOINT OF ORIGIN/POINT D\'EXPEDITION\nCUSTOMER ORDER NO./N DE COMMANDE DU CLIENT\nORDER NO./N DE COMM.\n3/L NO./NDE CONN.\nCHEMTRADE (SASKATOON)\nS\n1856\n80001877\n CARRIER NAME/NOM DU TRANSPORTEUR\nREQUIRED SHIP DATE/DATE EXP.DEM.\nDATE SHIPPED/EXPEDIE LE\nCARON TRANSPORT LTD\nNov 06,2023\nTRANSPORTATION MODE/MODE DE TRANSPORT\nVEHICLE T/C NO. - MARQUE DU WAGON\nTruck\n UNIVAR CANADA LTD.\n ROUTING/ITINERAIRE\nCONSIGNEE#/CONSIGNATAIRE\nPAGE\n600929\n1 of\n3\nNO.AND DESCRIPTION OF PACKS\nD.G.\nDESCRIPTION OF ARTICLES AND SPECIAL MARKS\nNET WEIGHT KG\nNBRE ET DESCRIPTION DE COLIS\nDESCRIPTION DES ARTICLES ET INDICATIONS SPECIALS\nPOIDS NET\n1 TT\nX\nUN1830, SULFURIC ACID, 8, PG II\n21.000 Tonne\nSULFURIC ACID 93%\nER GUIDE #137\n4 PLACARDS REQUIRED; CLASS 8, CORROSIVE\nSTCC 4930040\nSulfuric Acid 93%\nCOA W/ SHIPMENT\nDELIVERY HOURS: 8:OOAM-1: OOPM MON-THURS\nATTENDANCE DURING OFFLOAD REQUIRED\nSAFETY GOGGLES, FACE SHIELD, GLOVES, BOOTS, HARD\nHAT, STEEL TOED SHOES, PROTECTIVE SUIT\n3" QUICK CONNECT CAMLOCK; 1 HOSE REQUIRED\nPersonal Protective Equipment: Gloves. Protective clothing. Protective goggles. 
Face shield.\nnsufficient ventilation: wear respiratory protection.\nERP 2-1564 and Chemtrade Logistics 24-Hour Number >>\n1-866-416-4404\nPIU 2-1564 et Chemtrade Logistics Numero de 24 heures >>\n1-866-416-4404\nConsignor / Expediteur:\nLocation / Endroit:\nCHEMTRADE WEST LIMITED PARTNERSHIP\n11TH STREET WEST\nI hereby declare that the contents of this consignment are fully and accurately described above by the proper shipping\nSASKATOON SK CA\nare in all respects in proper condition for transport according to the Transportation of Dangerous Goods Regulations.\nS7K 4C8\nPer/Par:Michael Rumble, EHS Director, Risk Management\nIF CHARGES ARE TO BE PREPAID, WRITE OR STAMP\nJe declare que le contenu de ce chargement est decrit ci-dessus de faconcomplete et exacte par Iappellation reglementaire\nINDIQUER ICI SI L\'ENVOI SE FAIT EN "PORT-PAYE"\negards bien conditionne pouretre transporte conformement au Reglement sur le transport des marchandises dangereuses.\nPrepaid\nFORWARD INVOICE FOR PREPAID FREIGHT\nChemtrade West Limited Partnership\nQUOTING OUR B/L NO.TO:\n155 Gordon\nBaker Rd #300\nWeight Agreement\nFAIRE SUIVRE FACTURE POUR EXPEDITION PORT\nToronto,\nOnt.\nM2H 3N5\nPAYE EN REFERANT A NOTRE NUMERO DE CONN.A:\nSHIPPER\nChemtrade West Limited\nAGENT\nCONSIGNEE.\nEXPEDITEUR\nPartnership\nDESTINATAIRE\nPER\nPERMANENT POST OFFICE ADDRESS OF SHIPPER\nPER\nPER\nPAR\n(ADRESSE POSTALE PERMANENTE DE L\'EXPEDITEUR)\nTHESE PRODUCTS ARE SOLD AND SHIPPED IN ACCORDANCE WITH\nTHE TERMS OF SALES ON THE REVERSE SIDE OF THIS,DOCUMENT.\nResponsible Care\nCES PRODUITS SONT VENDUS ET EXPEDIES CONFORMEMENTAUX\nCONDITIONS DE VENTE APPARAISSANT AU VERSO DE LA PRESENTE\nOur commitment to sustainability.\nS4D PRASRNov 06,2023 1618

Upon further debugging, I found that for some reason the output has two sets of triple backticks at the end, and because of this the Structured Output Parser raises the error. Here is the output for clarity (notice the end of the output):

content='```json\n{\n\t"document_type": "STRAIGHT BILL OF LADING - SHORT FORM",\n\t"shipper": "Chemtrade West Limited Partnership",\n\t"consignee": "SASK POWER",\n\t"point_of_origin": "VANSCOY SOK 1VO SK CA",\n\t"customer_order_number": "80001877",\n\t"order_number": "1856",\n\t"bill_of_lading": "600929",\n\t"carrier_name": "CARON TRANSPORT LTD",\n\t"required_ship_date": "Nov 06,2023",\n\t"shipped_date": "Nov 06,2023",\n\t"transportation_mode": "Truck",\n\t"vehicle_number": "T/C NO.",\n\t"routing_info": "UNIVAR CANADA LTD.",\n\t"invoice_to_buyer": "Chemtrade West Limited Partnership",\n\t"consignee_number": "600929",\n\t"net_weight": "21.000 Tonne",\n\t"ticket_number": "N/A",\n\t"outbound_date": "N/A"\n}\n```\n```'

I started noticing this error much more frequently after OpenAI Dev Day. Any idea what I might be doing wrong?

Expected behavior

The output should have only one set of triple backticks at the end, and the output parser should parse it properly.
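As a stopgap on the caller's side, one option is to skip the Markdown handling and decode the first JSON object in the reply directly. The sketch below is a minimal illustration using only the standard library. `parse_first_json_object` is a hypothetical helper, not part of LangChain, and it assumes the reply contains one top-level JSON object; it does not check for the expected keys the way the structured parser does.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # a Markdown code fence, built programmatically for readability


def parse_first_json_object(text: str) -> Dict[str, Any]:
    """Decode the first JSON object in `text`, ignoring anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode stops at the end of the first complete JSON value, so any
    # trailing fences are never handed to the decoder.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Shortened version of the reply from this report, ending in two fences.
raw = (
    FENCE + "json\n"
    + '{\n\t"document_type": "STRAIGHT BILL OF LADING - SHORT FORM",'
    + '\n\t"ticket_number": "N/A"\n}\n'
    + FENCE + "\n" + FENCE
)
print(parse_first_json_object(raw))
```

In the code above, this would replace the call to `output_parser.parse(output.content)`, at the cost of losing the key validation.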

dosubot[bot] commented 10 months ago

🤖

Hello @vikasr111! I'm here to help you with your coding issues. As a bot, I'll do my best to assist you while we wait for a human maintainer to join us. I'm currently reviewing your issue and will be back with a full answer shortly.

dosubot[bot] commented 10 months ago

🤖

Thank you for providing detailed information about the issue you're facing. It seems like the problem is related to the output format of the language model, which is causing the StructuredOutputParser to fail.

The StructuredOutputParser expects the output to be a Markdown code snippet formatted in a specific schema. This schema is defined by the response_schemas attribute of the StructuredOutputParser class, which is a list of ResponseSchema objects. Each ResponseSchema object has a name, description, and type. The output should include the leading "```json" and the trailing "```", and the content should be a JSON object where each key-value pair corresponds to a ResponseSchema object.

From the output you provided, it seems like there are two sets of triple backticks at the end of the output, which is causing the parser to fail. This is likely due to the language model generating an extra set of backticks.

There are a few similar issues that have been solved in the LangChain repository:

These issues suggest that the problem might be related to the non-greedy matching of the regular expression in the file libs/langchain/langchain/output_parsers/json.py. The proposed solution is to change the regular expression to use greedy matching by replacing .*? with .*. Additionally, it's suggested to add a test case in test_json.py unit test to cover the scenario of an answer containing a code block.
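To make the greedy versus non-greedy distinction concrete, here is a small standard-library sketch. The two patterns are illustrative assumptions and may not match the exact expression in json.py. Greedy matching helps when a JSON value itself contains a fenced code block, but if the pattern runs to the last fence it also swallows a duplicated closing fence like the one in this report.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence

# Illustrative patterns only; the expression used inside LangChain may differ.
GREEDY = re.compile(FENCE + r"(?:json)?(.*)" + FENCE, re.DOTALL)
NON_GREEDY = re.compile(FENCE + r"(?:json)?(.*?)" + FENCE, re.DOTALL)

# A reply that ends with two closing fences, as in this report.
output = (
    FENCE + "json\n"
    + '{"shipper": "Chemtrade West Limited Partnership"}\n'
    + FENCE + "\n" + FENCE
)

greedy_body = GREEDY.search(output).group(1)    # runs up to the LAST fence
lazy_body = NON_GREEDY.search(output).group(1)  # stops at the FIRST closing fence


def is_valid_json(candidate: str) -> bool:
    """Return True if `candidate` decodes as JSON."""
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False


print("greedy body parses:", is_valid_json(greedy_body))
print("non-greedy body parses:", is_valid_json(lazy_body))
```

The greedy capture keeps the first closing fence inside the body, so the JSON decoder sees extra data after the object; the non-greedy capture ends at that fence and decodes cleanly.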

Another potential solution is to modify the parse method to check if the triple backticks are present in the text before attempting to split it. If they are not present, it directly proceeds to the AgentFinish block.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

vikasr111 commented 10 months ago

@dosu-bot Thanks for the analysis, but that doesn't solve my problem. Since I am using the standard output parser from LangChain, I am not sure how I can address this issue.

dosubot[bot] commented 10 months ago

🤖

Based on the context provided, it seems like the issue you're facing is related to the output parser in the LangChain framework not being able to handle the case where the output ends with two triple-backticks. This is likely causing the Structured Output Parser to throw an error.

There have been similar issues reported in the LangChain repository, such as "Chat agent does not parse properly when model returns a code block", "Conversational Chat Agent: json.decoder.JSONDecodeError", and "LLM output parsing error for answers including code sections". These issues suggest that the problem might be due to the way the output parser is handling the model's output.

To address this issue, you could modify the parse method in your StructuredOutputParser class to handle the case where the output ends with two triple-backticks. Here's an example of how you can do this:

def parse(self, text: str) -> Any:
    """Returns the input text with no changes."""
    # Check if the text ends with two triple-backticks
    if text.endswith("```"*2):
        # Remove the two triple-backticks from the end of the text
        text = text[:-6]
    expected_keys = [rs.name for rs in self.response_schemas]
    return parse_and_check_json_markdown(text, expected_keys)

This code checks if the text ends with two triple-backticks. If it does, it removes them from the end of the text before returning it. This should prevent the output parser from failing when it encounters this case.

Please note that this is a simple solution and might not cover all possible edge cases. Depending on the specifics of your application, you might need to implement a more sophisticated solution.
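One edge case worth spelling out: in the reported output the two fences are separated by a newline, so a check for six consecutive backticks would not match it. A more tolerant variant, sketched below under the assumption that the only defect is a run of repeated closing fences, collapses that run to a single fence before the text is handed to the parser. `collapse_trailing_fences` is a hypothetical helper name, not a LangChain function.

```python
import re

FENCE = "`" * 3  # a Markdown code fence

# One or more fences at the very end of the text, with optional whitespace
# (including newlines) before, between, and after them.
_TRAILING_FENCES = re.compile(r"(?:\s*" + FENCE + r")+\s*\Z")


def collapse_trailing_fences(text: str) -> str:
    """Replace any run of trailing fences with exactly one closing fence."""
    return _TRAILING_FENCES.sub("\n" + FENCE, text)


# Shortened reply in the same shape as the one in this report.
sample = FENCE + 'json\n{"net_weight": "21.000 Tonne"}\n' + FENCE + "\n" + FENCE

# The fences are separated by "\n", so the six-backtick check misses them.
print(sample.endswith(FENCE * 2))

cleaned = collapse_trailing_fences(sample)
print(cleaned.count(FENCE))  # opening fence plus one closing fence
```

The cleaned string could then be passed to the parser's `parse` method in place of the raw model output.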

As for the changes in the latest version (v0.0.332) of LangChain that could potentially affect the output parser, I wasn't able to find any specific information in the repository. It's possible that the changes were made elsewhere or I might have missed them. If you have more information about the changes made in this version, it could help in diagnosing the issue.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


vikasr111 commented 10 months ago

This is not the right solution at all. If you look closely, the output from the LLM ends with "```\n```". I'd appreciate some help from a human. I guess AI is still not there after all :D

dosubot[bot] commented 10 months ago

@baskaryan, I could use your expertise on this one.

dosubot[bot] commented 7 months ago

Hi, @vikasr111,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you raised an issue regarding errors in the structured output parser of the langchain library, caused by two sets of triple backticks at the end of the output. This has been observed more frequently since OpenAI Dev Day. The proposed solutions did not solve the problem, so the issue remains unresolved.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days. Thank you for your understanding and cooperation.
