ju-bezdek / langchain-decorators

syntactic sugar 🍭 for langchain
MIT License
228 stars 11 forks

Pydantic parsing error #4

Closed tcaminel-pro closed 1 year ago

tcaminel-pro commented 1 year ago

First, congrats for this library. I like it. However, I've found a very strange behaviour.

When running the code below, the first prompt works, but the second one raises an error:

  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TestOk
synomym
  field required (type=value_error.missing)

The error seems related to the key name 'synomyms'. Any other name seems OK.

from pydantic import BaseModel, Field
from langchain_decorators import llm_prompt

class TestOk(BaseModel):
    word: str = Field(description="provided word")
    synomym: list[str] = Field(description="synomyms or brand names")

class TestKo(BaseModel):
    word: str = Field(description="provided word")
    synomyms: list[str] = Field(description="synomyms or brand names")

@llm_prompt()
def test_ok(word: str) -> TestOk:
    """
    Please find synonyms for : {word}
    {FORMAT_INSTRUCTIONS}   """
    return  

@llm_prompt()
def test_ko(word: str) -> TestKo:
    """
    Please find synonyms for : {word}
    {FORMAT_INSTRUCTIONS}  """ 
    return  

print("#### FIRST TEST ####")
print(test_ok(word="petrol"))     # OK
print("#### SECOND TEST ####")
print(test_ko(word="petrol"))     # Pydantic parsing error
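To make the failure mode concrete, here is a schematic re-creation with plain dicts (not the actual pydantic call): the schema asks for the misspelled key `synomym`, but if the LLM answers with the corrected spelling `synonym`, strict validation reports that field as missing.

```python
# Hypothetical minimal re-creation of the failure, using plain dicts
# instead of pydantic. The schema expects the misspelled key "synomym";
# the LLM "helpfully" returns the corrected spelling "synonym".
expected_fields = {"word", "synomym"}  # fields declared on TestOk
llm_answer = {"word": "petrol", "synonym": ["gasoline", "fuel"]}

# Strict validation sees exactly one missing required field.
missing = expected_fields - llm_answer.keys()
print(missing)  # {'synomym'}
```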
ju-bezdek commented 1 year ago

Glad you like it...

This is funny indeed...

I tested the code you provided, and it seems that GPT is trying to fix the typo (synomym -> synonym), thus it won't parse.

For me, it failed on the first example already...

My advice would be to enable debug console logging so you can see what is going on there... although even if you have it enabled, you might miss it. I actually couldn't see it (n and m are so similar) and was baffled for a while too, since I saw that the field was generated :)

The easiest way to enable verbose mode is to set the env var "LANGCHAIN_DECORATORS_VERBOSE": "true" (I can't believe it's not documented here).
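For reference, a minimal way to set that flag from Python (assuming the library reads the environment variable at import time, as the name suggests, so it should be set before importing langchain_decorators):

```python
import os

# Set the verbose flag before importing langchain_decorators so the
# setting is picked up; "true" is the value mentioned above.
os.environ["LANGCHAIN_DECORATORS_VERBOSE"] = "true"
```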

Alternatively, you can also try PromptWatch which is natively supported for tracing all details.

ju-bezdek commented 1 year ago

Hi, can we close this? Did fixing the typo work for you?

tcaminel-pro commented 1 year ago

Yes, it works! Thanks for the help. I've added "use a spell checker" as a recommendation for prompt engineering for my team...

tcaminel-pro commented 1 year ago

Hi,

Not a bug but a suggestion: sometimes it's the LLM that creates typos in JSON keys. I saw that with Llama-2. And sometimes it 'corrects' the key name, as we saw before.

To handle these cases, I've hacked 'align_fields_with_model' with a fuzzy match:

from fuzzywuzzy import process

def align_fields_with_model(data: dict, model: Type[BaseModel]) -> dict:
        ....
        elif field_info.field_info.alias.lower() in data:
            value = data[field_info.field_info.alias.lower()]
        else:
            value = correct_typo_in_key(field, data)  # <==== Hack added by TC
        if not data_with_compressed_keys:
            ....

def correct_typo_in_key(field: str, data: dict):
    """Try to correct an incorrect key returned by the LLM by using a fuzzy match with expected schema"""
    spurious_key, score = process.extractOne(field, data.keys())  
    return data[spurious_key] if score >= 80 else None
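For anyone who would rather avoid the fuzzywuzzy dependency, roughly the same fuzzy key lookup can be sketched with the stdlib's difflib (the 0.8 cutoff mirrors the score >= 80 threshold above; function name kept for illustration):

```python
import difflib

def correct_typo_in_key(field: str, data: dict):
    """Stdlib variant of the fuzzy-match hack above: find the data key
    closest to the expected field name and return its value, else None."""
    matches = difflib.get_close_matches(field, data.keys(), n=1, cutoff=0.8)
    return data[matches[0]] if matches else None
```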
ju-bezdek commented 1 year ago

Hey, I love that... Feel free to open PR, I'd accept this

tcaminel-pro commented 1 year ago

My code base has diverged quite a bit from yours, so opening a PR is not convenient.

BTW, I have found a nice way to find a JSON object anywhere in the LLM answer, using recursive regex. Here is my code:

import regex
def json_finder(text: str) -> str:
    text = text.strip() + "}"  # add an extra } in case it's missing
    pattern = regex.compile(r"\{(?:[^{}]|(?R))*\}")  # recursive regexp
    r = pattern.findall(text)
    if (count := len(r)) != 1:
        raise OutputParserException(f"No or multiple JSON found ({count})", llm_output=text)
    return r[0]
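As an aside for readers without the third-party regex module (the stdlib re has no (?R) recursion), the same "find a JSON object anywhere in the answer" behaviour can be sketched stdlib-only with json.JSONDecoder.raw_decode, which parses from a given index and reports where it stopped:

```python
import json

def find_json_object(text: str) -> dict:
    """Stdlib-only sketch: scan the text for the first parseable JSON
    object. raw_decode(s, idx) parses starting at idx and ignores any
    trailing chatter after the object."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue  # only attempt a parse at a plausible object start
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    raise ValueError("No JSON object found in LLM output")
```

Unlike the recursive-regex version, this also validates that the braces actually enclose well-formed JSON, not just balanced brackets.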
ju-bezdek commented 1 year ago

Why is it better than the native JsonOutputParser?

All you need to do is annotate the output with ->dict return type, it will be automatically resolved

But you can use any outputparser you wish... Either from LangChain, or to build your own (just follow standard LangChain practice)

tcaminel-pro commented 1 year ago

I use that code in the PydanticOutputParser, in place of this simpler regexp:

           regex_pattern = r"\[.*\]" if self.as_list else r"\{.*\}"
           match = re.search(regex_pattern, text.strip(),re.MULTILINE | re.IGNORECASE | re.DOTALL)
             ...

The ability to find the JSON anywhere in the LLM answer is useful - typically with Llama-2, which has the bad habit of "explaining" its output with extra chatter.