lennartpollvogt / ollama-instructor

Python library for the instruction and reliable validation of structured outputs (JSON) of Large Language Models (LLMs) with Ollama and Pydantic. -> Deterministic work with LLMs.
MIT License

ValidationError when the input seems fine #6

Closed Tamaya31 closed 4 months ago

Tamaya31 commented 4 months ago

I am encountering a ValidationError that seems to stem from the output sometimes being wrapped in the parent class name. For example, with this model:

class Answer(BaseModel):
    a: str
    b: str

I'd get this output: {'Answer': {'a': 'A sentence', 'b': 'Another sentence'}}

Tamaya31 commented 4 months ago

It looks like I fixed it by changing pydantic_model.model_json_schema() to pydantic_model.model_json_schema().get('properties', {}). Let me know if you think that is a good fix.
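For context, here is what that change does to the schema of a minimal model (a sketch using the Answer model from above; field names are placeholders):

```python
from pydantic import BaseModel

class Answer(BaseModel):
    a: str
    b: str

# The full schema wraps the fields in metadata ('title', 'required', 'type', ...).
full_schema = Answer.model_json_schema()

# The proposed fix sends only the field definitions to the LLM.
properties_only = full_schema.get('properties', {})

print(full_schema)
print(properties_only)
```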

lennartpollvogt commented 4 months ago

Hi @Tamaya31

thank you for the issue and the PR. Unfortunately, I am on vacation for the next two weeks without my laptop 🙈 I would really like to look into it and run some tests before merging the PR. It would be good if this improves the default_system_prompt 😊, but I want to test it with different LLMs. I hope it is okay that this has to wait until I am back from vacation.

Maybe you could post the outputs of pydantic_model.model_json_schema() and pydantic_model.model_json_schema().get('properties', {}) for your example?

And one more question: which LLM(s) are you using? In most cases, writing your own system prompt for your specific use case and LLM gives you better results (my personal experience).

Tamaya31 commented 4 months ago

Hello @lennartpollvogt

You are welcome. No worries about the delay; for now I am using my fork with the fix, which works well.

To answer your questions: this is the model I am using that illustrates the issue:

class Answer(BaseModel):
    observation: str
    recommendation: str

With your original code, it returns this format: {'Answer': {'a': 'A sentence', 'b': 'Another sentence'}}. This then fails validation, since the Pydantic model does not expect the parent node, and I don't want the LLM to return it anyway.

With the fix I propose, it returns this: {'a': 'A sentence', 'b': 'Another sentence'}
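To make the failure concrete, here is a minimal reproduction (a sketch; the sentences are placeholders):

```python
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    observation: str
    recommendation: str

# What the LLM actually returned: the fields wrapped in the class name.
wrapped = {'Answer': {'observation': 'A sentence', 'recommendation': 'Another sentence'}}

try:
    Answer.model_validate(wrapped)
    raised = False
except ValidationError:
    # Fails: 'observation' and 'recommendation' are missing at the top level.
    raised = True

# Unwrapping the parent node makes validation pass.
answer = Answer.model_validate(wrapped['Answer'])
```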

I tested it against mistral:7b, but I can quickly test it against llama3 or any other model.

Let me know if that helps, I am happy to test for you while you are away if needed.

lennartpollvogt commented 4 months ago

Hi, sorry for the late response, and thank you very much for the clarification. I will test it when I am back, since there is one thing I would like to know: what happens to the docstring of the Pydantic model? I think it will not be covered. But if your approach works better across a greater number of LLMs, then it is worth it; in that case I will have to cover the docstring differently. 😊
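On the docstring question: Pydantic emits a model's docstring as the top-level 'description' key of the JSON schema, which a properties-only extraction would drop. A minimal check (the docstring text is a placeholder):

```python
from pydantic import BaseModel

class Answer(BaseModel):
    '''Answer to a user question.'''
    observation: str
    recommendation: str

schema = Answer.model_json_schema()
print(schema['description'])   # the docstring survives in the full schema...
print(schema['properties'])    # ...but is absent from the properties alone
```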

lennartpollvogt commented 3 months ago

Hi @Tamaya31 ,

I finally had time to investigate your proposal. Besides my guess about the missing "description", I found another issue that would lead to inaccurate responses: when extracting only the "properties" from the JSON schema, nested models lose their references ('$defs').

Here is an example:

from pydantic import BaseModel, ConfigDict
from enum import Enum
from typing import List
import rich

class Gender(Enum):
    '''
    Gender of the person.
    '''
    MALE = 'male'
    FEMALE = 'female'

class Person(BaseModel):
    '''
    This is a person.
    '''
    name: str
    age: int
    gender: Gender
    friends: List[str] = []

    model_config = ConfigDict(
        extra='forbid'
    )

rich.print(Person.model_json_schema())
print('###################################')
rich.print(Person.model_json_schema().get('properties'))

Output looks like this:

{
    '$defs': {
        'Gender': {
            'description': 'Gender of the person.',
            'enum': ['male', 'female'],
            'title': 'Gender',
            'type': 'string'
        }
    },
    'additionalProperties': False,
    'description': 'This is a person.',
    'properties': {
        'name': {'title': 'Name', 'type': 'string'},
        'age': {'title': 'Age', 'type': 'integer'},
        'gender': {'$ref': '#/$defs/Gender'},
        'friends': {'default': [], 'items': {'type': 'string'}, 'title': 'Friends', 'type': 'array'}
    },
    'required': ['name', 'age', 'gender'],
    'title': 'Person',
    'type': 'object'
}
###################################
{
    'name': {'title': 'Name', 'type': 'string'},
    'age': {'title': 'Age', 'type': 'integer'},
    'gender': {'$ref': '#/$defs/Gender'},
    'friends': {'default': [], 'items': {'type': 'string'}, 'title': 'Friends', 'type': 'array'}
}

This leads me to stick to the current default system prompt.

But besides your solution of using a fork, you could simply provide your own system prompt in the "messages" list. I mention this in the docs (see here).
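As a rough illustration of that workaround (the message format below is an assumption based on the usual Ollama-style chat messages, and the prompt wording is hypothetical; check the docs for the exact call):

```python
from pydantic import BaseModel

class Person(BaseModel):
    '''This is a person.'''
    name: str
    age: int

# Build a custom system prompt around the FULL schema,
# so '$defs' and the model description are preserved.
schema = Person.model_json_schema()
messages = [
    {
        'role': 'system',
        'content': f'Respond only with JSON matching this schema:\n{schema}',
    },
    # Hypothetical user message for illustration.
    {'role': 'user', 'content': 'Extract: Jason is 25 years old.'},
]
```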

I hope this helps.