noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Hallucinations? How to set null #35

Closed codenamics closed 10 months ago

codenamics commented 10 months ago

Hello,

The package works pretty well, but can you tell me how to enforce null (None) when a value is missing? Example:

`from pydantic import BaseModel

class AnswerFormat(BaseModel):
    age: str
    height: str
    temperature: int
    systolic_blood_pressure: int
    diastolic_blood_pressure: int
    respiratory: int
    pulse: str
    is_preg: bool
    heart_rate: int`

text = """ 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. """

{ "age": "23", "height": "165", "temperature": 977, "systolic_blood_pressure": 120, "diastolic_blood_pressure": 80, "pulse": "80", "is_preg": true, "heart_rate": 80 }

As we can see, blood pressure is not mentioned in the prompt, but the LLM fills it in anyway, I think with a default value.

I tried Union[float, None] = None, but it was not respected.

Any ideas?

noamgat commented 10 months ago

I have also observed this behavior. This is (unfortunately) not the fault of lm-format-enforcer, but of the LLMs themselves. Most of the time, probably due to their training data, they think that a JSON field is required even if it's not in the required array.

If you look at the unit tests, you will see that the schema parser allows optional fields to be omitted. So it's the LLM's choice to include them, not the format enforcer's requirement.
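To make the point about the required array concrete, here is a minimal standard-library sketch. It is not lm-format-enforcer's code: the schema is hand-written for illustration and `missing_required_keys` is a hypothetical helper. It only shows that, under JSON Schema semantics, an object that omits non-required keys is just as acceptable as one that fills them in.

```python
import json

# Hand-written JSON Schema (illustrative): only "age" and "is_preg" are required.
schema = {
    "type": "object",
    "properties": {
        "age": {"type": "string"},
        "is_preg": {"type": "boolean"},
        "systolic_blood_pressure": {"type": "integer"},
        "diastolic_blood_pressure": {"type": "integer"},
    },
    "required": ["age", "is_preg"],
}


def missing_required_keys(schema: dict, document: str) -> list:
    """Return the required keys that are absent from a JSON object string."""
    obj = json.loads(document)
    return [key for key in schema.get("required", []) if key not in obj]


# Both documents satisfy the "required" constraint; the second one simply
# contains values the model chose to emit.
without_bp = '{"age": "23", "is_preg": true}'
with_bp = (
    '{"age": "23", "is_preg": true, '
    '"systolic_blood_pressure": 120, "diastolic_blood_pressure": 80}'
)

print(missing_required_keys(schema, without_bp))       # []
print(missing_required_keys(schema, with_bp))          # []
print(missing_required_keys(schema, '{"age": "23"}'))  # ['is_preg']
```

Since both `without_bp` and `with_bp` are allowed, a format enforcer constrained by this schema cannot by itself stop the model from choosing the second form.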

I reckon that the only true way to solve this, is to fine-tune the LLM with cases where the field is optional, the data is not present and it will eventually "learn" that optional fields are not required.

If you think this is a bug, try to create a unit test with a schema and a response that should be allowed (due to optional fields), but is not.

codenamics commented 10 months ago

OK.

It's also funny, because when I do some prompt engineering, e.g. "Please extract from {text} the following fields: sex, temperature, etc. Output in JSON format", it works as expected, even without the enforcer. How does LangChain's pydantic JSON extraction work? I know they use function calling.
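A prompt along those lines could be built as in the sketch below. The function name `build_extraction_prompt` and the exact wording are assumptions for illustration, not something from the thread or from any library. Whether a given model actually follows the null instruction is not guaranteed.

```python
def build_extraction_prompt(text: str, fields: list) -> str:
    """Build an extraction prompt that asks for null when a value is absent."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the text below: {field_list}.\n"
        "Output a single JSON object. If a field is not stated in the text, "
        "set its value to null instead of guessing.\n\n"
        f"Text:\n{text}"
    )


prompt = build_extraction_prompt(
    "Her temperature is 97.7°F (36.5°C), pulse is 80/min.",
    ["temperature", "pulse", "systolic_blood_pressure"],
)
print(prompt)
```

Such a prompt can be combined with a schema whose fields are nullable, so that the instruction and the allowed output format agree.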

codenamics commented 10 months ago

I will close it, but please take a look at it, as I think that in most cases null or None should be produced, because hallucinations are bad :) Maybe apply pydantic after the output?

noamgat commented 10 months ago

If pydantic verification fails, it is a bug. Please post exact schema and output string.

This is an online schema checker: https://www.jsonschemavalidator.net/ Anything that LMFE generates should pass it. If not, it's a bug.


ivsanro1 commented 5 months ago

I reckon that the only true way to solve this, is to fine-tune the LLM with cases where the field is optional, the data is not present and it will eventually "learn" that optional fields are not required.

@noamgat the problem with this is that if you ask the LLM to extract A, B, C, it will tend to output the whole JSON structure, i.e. it generates each key first, and only after the key does it decide whether to emit a value or null.

In other words, forcing the LLM to "look ahead" and skip the key entirely when the value cannot be found, which is the workaround using optional fields, is a harder task for LLMs in general, and it makes them more prone to hallucinations.

For this reason, it would still be great if lm-format-enforcer allowed null values even when fields are optional, or nullable fields with union types, e.g.:

{
  "year": {
    "type": ["number", "null"]
  }
}

noamgat commented 5 months ago

You can achieve this by typing a pydantic field as Optional[ActualType] instead of ActualType.
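As a sketch of what that looks like, assuming pydantic v2 (`model_json_schema` and `model_validate_json` are v2 method names; pydantic v1 uses different ones and generates a different schema), the model from the original question could be trimmed and retyped like this. The choice of which fields to make nullable is illustrative only.

```python
from typing import Optional

from pydantic import BaseModel


class AnswerFormat(BaseModel):
    age: str
    is_preg: bool
    # Nullable with a default: the key may be omitted or set to null.
    systolic_blood_pressure: Optional[int] = None
    diastolic_blood_pressure: Optional[int] = None
    # Nullable without a default: in pydantic v2 the key must be present,
    # but its value may be null.
    height: Optional[str]


schema = AnswerFormat.model_json_schema()
# Each Optional field becomes a union of its type and null.
print(schema["properties"]["systolic_blood_pressure"]["anyOf"])
# Only the fields without defaults are listed as required.
print(schema["required"])

# An output that uses null for the missing value passes validation.
raw = '{"age": "23", "is_preg": true, "height": null}'
parsed = AnswerFormat.model_validate_json(raw)
print(parsed.height, parsed.systolic_blood_pressure)
```

Leaving the default off, as with `height` here, keeps the key in the required list, so the model has to write the key and then choose between a value and null. This avoids the "look ahead" problem described in the previous comment.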
