Heading: Switch feedback score extraction from `PATTERN_INTEGER` to `PATTERN_NUMBER` to handle decimal scores
While benchmarking various feedback providers, I noticed that some models (e.g., a fine-tuned mixtral-8x7b) tend to return `10.0` instead of `10` in their feedback scores before normalization. In the current implementation, `PATTERN_INTEGER` extracts both `0` and `10` from `10.0` and eventually picks the lesser value.
Failing example when testing groundedness feedback functions, where the score accompanying the CoT reasoning was interpreted as `0` instead of the expected `10`:
```
0.0,
{'reasons': 'STATEMENT 0:\nCriteria: I am a man and I love fish,\nSupporting Evidence: The source states "All men love fish", and the statement contains "I am a man and I love fish". The source contains the information that a man loves fish, and the statement contains the information that the speaker is a man and loves fish.\nScore: 10.0\n'}
```
I'm switching to `PATTERN_NUMBER` to unblock for now.
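For illustration, a minimal sketch of the failure mode — note that `PATTERN_INTEGER` and `PATTERN_NUMBER` below are hypothetical stand-ins (a bare integer regex vs. one that allows a decimal part), not the actual definitions from the codebase:

```python
import re

# Hypothetical stand-ins for the library's patterns (assumptions, not the
# real source): an integer-only pattern vs. one allowing a decimal part.
PATTERN_INTEGER = re.compile(r"\d+")
PATTERN_NUMBER = re.compile(r"\d+(?:\.\d+)?")

text = "Score: 10.0"

# The integer pattern splits "10.0" into two matches, so taking the
# lesser value yields 0 instead of 10.
print(PATTERN_INTEGER.findall(text))                        # ['10', '0']
print(min(int(m) for m in PATTERN_INTEGER.findall(text)))   # 0

# The number pattern captures the full decimal, so the score survives.
print(PATTERN_NUMBER.findall(text))                         # ['10.0']
```

This is why a lone decimal score in the CoT output collapses to `0` under integer-only extraction.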
Other details that are good to know but need not be announced:
This is only a stopgap solution, and I might be missing some context as to why `PATTERN_INTEGER` was used over `PATTERN_NUMBER` in the previous PR. cc @sfc-gh-pmardziel to add more background if I'm missing something obvious.
I do believe we should move toward structured and systematic feedback score generation mechanisms with self-refining prompt iterations (e.g., via DSPy) ASAP for more robust score generation, ideally before integrating with the monitoring stack, even at the cost of slightly higher token usage/cost/latency (which can also be alleviated via better prompts and instruction tuning).
Items to add to release announcement: