avoid recency bias in prompt construction

AndreasKarasenko commented 3 months ago

Context: According to this paper, ChatGPT (and likely other LLMs) suffers from a recency bias: whatever class comes last in the prompt has a higher probability of being selected.

Issue: Currently scikit-llm constructs prompts based on the order of the training data. Since we are recommended to restrict the training data, I would usually do something like this:

df = df.groupby(label_col).apply(lambda x: x.sample(n_samples))
df = df.reset_index(drop=True)

This returns a dataframe sorted by label_col. Even if sort=False is passed to groupby, the instances are still clustered by label.

Question/Solution: Should a method be implemented that randomizes the order of samples in the prompt / training data, or should users take care of that themselves? The most straightforward way would be to simply add this to sampling:

df = df.sample(frac=1)

This leaves it up to chance whether the resulting order is reasonably balanced.
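For reference, the two steps above can be combined into one helper. This is only a minimal sketch on toy data: the function name sample_and_shuffle, the seed parameter, and the example dataframe are assumptions made for illustration and are not part of scikit-llm.

```python
import pandas as pd


def sample_and_shuffle(df, label_col, n_samples, seed=0):
    """Draw n_samples rows per class, then shuffle the row order.

    Hypothetical helper for illustration; not a scikit-llm function.
    """
    # Per-class sampling: the result is still clustered by label.
    sampled = df.groupby(label_col).sample(n=n_samples, random_state=seed)
    # Shuffle all rows so the last position in the prompt is not tied
    # to whichever label happens to sort last.
    shuffled = sampled.sample(frac=1, random_state=seed)
    return shuffled.reset_index(drop=True)


# Toy data, made up for this example.
df = pd.DataFrame(
    {
        "text": [f"review {i}" for i in range(12)],
        "label": ["neg"] * 4 + ["neu"] * 4 + ["pos"] * 4,
    }
)
result = sample_and_shuffle(df, "label", n_samples=2, seed=42)
print(result)
```

Fixing the seed keeps the order reproducible, but a single shuffle can still place several samples of one class at the end, so this reduces the clustering without guaranteeing a balanced order.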

OKUA1 commented 3 months ago

Hi @AndreasKarasenko,

Yes, the order of the samples introduces some bias. For the regular FewShot this can be easily solved by permuting the training data. It is not that straightforward in the DynamicFewShot and would require some refactoring.
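A minimal sketch of permuting the training data before fitting, assuming the texts and labels are held in two parallel lists. The helper name permute_together and the toy data are hypothetical; the snippet only covers the shuffling step and makes no assumption about how the estimator consumes the result.

```python
import numpy as np


def permute_together(X, y, seed=0):
    """Apply one random permutation to texts and labels alike.

    Hypothetical helper for illustration; not a scikit-llm function.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    rng = np.random.default_rng(seed)
    # A single index permutation keeps each text paired with its label.
    order = rng.permutation(len(X))
    X_shuffled = [X[i] for i in order]
    y_shuffled = [y[i] for i in order]
    return X_shuffled, y_shuffled


# Toy data, made up for this example and initially clustered by label.
X = ["bad", "awful", "okay", "fine", "great", "superb"]
y = ["neg", "neg", "neu", "neu", "pos", "pos"]
X_perm, y_perm = permute_together(X, y, seed=1)
print(list(zip(X_perm, y_perm)))
```

The shuffled lists can then be passed to fit in place of the originals.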

On the other hand, I am not sure whether it poses such a big problem. The study you provided is from 2021 and hence relatively outdated.

Also, from my personal observations, sometimes even in the ZeroShot setting, the order of the candidate labels is relevant. Therefore, the prompt order would probably always introduce some bias, which can hardly be completely avoided.

AndreasKarasenko commented 3 months ago

A forward search yields this paper from 2024, which supports your last point and also points to this paper from 2021/2022. You're probably right that accounting for all biases might be out of scope. Maybe a best practices section would be appropriate then?

OKUA1 commented 3 months ago

Yes, I agree that it is a good idea to at least mention it somewhere and, in the future, to think about refactoring the code a bit to minimize this bias.

I will keep the issue open for now.

iryna-kondr / scikit-llm

avoid recency bias in prompt construction #104