gkamradt / LLMTest_NeedleInAHaystack

Doing simple retrieval from LLM models at various context lengths to measure accuracy

[Feature Proposal] Multi-needle in a haystack #41

Open jsharf opened 7 months ago

jsharf commented 7 months ago

I really like this kind of benchmark. It would be interesting to make generalized versions of this, where there are a variable number of needles inserted. These could be unrelated independent needles, or they could be related. For example you could imagine 4 needles:

- A implies B
- B implies C and D
- D implies E
- B is true

Then you could test the "related" needles to ensure that all of them were detected and that the relationships are understood. (What might A be? What about D?)
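As a rough sketch of the insertion side of this, assuming hypothetical helper names that don't exist in the repo: multiple related needles could be placed at evenly spaced depths in the haystack, something like:

```python
# Hypothetical sketch: spread several "needle" statements at evenly
# spaced depths through a haystack of filler text. Function and
# variable names here are illustrative, not part of the repo.
def insert_needles(haystack: str, needles: list[str]) -> str:
    words = haystack.split()
    step = len(words) // (len(needles) + 1)
    # Insert from the back so earlier insertion points stay valid.
    for i, needle in reversed(list(enumerate(needles, start=1))):
        words.insert(i * step, needle)
    return " ".join(words)

needles = [
    "A implies B.",
    "B implies C and D.",
    "D implies E.",
    "B is true.",
]
haystack = "filler " * 100
print(insert_needles(haystack, needles))
```

A generalized version would also want to randomize depth and ordering, since evenly spaced needles in a fixed order might be easier to recall than scattered ones.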

Curious what you think about this. If you're interested in a feature like this and willing to accept a pull request, I could find the time to try implementing it. If you have a style guide preference or anything like that, please let me know.

gkamradt commented 7 months ago

Hey! Awesome post and request

Couple things:

  1. I totally agree that reasoning should be a part of the next set of tests. As an aside, I've been wondering what the "unit test" of reasoning is - what is the minimal amount of reasoning we can start with? It may be the transitive reasoning you're referring to here. I like this because you can easily append additional chains, and even put forks in the logic.
  2. Lance from LangChain added multi-needle recall, but it didn't have reasoning in there.
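For scoring the reasoning side, one possible approach (a sketch, not anything in the repo) is to compute the ground-truth derivable set by forward chaining over the implication rules, then grade the model's answers against it. The chain with a fork at B from the example above would look like:

```python
# Hypothetical sketch of a transitive-reasoning "unit test":
# given implication rules (with a fork at B) and one asserted fact,
# compute everything derivable by forward chaining.
def forward_chain(facts: set[str], rules: list[tuple[str, str]]) -> set[str]:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

rules = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]  # fork at B
facts = {"B"}
print(forward_chain(facts, rules))  # expected answer set to score against
```

Appending an extra link or fork is just another `(premise, conclusion)` pair, which makes it easy to scale chain length independently of context length. Note that "What might A be?" is abductive rather than deductive, so it would need a separate scoring rule.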

We're trying to keep the repo's tests, providers, and evaluators separate from one another. Other than that, no style guide.

Contributions are very welcome, and we'll be quick with feedback.

gkamradt commented 7 months ago

More context here too https://twitter.com/GregKamradt/status/1772491996063526971