stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Add 🤗Hugs Adapter (various useful huggingface adapters for specialized models) #524

Open Josephrp opened 7 months ago

Josephrp commented 7 months ago

Issue

LLM-based "rate this 1-5" evaluation prompts suffer from numerous biases, such as confirmation bias.

Solution

Identify hosted endpoints on Hugging Face for specialized models that are better suited to DSPy-specific tasks such as evaluation and agent functions.
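
For illustration, a minimal sketch of what calling such a hosted endpoint could look like (stdlib-only; the URL template follows the public HF Inference API, while the model id, token, and payload are placeholders supplied by the caller, not part of any existing DSPy adapter):

```python
# Sketch only, not the actual adapter API: POST a JSON payload to a hosted
# Hugging Face Inference API endpoint for a specialized model.
import json
import urllib.request
from typing import Optional

HF_API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def query_hf_endpoint(model_id: str, payload: dict, token: Optional[str] = None) -> dict:
    """POST a JSON payload to a hosted model endpoint and decode the JSON reply."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        HF_API_URL.format(model_id=model_id),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```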

Contribute Here:

useful links :

Tasks

kylebarone commented 7 months ago

@Josephrp Do you have any example papers or literature on the biases with LLM evaluators? Also do you have any good key-words or links to start looking for appropriate evaluator models?

Joshmantova commented 7 months ago

Do we want to use hosted endpoints that already exist within prebuilt spaces or do we want to curate a list of HF models and support custom-built endpoints based on those? I'd recommend supporting custom-built endpoints rather than existing spaces to support enterprise use cases. Additionally, I'd be concerned about the reliability of existing spaces.

I don't see any examples of adapters in dspy/adapter - how do we envision adapters being integrated? Should this not be implemented in dsp/modules?

Here is a list of potential models I've identified that might be helpful for this use case. A great starting point would be to identify models to generate each metric within the RAGAS (https://docs.ragas.io/en/stable/concepts/metrics/index.html) testing framework. Note that I have not tried any of these models, but I think models that solve some of these tasks might be useful for this issue's use case.

I'd love some feedback on these questions / this approach. If it's along the lines of what the team is envisioning, I might have time to take a shot at a PR to implement this.
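
As one concrete (and hedged) illustration of what a metric helper built on such a model might look like, here is a sketch that turns an NLI-style classifier's output (the label/score list hosted text-classification endpoints return) into a 0-1 faithfulness-like score. The label names "entailment"/"contradiction" are assumptions; real models use varying label sets:

```python
# Sketch: map an NLI classifier's label/score list to a single faithfulness
# score. Label names are illustrative; check the chosen model's config.
def faithfulness_score(nli_output: list) -> float:
    """Return the entailment probability, or 0.0 if the label is absent."""
    scores = {item["label"].lower(): item["score"] for item in nli_output}
    return scores.get("entailment", 0.0)
```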

Josephrp commented 6 months ago

I sure do: here's the TL;DR https://twitter.com/aparnadhinak/status/1748368364395721128?s=46&t=qGQoqKRt4WFyU9raI-qJVg

Josephrp commented 6 months ago

Your approach is sound and your list is great. I'm perusing how some of the "MLOps" folks are doing it, but basically it would be useful to organize it correctly so it's easy to expand and use :-)

Josephrp commented 6 months ago

I updated the issue description with a link to your list, since it's so great :-) ... now comes the hard part, and making a branch, probably 👍🏻

Josephrp commented 6 months ago

@Joshmantova @kylebarone, if y'all want, I made a branch on my account, because I actually couldn't publish branches here :-) If y'all want, we could surely do something jointly by the end of the month, at least a fairly robust start :-)

Branch

https://github.com/Josephrp/dspy/tree/feature/hug-adapter

Joshmantova commented 6 months ago

A couple of notes about implementation:

Open questions:

Once we clear up these open questions, implementation should be easy. Do you guys want to help look into these open questions?

Josephrp commented 6 months ago

Hey there, regarding the open questions:

  1. Sometimes yes, although with HF endpoints it's less difficult, as it's mainly "prompt formatting".
  2. I really don't think creating anything on a managed service is the way to go; there are thousands of endpoints available on Hugging Face already.
  3. Not sure we want to step on that, while a separate class might also be easier for development and testing.
  4. Rate limits are a problem; token limits less so. That's because endpoints "sleep" quite quickly, so there's basically the need to ping, wait, then ping again. For token limits I would suggest hard character limits.
  5. Well, the models identified above are kinda random. Having tried a lot of these models, the recommended thing is to select 2-5 industry-standard models and use those, then expand as required. Let's try identifying these and start there?

Just my personal take btw; this isn't coming from the DSPy maintainers or even a real coder :-)
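
To make point 4 concrete, here is a hedged sketch of that "ping, wait, ping again" loop with a hard character limit. The 503-with-`estimated_time` behavior follows the public HF Inference API while a model loads; the limit value and function names are illustrative, not an existing DSPy API:

```python
# Sketch: retry loop for hosted endpoints that "sleep", plus a hard
# character limit on inputs. MAX_INPUT_CHARS is an arbitrary illustration.
import json
import time
import urllib.error
import urllib.request

MAX_INPUT_CHARS = 4000  # illustrative hard character limit

def truncate(text: str) -> str:
    """Apply the hard character limit before sending."""
    return text[:MAX_INPUT_CHARS]

def query_with_wake(url: str, text: str, token: str = "", max_retries: int = 5) -> dict:
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps({"inputs": truncate(text)}).encode("utf-8")
    for attempt in range(max_retries):
        req = urllib.request.Request(url, data=data, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code != 503:  # only retry while the endpoint is waking up
                raise
            body = json.loads(err.read().decode("utf-8") or "{}")
            time.sleep(min(body.get("estimated_time", 2 ** attempt), 30))
    raise TimeoutError(f"Endpoint did not wake after {max_retries} attempts")
```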

Josephrp commented 6 months ago

Hey there folks, I just updated the issue and am making a focal point for this work, as it requires quite a lot of coordination.

I'm actually clucking for these evaluation functions, so I'm basically going to start piecemeal with:

Josephrp commented 6 months ago

Hey there folks, @okhat & @thomasahle:

So I'll just basically start to do it here, but before I imagine a way to make these, I wanted to ask the repo maintainers whether there is a way to organize things with a higher chance of passing the review process.

If you have any ideas, that would be great, so I can follow them :-)

Josephrp commented 6 months ago

Hey @kylebarone, I'm basically a couple of days from getting started here: https://git.tonic-ai.com/contribute/DSPy/dspy . It's actually public, so normally you can just use git as usual. What I hope is that, if there's a bunch of us, we can make a few of these pretty quickly, I mean various endpoint request clients and eval functions. Hope you join us :-)

Basically, I'm using a bunch of tutorials and people's code to make an app. Now I'm making my first "unique functions" in this app, which is to make synthetic long-form text in the format of a research article, then use that synthetic data to create another long-form text, which is a research proposal. It's pretty cool to me that you can do that as a user with one button click :-) That said, I'm at the point where I want my favorite evaluations :-)

But yeah, it's always a bit daunting starting with a clean slate, so I'll make sure to circle back and say when and if I've started a branch at least :-) Don't be shy to kickstart things if you have bright ideas and code, though; I'm a taker!

Josephrp commented 5 months ago

Is it a good idea to start doing this inside the backend refactor from @CyrusOfEden and @okhat? I think the best is to simply contribute there, isn't it, especially if it's modular enough to be in its own corner?

CyrusOfEden commented 5 months ago

Hey @Josephrp, it would be! What's here that isn't covered by LiteLLM's coverage of HuggingFace? https://docs.litellm.ai/docs/#basic-usage

Josephrp commented 5 months ago

> Hey @Josephrp, it would be! What's here that isn't covered by LiteLLM's coverage of HuggingFace? https://docs.litellm.ai/docs/#basic-usage

Great question @CyrusOfEden! Turns out it doesn't handle anything other than text-completion LLMs, so it's not exactly useful for classifiers or for models that return numbers, JSON maps, and so on. Using specialized models is simply better for evaluation; that's why it's important to get at least the "industry standard" models in there. And it will help folks quickly build evaluations.
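
To illustrate the gap, specialized endpoints return shapes a text-completion wrapper can't express: label/score lists, batched nested lists, bare floats. A hedged sketch of a normalizer for those shapes (the function is illustrative, not an existing DSPy or LiteLLM API):

```python
# Sketch: pull a single numeric score out of the varied response shapes
# that classifier-style hosted models return.
def extract_score(output, label=None):
    """Normalize a classifier-style response to one float."""
    if isinstance(output, (int, float)):
        return float(output)
    if isinstance(output, list):
        if output and isinstance(output[0], list):  # batched: take first item
            output = output[0]
        if label is not None:
            for item in output:
                if item.get("label", "").lower() == label.lower():
                    return float(item["score"])
            raise KeyError(f"Label {label!r} not in response")
        return max(float(item["score"]) for item in output)
    raise ValueError(f"Unrecognized output shape: {type(output)!r}")
```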

CyrusOfEden commented 5 months ago

Oh cool! That makes sense. Can you branch off backend-refactor and adopt a similar prepare_request / process_response structure?

Josephrp commented 5 months ago

> Oh cool! That makes sense. Can you branch off backend-refactor and adopt a similar prepare_request / process_response structure?

I sure can, or at least I think so. Hope that's okay with y'all then; I'll try to branch from that branch and get these features in there.
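
For reference, a minimal sketch of what an adapter following that prepare_request / process_response split could look like. The class and method shapes are assumptions about the backend-refactor branch, not its actual base-class API:

```python
# Sketch: a "Hugs" adapter shaped around the prepare_request /
# process_response split described above. Names are illustrative.
class HFClassifierAdapter:
    def __init__(self, model_id: str):
        self.model_id = model_id

    def prepare_request(self, text: str) -> dict:
        # Build the JSON body a hosted text-classification endpoint expects.
        return {"inputs": text}

    def process_response(self, response: list) -> dict:
        # Flatten a (possibly batched) label/score list into {label: score}.
        if response and isinstance(response[0], list):
            response = response[0]
        return {item["label"]: item["score"] for item in response}
```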