stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Add 🤗Hugs Adapter (various useful huggingface adapters for specialized models) #524

Open Josephrp opened 7 months ago

Josephrp commented 7 months ago

Issue

LLM-based "rate this 1-5" evaluation prompts suffer from numerous biases, such as confirmation bias.

Solution

Identify hosted endpoints on Hugging Face for specialized models that are better suited to DSPy-specific tasks such as evaluation and agent functions.
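
For illustration, a minimal sketch of what calling such a hosted endpoint could look like (stdlib-only; the URL template follows the public HF Inference API, while the model id, token, and payload are placeholders supplied by the caller, not part of any existing DSPy adapter):

```python
# Sketch only, not the actual adapter API: POST a JSON payload to a hosted
# Hugging Face Inference API endpoint for a specialized model.
import json
import urllib.request
from typing import Optional

HF_API_URL = "https://api-inference.huggingface.co/models/{model_id}"

def query_hf_endpoint(model_id: str, payload: dict, token: Optional[str] = None) -> dict:
    """POST a JSON payload to a hosted model endpoint and decode the JSON reply."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        HF_API_URL.format(model_id=model_id),
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```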

Contribute Here:

useful links :

Tasks

kylebarone commented 7 months ago

@Josephrp Do you have any example papers or literature on the biases with LLM evaluators? Also do you have any good key-words or links to start looking for appropriate evaluator models?

Joshmantova commented 7 months ago

Do we want to use hosted endpoints that already exist within prebuilt spaces or do we want to curate a list of HF models and support custom-built endpoints based on those? I'd recommend supporting custom-built endpoints rather than existing spaces to support enterprise use cases. Additionally, I'd be concerned about the reliability of existing spaces.

I don't see any examples of adapters in dspy/adapter - how do we envision adapters being integrated? Should this not be implemented in dsp/modules?

Here is a list of potential models I've identified that might be helpful for this use case. A great starting point would be to identify models to generate each metric within the RAGAS (https://docs.ragas.io/en/stable/concepts/metrics/index.html) testing framework. Note that I have not tried any of these models, but I think models that solve some of these tasks might be useful for this issue's use case.

I'd love some feedback on these questions / this approach. If it's along the lines of what the team is envisioning, I might have time to take a shot at a PR to implement this.
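
As one concrete (and hedged) illustration of what a metric helper built on such a model might look like, here is a sketch that turns an NLI-style classifier's output (the label/score list hosted text-classification endpoints return) into a 0-1 faithfulness-like score. The label names "entailment"/"contradiction" are assumptions; real models use varying label sets:

```python
# Sketch: map an NLI classifier's label/score list to a single faithfulness
# score. Label names are illustrative; check the chosen model's config.
def faithfulness_score(nli_output: list) -> float:
    """Return the entailment probability, or 0.0 if the label is absent."""
    scores = {item["label"].lower(): item["score"] for item in nli_output}
    return scores.get("entailment", 0.0)
```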

Josephrp commented 6 months ago

I sure do: here's the TL;DR https://twitter.com/aparnadhinak/status/1748368364395721128?s=46&t=qGQoqKRt4WFyU9raI-qJVg

Josephrp commented 6 months ago

Your approach is sound and your list is great. I'm perusing how some of the "MLOps" folks are doing it, but basically it would be useful to organize it correctly so it's easy to expand and use :-)

Josephrp commented 6 months ago

I updated the issue description with a link to your list, since it's so great :-) ... now comes the hard part, and making a branch, probably 👍🏻

Josephrp commented 6 months ago

@Joshmantova @kylebarone, if y'all want, I made a branch on my account, because I actually couldn't publish branches here :-) If y'all want, we could surely do something jointly by the end of the month, at least a fairly robust start :-)

Branch

https://github.com/Josephrp/dspy/tree/feature/hug-adapter

Joshmantova commented 6 months ago

A couple of notes about implementation:

Open questions:

Once we clear up these open questions, implementation should be easy. Do you guys want to help look into these open questions?

Josephrp commented 6 months ago

Hey there, regarding the open questions:

  1. Sometimes yes, although with HF endpoints it's less difficult, as it's mainly "prompt formatting".
  2. I really don't think creating anything on a managed service is the way to go; there are thousands of endpoints available on Hugging Face already.
  3. Not sure we want to step on that, while a separate class might also be easier for development and testing.
  4. Rate limits are a problem; token limits less so. That's because endpoints "sleep" quite quickly, so there's basically the need to ping, wait, then ping again. For token limits I would suggest hard character limits.
  5. Well, the models identified above are kinda random. Having tried a lot of these models, the recommended thing is to select 2-5 industry-standard models and use those, then expand as required. Let's try identifying these and start there?

Just my personal take btw; this isn't coming from the DSPy maintainers or even a real coder :-)
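
To make point 4 concrete, here is a hedged sketch of that "ping, wait, ping again" loop with a hard character limit. The 503-with-`estimated_time` behavior follows the public HF Inference API while a model loads; the limit value and function names are illustrative, not an existing DSPy API:

```python
# Sketch: retry loop for hosted endpoints that "sleep", plus a hard
# character limit on inputs. MAX_INPUT_CHARS is an arbitrary illustration.
import json
import time
import urllib.error
import urllib.request

MAX_INPUT_CHARS = 4000  # illustrative hard character limit

def truncate(text: str) -> str:
    """Apply the hard character limit before sending."""
    return text[:MAX_INPUT_CHARS]

def query_with_wake(url: str, text: str, token: str = "", max_retries: int = 5) -> dict:
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps({"inputs": truncate(text)}).encode("utf-8")
    for attempt in range(max_retries):
        req = urllib.request.Request(url, data=data, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            if err.code != 503:  # only retry while the endpoint is waking up
                raise
            body = json.loads(err.read().decode("utf-8") or "{}")
            time.sleep(min(body.get("estimated_time", 2 ** attempt), 30))
    raise TimeoutError(f"Endpoint did not wake after {max_retries} attempts")
```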

Josephrp commented 6 months ago

Hey there folks, I just updated the issue and am making a focal point for this work, as it requires quite a lot of coordination.

I'm actually clucking for these evaluation functions, so I'm basically going to start piecemeal with:

Josephrp commented 6 months ago

Hey there folks, @okhat & @thomasahle:

So I'll just basically start to do it here, but before I imagine a way to make these, I wanted to ask the repo maintainers whether there is a way to organize things with a higher chance of passing the review process.

If you have any ideas, that would be great, so I can follow them :-)

Josephrp commented 6 months ago

Hey @kylebarone, I'm basically a couple of days from getting started here: https://git.tonic-ai.com/contribute/DSPy/dspy . It's actually public, so normally you can just use git as usual. What I hope is that, if there's a bunch of us, we can make a few of these pretty quickly, I mean various endpoint request clients and eval functions. Hope you join us :-)

Basically, I'm using a bunch of tutorials and people's code to make an app. Now I'm making my first "unique functions" in this app, which is to make synthetic long-form text in the format of a research article, then use that synthetic data to create another long-form text, which is a research proposal. It's pretty cool to me that you can do that as a user with one button click :-) That said, I'm at the point where I want my favorite evaluations :-)

But yeah, it's always a bit daunting starting with a clean slate, so I'll make sure to circle back and say when and if I've started a branch at least :-) Don't be shy to kickstart things if you have bright ideas and code, though; I'm a taker!

Josephrp commented 5 months ago

Is it a good idea to start doing this inside the backend refactor from @CyrusOfEden and @okhat? I think the best is to simply contribute there, isn't it, especially if it's modular enough to be in its own corner?

CyrusOfEden commented 5 months ago

Hey @Josephrp, it would be! What's here that isn't covered by LiteLLM's coverage of HuggingFace? https://docs.litellm.ai/docs/#basic-usage

Josephrp commented 5 months ago

> Hey @Josephrp, it would be! What's here that isn't covered by LiteLLM's coverage of HuggingFace? https://docs.litellm.ai/docs/#basic-usage

Great question @CyrusOfEden! Turns out it doesn't handle anything other than text-completion LLMs, so it's not exactly useful for classifiers or for models that return numbers, JSON maps, and so on. Using specialized models is simply better for evaluation; that's why it's important to get at least the "industry standard" models in there. And it will help folks quickly build evaluations.
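
To illustrate the gap, specialized endpoints return shapes a text-completion wrapper can't express: label/score lists, batched nested lists, bare floats. A hedged sketch of a normalizer for those shapes (the function is illustrative, not an existing DSPy or LiteLLM API):

```python
# Sketch: pull a single numeric score out of the varied response shapes
# that classifier-style hosted models return.
def extract_score(output, label=None):
    """Normalize a classifier-style response to one float."""
    if isinstance(output, (int, float)):
        return float(output)
    if isinstance(output, list):
        if output and isinstance(output[0], list):  # batched: take first item
            output = output[0]
        if label is not None:
            for item in output:
                if item.get("label", "").lower() == label.lower():
                    return float(item["score"])
            raise KeyError(f"Label {label!r} not in response")
        return max(float(item["score"]) for item in output)
    raise ValueError(f"Unrecognized output shape: {type(output)!r}")
```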

CyrusOfEden commented 5 months ago

Oh cool! That makes sense. Can you branch off backend-refactor and adopt a similar prepare_request / process_response structure?

Josephrp commented 5 months ago

> Oh cool! That makes sense. Can you branch off backend-refactor and adopt a similar prepare_request / process_response structure?

I sure can, or at least I think so. Hope that's okay with y'all then; I'll try to branch from that branch and get these features in there.
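
For reference, a minimal sketch of what an adapter following that prepare_request / process_response split could look like. The class and method shapes are assumptions about the backend-refactor branch, not its actual base-class API:

```python
# Sketch: a "Hugs" adapter shaped around the prepare_request /
# process_response split described above. Names are illustrative.
class HFClassifierAdapter:
    def __init__(self, model_id: str):
        self.model_id = model_id

    def prepare_request(self, text: str) -> dict:
        # Build the JSON body a hosted text-classification endpoint expects.
        return {"inputs": text}

    def process_response(self, response: list) -> dict:
        # Flatten a (possibly batched) label/score list into {label: score}.
        if response and isinstance(response[0], list):
            response = response[0]
        return {item["label"]: item["score"] for item in response}
```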