stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

[WIP] Major refactor roadmap #390

Open okhat opened 8 months ago

okhat commented 8 months ago

DSPy has a small number (maybe 5-6) of extremely powerful concepts that have grown organically over the past year as open source.

Internally, it's time for a major refactor that will simplify things greatly and make sure everything works more smoothly. I have received a lot of interest from the community to contribute to this, so we just need to make sure the goals and milestones are clearly defined.

Potential leaders in this space include @CyrusOfEden @stalkermustang (and from the internal DSPy side possibly @krypticmouse and @arnavsinghvi11, but I haven't checked with them), and there have been a lot of in-depth shots at this from @thomasahle, so I'm curious if he's also interested more broadly in this.

Starting this issue just to collect the necessary tasks and prioritize them in the right dependency order.

Off the top of my head, I think we have to have:

  1. Cleaner LM abstraction that requires a lot less work to maintain and is clearer about the boundaries. The amazing @CyrusOfEden has already defined this on paper. This will include a cleaner "Backend" abstraction, which is a bit of a wrapper around LMs that does template adaptation and other things that Cyrus has in mind.

  2. Cleaner Signature abstraction. I think the push by @thomasahle here is perfectly on the right track: types, and immutable signatures. We just need to make more decisions about how far down Pydantic/types we go, and how far down, say, SGLang we go, versus having our own logic, etc. I do like outsourcing parsing logic and type logic, but we need to make sure it doesn't break existing features. (A rough sketch of what this could look like follows after this list.)

  3. Cleaner Modules. This is actually easy but needs to be streamlined. Predict and CoT need to be Modules, and they need to store instructions (not leave that to the signature). They need to handle multiple outputs more consistently. This can be any of us really, esp. @arnavsinghvi11, me, @krypticmouse, or @thomasahle, if any of the folks are interested.

  4. Cleaner Optimizers. Well, we can leave this to me and it's a final easy step once signatures and modules are good.

  5. More guidance on common patterns in the docs. We now finally have docs for all the key components and we have plenty of individual examples, but we do not have enough guidance on the common e2e workflows. This also partly includes clear guidance on what people should do with local LMs: Ollama for CPU; TGI, SGLang, or vLLM(?) for GPUs? What about quantization, etc.?
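
For item 2, a rough sketch (my reading of the direction, not a settled design) of what a typed, immutable signature could look like with Pydantic-style annotations; the class and field names here are hypothetical:

import dspy
from dspy.functional import TypedPredictor

class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""

    context: str = dspy.InputField(desc="facts assumed to be true")
    text: str = dspy.InputField()
    faithfulness: bool = dspy.OutputField(desc="whether the text is faithful to the context")

# Annotations drive parsing/validation of the LM output into typed values.
classify = TypedPredictor(CheckCitationFaithfulness)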

Thoughts? Other things we need?

  1. Production concerns? Streaming. Async.
  2. Docstrings.
  3. Various individual issues.

Dumping list of stuff:

  1. Assertions now has an open issue to collect things needing improvement
  2. Don't fail silently if a kwarg passed to Predict is forgotten or has an incorrect name
neoxelox commented 8 months ago

It would be great to have the same kind of LM abstraction for RMs.

I would create an RM class, like the existing LM class, that all the different third-party retriever models inherit from, instead of having them inherit from the Retrieve module. This would make it possible to create different advanced retrieval techniques as modules that inherit from the Retrieve module, which in turn would use any RM transparently (it already does, but it is confusing because the RM is itself another Retrieve module).

Something like ChainOfThought inheriting from Predict which uses an LM underneath.
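
For illustration, a rough sketch of that proposal; the class names are hypothetical and not current DSPy API:

import dspy

class RM:
    """Thin base class every third-party retriever model would inherit from."""
    def __call__(self, query: str, k: int = 3) -> list[str]:
        raise NotImplementedError

class MyVectorDBRM(RM):
    """Example third-party integration; the search call is a stand-in."""
    def __call__(self, query: str, k: int = 3) -> list[str]:
        return [f"passage {i} for {query!r}" for i in range(k)]

class SimpleRetrieve(dspy.Module):
    """Advanced retrieval techniques would subclass this and use any RM transparently."""
    def __init__(self, rm: RM, k: int = 3):
        super().__init__()
        self.rm, self.k = rm, k

    def forward(self, query: str) -> list[str]:
        return self.rm(query, k=self.k)

retrieve = SimpleRetrieve(MyVectorDBRM(), k=5)
passages = retrieve(query="What is DSPy?")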

okhat commented 8 months ago

Yes, RMs too, but RMs can honestly just be function calls, so it's easier for people to deal with them for now.

Actually CoT and Predict should be dspy.Modules. CoT shouldn't inherit from Predict, that's a bad old decision that we'll change.
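
A rough sketch of that direction, i.e. ChainOfThought as a dspy.Module that composes a Predict step rather than subclassing it (illustrative only, not the actual implementation):

import dspy

class ChainOfThoughtSketch(dspy.Module):
    def __init__(self, signature):
        super().__init__()
        # A real implementation would first extend the signature with a
        # rationale/reasoning output field; here we just delegate to Predict.
        self.predict = dspy.Predict(signature)

    def forward(self, **kwargs):
        return self.predict(**kwargs)

qa = ChainOfThoughtSketch("question -> answer")
# prediction = qa(question="What is the capital of France?")  # needs a configured LM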

krypticmouse commented 8 months ago

Sounds perfect! I was wondering if we can shift Example, Prediction and Completions classes to Pydantic.

Tensors are the only dtype in PyTorch, and Example could similarly be the single dtype here; internally, everything the other two classes do could be wrapped in class methods.

This would be a pretty big migration and possibly not even backwards compatible, so we might want to think on this.

okhat commented 8 months ago

I disagree with that @krypticmouse . Predictions are already Examples anyway.

CyrusOfEden commented 8 months ago

I'm for using as much Pydantic as we can here

krypticmouse commented 8 months ago

Indeed, Prediction is basically an Example; it does have the from_completion method that Example doesn't. That doesn't make much difference, yes, but I thought this could have become a class method of a Pydantic model.

Not a major issue tbh though, just a thought :)

Mostly just for better organization and readability.

fearnworks commented 8 months ago

https://github.com/stanfordnlp/dspy/issues/392

denver-smartspace commented 8 months ago

Reliably keeping within token limits when compiling and running programs, without having to set the module config for a specific LM, is big for me to be able to deploy this to production. IMO, ideally the config could stay pretty much the same if you move from a 32k context to an 8k context. You'd just recompile and it'd automatically use fewer or shorter demos and whatever else it needed to.

My initial thoughts are that this has two main elements:

  1. Add something like an estimate_tokens method to LM. It'd take the same arguments as an LM call but would just return the number of tokens that would be used if you actually called it. Same idea as a 'what if' in infrastructure deployments: it takes the same parameters but doesn't run anything, just tells you what it'd do if you actually ran it. (A rough sketch follows below.)
  2. Make use of the new estimate_tokens method when compiling to stay within token limits

The distinction between the two elements is because it's not just for compiling that it'd be useful. When we create a module or program, it'd be good to be able to estimate tokens so you can do things like limit the amount of context given by retrieval.
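
A minimal sketch of that estimate_tokens idea, assuming a tiktoken-style tokenizer; the class and method names are hypothetical, not existing DSPy API:

import tiktoken

class EstimatingLM:
    def __init__(self, model: str = "gpt-3.5-turbo", max_context: int = 8192):
        self.model = model
        self.max_context = max_context
        self._enc = tiktoken.encoding_for_model(model)

    def estimate_tokens(self, prompt: str, **kwargs) -> int:
        """Return prompt tokens plus the requested completion budget,
        without actually calling the model (the 'what if' mode)."""
        return len(self._enc.encode(prompt)) + kwargs.get("max_tokens", 0)

    def fits(self, prompt: str, **kwargs) -> bool:
        """True if a call with these arguments would stay within the context window."""
        return self.estimate_tokens(prompt, **kwargs) <= self.max_context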

peteryongzhong commented 8 months ago

I want to echo your point @okhat about the instructions/prompts in modules. I think right now they are a little spread out in the code as strings in various places that sometimes get appended together. If that could be elevated in terms of abstractions and/or made clearer, it might even make it easier to analyse a module and potentially perform some interesting transformations on it later down the line. I don't quite think we need to go as far as the prompting-first abstractions that LangChain offers, and prompting is not something we can completely divorce this from, but handling it in a more organised fashion that allows for future analysis could be useful?

thomasahle commented 8 months ago

Integrating 4 (optimizers) in the thinking early on might be necessary, since they are what put the biggest strain on the API. We need to think about what features they require to be available, such as

Assertions is another example of an "involved" feature that needs a lot of support, but hopefully not a lot of special casing. Right now there's the whole new_signature keyword argument that gets used sometimes and seems to have been introduced specifically for Retry to use.

CyrusOfEden commented 8 months ago

#368 should get merged in (or at least its tests) before we embark on any major refactor, because we ought to have tests to ensure we don't introduce any unintended regressions.

CShorten commented 8 months ago

Hey team, some notes so far:

  1. Backend refactor sounds great!
  2. Indeed, this is an interesting one.
  3. Can’t comment on how this is currently configured.
  4. Awesome, I'm certain the team you've put together will come up with something interesting for this! Already super love the BayesianSignatureOptimizer.
  5. Ah fantastic, sorry for the delay here — will touch up on the WeaviateRM.

I like the idea of extending RMs to be more than function calls, but I do think that interfacing, for example, the Weaviate Python client with the module's forward pass will probably work fine for a while.

Keeping within token limits sounds amazing. The LMs have an internal max_tokens state that you could probably just multiply by the upper bound of number of calls in your module’s forward passes. Compiling is another story I don’t know enough about DSPy yet to comment on.

still have a couple more responses to read, will update with mentions.

CyrusOfEden commented 8 months ago

I'll try to kick off the backend refactor Saturday, if not, @isaacbmiller is down to have the first part ready by Tuesday–Wednesday of next week

S1M0N38 commented 8 months ago

Recently, Ollama released an OpenAI-compatible API. Other companies like Mistral AI also offer APIs that follow OpenAI specifications. Additionally, there are projects like LiteLLM that provide API-compatibility layers (e.g., using proxies).

So I think that LM abstraction could potentially just be a single thin wrapper around the OpenAI API specification. Is this a viable option?
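
For what it's worth, a minimal sketch of that "single thin wrapper" idea: any OpenAI-compatible server (Ollama, Mistral, a LiteLLM proxy, etc.) is reached by swapping the base URL. The class name and defaults here are hypothetical:

from openai import OpenAI

class OpenAICompatibleLM:
    def __init__(self, model: str, base_url: str = "http://localhost:11434/v1",
                 api_key: str = "not-needed", **kwargs):
        # One client for any OpenAI-spec server; provider is selected by base_url.
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.kwargs = kwargs  # provider-agnostic defaults, e.g. temperature

    def __call__(self, prompt: str, **kwargs) -> list[str]:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            **{**self.kwargs, **kwargs},
        )
        return [choice.message.content for choice in response.choices]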

CShorten commented 8 months ago

@S1M0N38 you would still need the thin wrapper though to pass in optional arguments with kwargs.

CyrusOfEden commented 8 months ago

@S1M0N38 @CShorten just came across LiteLLM today — and it seems like a home run for inference (not fine-tuning). Am I missing anything?

S1M0N38 commented 8 months ago

@S1M0N38 you would still need the thin wrapper though to pass in optional arguments with kwargs.

@CShorten what are the optional kwargs that differ from provider to provider and are needed by DSPy? (e.g. I think temperature is one of those, needed to control the amount of prompt "exploration", isn't it?) Here, for example, are the input params that LiteLLM supports for different providers.

S1M0N38 commented 8 months ago

... just came across LiteLLM today — and it seems like a home run for inference (not fine-tuning). Am I missing anything?

@CyrusOfEden I believe you're correct, but upon examining the code in dsp/modules/[gpt3|cohere|ollama].py, it appears that the only requests being made are HTTP requests to the inference endpoint, namely /completion, chat/completion, /api/generate, api/chat, etc. These are all inference requests for text. Could you elaborate on the fine-tuning you mentioned?

I'm not entirely familiar with the inner workings and requirements of this framework, so everything I've mentioned may not be feasible. Therefore, please take my statements with a grain of salt. In my opinion, for a project like this, it's best to focus on core concepts rather than implementing numerous features. The idea is to defer those to other libraries or leave them to the user to implement, guided by high-quality documentation.

CShorten commented 8 months ago

@S1M0N38 I think the way multiple generations are sampled -- for example Cohere has num_generations but the google.generativeai API has no such option. Probably little nuances like this, but the chart you shared is great, I trust your judgment on this.

CyrusOfEden commented 8 months ago

@S1M0N38 yup, LiteLLM would be good for inference — and currently LMs don't have a .finetune method but we want that later.

bencrouse commented 8 months ago

I'm new to this library. I'd love to see more support for production deployment; my prioritized wish list would be:

CyrusOfEden commented 8 months ago

@bencrouse those are definitely on the roadmap — I think the focus right now is reliability / API / typed outputs / everything Omar originally mentioned and then afterwards we want to do some thinking through what it means to productionize this (async/streaming/deployment/etc.)

buzypi commented 8 months ago

+1 for supporting all OpenAI-compatible local LLM servers for inference, not just Ollama. I think this will increase adoption, because a lot of "application developers" of LMs who are not ML experts use tools like LM Studio, GPT4All, etc.

ishaan-jaff commented 8 months ago

Hi I'm the litellm maintainer - what's missing in litellm to start using it in this repo ? @CyrusOfEden @S1M0N38

Happy to help with any issues / feature requests - even minor ones

CyrusOfEden commented 8 months ago

Hi I'm the litellm maintainer - what's missing in litellm to start using it in this repo ? @CyrusOfEden @S1M0N38

Happy to help with any issues / feature requests - even minor ones

Tbh I think you're good for now, great to have your support 🔥

CyrusOfEden commented 8 months ago

@ishaan-jaff how does tool use work with models that don't necessarily support it?

Would be really cool if I could use LiteLLM for tool use for whatever model -- is there a table for tool use support somewhere?

Separately, is LiteLLM able to integrate something like Outlines to support tool use for models that don't natively support it?

CShorten commented 8 months ago

Interesting question, I suspect digging into how dspy.ReAct implements the dspy.Retrieve tool could be a good start to understanding how to interface all tools @CyrusOfEden.

Maybe this is an argument for why these tools should be integrated more deeply into DSPy rather than used externally as calls in the forward pass (or there could be some kind of API contract with ReAct for passing in arbitrary functions as well).

okhat commented 8 months ago

Yeah I’m in favor of minimal abstractions around tools. I think a tool being just a function whose docstring and arguments and name can be used by ReAct can achieve this
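
A small sketch of that idea: the tool is a plain function, and a ReAct-style module reads its name, docstring, and parameters via introspection. The helper below is illustrative, not DSPy API:

import inspect

def search_wikipedia(query: str, k: int = 3) -> list[str]:
    """Search Wikipedia and return the top-k passage texts for the query."""
    return []  # stand-in for a real retriever call

def describe_tool(fn) -> dict:
    """Collect the metadata a ReAct-style agent needs to decide when and how to call fn."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {name: str(p.annotation) for name, p in sig.parameters.items()},
    }

print(describe_tool(search_wikipedia))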

peteryongzhong commented 8 months ago

The coupling between DSPy and "tools", in the ReAct sense, should be as light as possible in my opinion. There could, however, be some annotations that make optimisation easier; for instance, to enable backtracking with it.

More generally, I feel like some of the coupling with other technologies like RMs may be a little too strong at the moment from a software engineering perspective, which I understand the reason behind as it unlocks a lot of cool things like retrieveEnsemble, but it does feel a little specific at the moment.

My vague intuition is that it would be great if DSPy could have more generic and well-defined boundaries when it comes to the backend components (hopefully backed by classes and types, and less so by magic strings/attributes). This comment might betray more of a personal dispreference against "duck" typing, but regardless, a more defined internal operating mechanics/schema could make future development of passes and features a lot less burdensome.

CyrusOfEden commented 8 months ago

Yeah I’m in favor of minimal abstractions around tools. I think a tool being just a function whose docstring and arguments and name can be used by ReAct can achieve this

LiteLLM on it — really liking what I'm finding in this repo — they have a util for converting from a Python function to a JSON schema tool def [0]

[0] https://litellm.vercel.app/docs/completion/function_call#litellmfunction_to_dict---convert-functions-to-dictionary-for-openai-function-calling
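
Usage looks roughly like this, following the linked docs (the example function is illustrative, and the util expects a numpy-style docstring):

import litellm

def get_current_weather(location: str, unit: str = "fahrenheit"):
    """Get the current weather in a given location.

    Parameters
    ----------
    location : str
        The city and state, e.g. San Francisco, CA
    unit : str
        Temperature unit, 'celsius' or 'fahrenheit'
    """
    return f"The weather in {location} is sunny."

# Convert the Python function into an OpenAI-style function/tool definition.
tool_def = litellm.utils.function_to_dict(get_current_weather)
print(tool_def)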

ishaan-jaff commented 8 months ago

@ishaan-jaff how does tool use work with models that don't necessarily support it?

We switch on JSON mode if the provider supports it, add the function / tools to the prompt, and then get a response. You would have to explicitly enable this with litellm.add_function_to_prompt = True.

Does this answer your question ?
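
For reference, a rough usage sketch of the setting described above (the model name and function schema here are just examples):

import litellm

# For providers without native function/tool calling, inject the function
# definitions into the prompt itself.
litellm.add_function_to_prompt = True

response = litellm.completion(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    functions=[{
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }],
)
print(response.choices[0].message.content)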

ovshake commented 8 months ago

@CShorten, @okhat Currently, in WeaviateRM, the Weaviate client is still from Weaviate v3. Are there any plans in the current roadmap to update it to Weaviate Python Client (v4)? If not, should we add it?

CShorten commented 8 months ago

Hey @ovshake! Yes, planning on updating that soon, but please feel free to take it on if you're interested; I would really, really appreciate it!

On a side note related to the WeaviateRM, @okhat: I am preparing a quick set of demos, per the discussion above, on how to use Weaviate as an external tool and how to then interface the functions with docstrings for ReAct. Want to get this nugget in there for the docs.

ovshake commented 8 months ago

Thanks @CShorten , I will definitely take a stab at it and open a PR :)

isaacbmiller commented 8 months ago

@CyrusOfEden @okhat I started looking into a LiteLLM integration as a starting place for a nicer universal integration. We might be blocked from getting an OpenAI version running until we upgrade to OpenAI > v1.0.0 (i.e., once #403 is merged). I'm still going to try to see if I can get an Ollama version to work.

younes-io commented 8 months ago

Hello, could we also add to the roadmap the feature that allows loading a compiled/optimized module along with its assertions and inferences? If that's already supported, please share a notebook. I can propose a notebook if I get the code snippet for this. Thank you!

KCaverly commented 8 months ago

A question that's come up in a few different places.

With the LiteLLM integration, would we sunset all provider specific classes (OpenAI, Cohere etc), and direct everyone to use the LiteLLM interface?

CyrusOfEden commented 8 months ago

@ishaan-jaff can you add me on Discord? cyrusofeden

Wanna chat about this DSPy integration

collinjung commented 8 months ago

Hi, Omar asked me to tag you @CyrusOfEden. Here is a notebook where I use LangChain's GPT model and am able to use it with the native DSPy Predict function: [Google Colab]. Please let me know if you have questions.

dharrawal commented 8 months ago

If the trainset and metric function were tied to the module (via the predict function for each module), the teleprompters could optimize each and every layer (module) in a multi-module app. Off the top, this does not seem like a difficult thing to implement.

I realize assertions and suggestions are a step in this direction, but I don't think they are a replacement for optimizing each layer individually.

okhat commented 8 months ago

Thanks @dharrawal, I wrote some thoughts on Discord:

I might be missing some of the point you're trying to make, but in the general case, it's not possible for people to specify a metric on each module. The goal of a good optimizer (like in RL problems also) is to figure out good intermediate supervision. There are many ways to achieve that, some simple and others more complex, but the consistent thing is that optimizers will prefer intermediate steps that maximize the eventual metric at the end. If you want to optimize each layer separately with a metric, you can just compile each module alone, but that's rarely the needed use case.
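
For the "compile each module alone" option, a hedged sketch using BootstrapFewShot; the metric, trainset, and sub-module here are placeholders:

import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes an LM has already been configured, e.g. dspy.settings.configure(lm=...).

def layer_metric(example, pred, trace=None):
    # Placeholder metric for this one layer only.
    return example.answer == pred.answer

layer_trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

sub_module = dspy.ChainOfThought("question -> answer")
optimizer = BootstrapFewShot(metric=layer_metric, max_bootstrapped_demos=4)
compiled_sub_module = optimizer.compile(sub_module, trainset=layer_trainset)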

CShorten commented 8 months ago

^ Big topic I think -- I would love to at least be able to control the max_bootstrapped_examples per module in optimization. For example, some tasks like reranking or summarizing 20 documents require more input tokens for few-shot examples than say query generation or yes/no / low cardinality output routers, like vector search or text-to-sql as a random example.

I think it's also related to the concept of taking a single component out, optimizing it separately on the incoming inputs and outputs and plugging it back into the broader network with the compiled signature. But then I'm not sure how the new round of compilation impacts those examples -- although I suppose they are stored as predictor.demos() so the compiler probably does have some kind of interface in place.

Worked around this by setting ._compiled = True for those interested.

ishaan-jaff commented 8 months ago

sent a discord request @CyrusOfEden

mgbvox commented 8 months ago

@okhat @CyrusOfEden New to dspy, looking to contribute, but the bounty board is empty and all the discussion seems to be happening here; thought I'd tag you directly. How might I actually contribute here?

peteryongzhong commented 8 months ago

@mgbvox agreed with the static typing aspects!

okhat commented 8 months ago

Just merged a massive PR on this from the amazing @thomasahle : #451

chris-boson commented 8 months ago

Even better would be just using enums and other nested pydantic objects. This works but it should ideally just implicitly call the typed versions when annotations are available, and use function calling to generate structured outputs.

from enum import Enum

import dspy
from dspy.functional import TypedPredictor

class EmotionType(Enum):
    sadness = "sadness"
    joy = "joy"
    love = "love"
    anger = "anger"
    fear = "fear"
    surprise = "surprise"

class Emotion(dspy.Signature):
    sentence: str = dspy.InputField()
    sentiment: EmotionType = dspy.OutputField()

sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"

classify = TypedPredictor(Emotion)
classify(sentence=sentence)
# => Prediction(
#        sentiment=<EmotionType.fear: 'fear'>
#    )

thomasahle commented 8 months ago

@chris-boson You can already do this now, right? I think Cyrus's backend work will allow types to go through function calling, or schema APIs, or whichever way the underlying model prefers.

AndreasMadsen commented 7 months ago

Just my personal opinion, but I would much prefer to see this module adopt an async/await programming model. All of the LLM calls are I/O bound, so the current thread model doesn't make much sense and is harder to debug. It's also much easier to go async -> sync (using asyncio.run()) than the other way around. This would also make it much simpler to throttle the number of parallel calls or load balance, since there is no need to share counter/completion state between threads. Such refactors are often hard to do once a project has matured, so I hope you will consider it.
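
A minimal sketch of that async-first idea (hypothetical names, not DSPy API): the async call is the primitive, the sync path is a thin asyncio.run wrapper, and a semaphore bounds the number of concurrent LM calls.

import asyncio

class AsyncLM:
    """Hypothetical async-first LM with bounded concurrency."""

    def __init__(self, max_parallel: int = 8):
        self._sem = asyncio.Semaphore(max_parallel)

    async def acall(self, prompt: str) -> str:
        async with self._sem:            # throttle parallel calls
            await asyncio.sleep(0.1)     # stand-in for an awaited HTTP request
            return f"completion for {prompt!r}"

    def call_sync(self, prompt: str) -> str:
        # Going async -> sync is just a thin wrapper.
        return asyncio.run(self.acall(prompt))

async def main():
    lm = AsyncLM(max_parallel=2)
    results = await asyncio.gather(*(lm.acall(f"question {i}") for i in range(5)))
    print(results)

asyncio.run(main())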

okhat commented 7 months ago

Async is definitely in popular demand and would be nice to have, but I don’t understand the claim about threading. Threads work great right now for throughput.