okhat opened 8 months ago
It would be great to have the same kind of LM abstraction for RMs.
I would create an RM class, like the existing LM class, that all the different third-party retriever models inherit from, instead of inheriting from the Retrieve module. This would allow advanced retrieval techniques to be built as modules that inherit from the Retrieve module, which would use any RM transparently (it already does, but it is confusing because the RM is currently just another Retrieve module).
Something like ChainOfThought inheriting from Predict, which uses an LM underneath.
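To make the proposal concrete, here is a minimal sketch of what such an RM base class could look like. All names here (RM, DummyRM, the `__call__` signature) are hypothetical illustrations, not the actual DSPy API:

```python
# Hypothetical sketch: a minimal RM base class mirroring the LM abstraction.
# Third-party retrievers would inherit from RM instead of from Retrieve.
from abc import ABC, abstractmethod
from typing import List, Optional


class RM(ABC):
    """Base class all third-party retrieval models would inherit from."""

    def __init__(self, k: int = 3):
        self.k = k

    @abstractmethod
    def __call__(self, query: str, k: Optional[int] = None) -> List[str]:
        """Return the top-k passages for a query."""


class DummyRM(RM):
    """Toy in-memory retriever, used here only to show the interface."""

    def __init__(self, corpus: List[str], k: int = 3):
        super().__init__(k=k)
        self.corpus = corpus

    def __call__(self, query: str, k: Optional[int] = None) -> List[str]:
        k = k or self.k
        # Naive scoring: count words shared between query and passage.
        scores = [(len(set(query.split()) & set(p.split())), p) for p in self.corpus]
        return [p for _, p in sorted(scores, reverse=True)[:k]]
```

A Retrieve module could then accept any RM instance and stay agnostic to the backing vector database.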
Yes, RMs too but RMs can honestly just be function calls so it's easier for people to deal with it now.
Actually CoT and Predict should be dspy.Modules. CoT shouldn't inherit from Predict, that's a bad old decision that we'll change.
Sounds perfect! I was wondering if we can shift the Example, Prediction, and Completions classes to Pydantic.
Tensors are the only dtype in PyTorch; similarly, Example could be the one dtype here, and everything the other two classes do could internally be wrapped in class methods.
This would be a very big migration, though, and possibly not even backwards compatible. So we might want to think on this.
I disagree with that @krypticmouse . Predictions are already Examples anyway.
I'm for using as much Pydantic as we can here
Indeed, Predictions are basically Examples; they do have the from_completion method that Examples don't. That doesn't make much difference, but yeah, I thought it could become a class method of a Pydantic model.
Not a major issue tbh though, just a thought :)
Mostly just for better organization and readability.
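For illustration, a rough sketch of what Example/Prediction as Pydantic models might look like, with the completion constructor as a classmethod. Field names and signatures here are assumptions, not DSPy's actual API (this assumes Pydantic v2):

```python
# Hedged sketch: Example/Prediction as Pydantic models.
# "from_completions" and the storage of completions are illustrative choices.
from typing import List
from pydantic import BaseModel, ConfigDict, PrivateAttr


class Example(BaseModel):
    # Examples hold arbitrary user-defined fields.
    model_config = ConfigDict(extra="allow")


class Prediction(Example):
    _completions: List[dict] = PrivateAttr(default_factory=list)

    @classmethod
    def from_completions(cls, completions: List[dict]) -> "Prediction":
        # Expose the first completion's fields; keep the rest for inspection.
        pred = cls(**completions[0])
        pred._completions = completions
        return pred

    @property
    def completions(self) -> List[dict]:
        return self._completions
```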
Reliably keeping within token limits when compiling and running programs, without having to set the module config for a specific LM, is big for me to be able to deploy this to production. Ideally, the config could stay pretty much the same if you move from a 32k context to an 8k context: you'd just recompile and it would automatically use fewer or shorter demos and whatever else it needed to.
My initial thoughts are that this has two main elements:
The distinction between the two elements is because it's not just for compiling that this would be useful. When we create a module or program, it'd be good to be able to estimate tokens, so you can do things like limit the amount of context given by retrieval.
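As a toy sketch of that second element, budget-aware context packing could look something like this. The 4-characters-per-token heuristic stands in for a real tokenizer (e.g. tiktoken), and all names here are illustrative:

```python
# Sketch: fit as many retrieved passages as a token budget allows.
# estimate_tokens is a crude stand-in for a real tokenizer.
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)


def pack_context(passages: List[str], budget: int) -> List[str]:
    kept, used = [], 0
    for p in passages:
        cost = estimate_tokens(p)
        if used + cost > budget:
            break  # stop before exceeding the budget
        kept.append(p)
        used += cost
    return kept
```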
I want to echo your point @okhat about the instructions/prompts in modules. I think right now they are a little spread out in the code, as strings in various places that are sometimes appended together. If that could be elevated in terms of abstractions and/or made clearer, it might even make it easier to analyse a module and potentially perform some interesting transformations on it later down the line. I don't think we need to go as far as the prompting-first abstractions that LangChain offers, and prompting is not something we can completely divorce this from, but handling it in a more organised fashion that allows for future analysis could be useful?
Integrating 4 (optimizers) in the thinking early on might be necessary, since they are what put the biggest strain on the API. We need to think about what features they require to be available, such as Predict classes etc. that satisfy those. Assertions are another example of an "involved" feature that needs a lot of support, but hopefully not a lot of special casing. Right now there's the whole new_signature keyword argument that gets used sometimes, and seems to have been introduced specifically for Retry to use.
Hey team, some notes so far:
BayesianSignatureOptimizer
I like the idea of extending RMs to be more than function calls, but I do think that interfacing, for example, the Weaviate Python client with the module's forward pass will probably work fine for a while.
Keeping within token limits sounds amazing. The LMs have an internal max_tokens state that you could probably just multiply by the upper bound on the number of calls in your module's forward passes. Compiling is another story; I don't know enough about DSPy yet to comment on it.
Still have a couple more responses to read; will update with mentions.
I'll try to kick off the backend refactor Saturday, if not, @isaacbmiller is down to have the first part ready by Tuesday–Wednesday of next week
Recently, Ollama released an OpenAI-compatible API. Other companies like Mistral AI also offer APIs that follow OpenAI specifications. Additionally, there are projects like LiteLLM that provide API-compatibility layers (e.g., using proxies).
So I think the LM abstraction could potentially just be a single thin wrapper around the OpenAI API specification. Is this a viable option?
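As a sketch of how thin that wrapper could be, the following targets the OpenAI chat-completions request shape using only the standard library. Any OpenAI-compatible server (Ollama, Mistral, a LiteLLM proxy) would differ only in base_url; the class and method names here are hypothetical:

```python
# Sketch of a thin LM wrapper around the OpenAI chat-completions spec.
import json
import urllib.request


class OpenAICompatibleLM:
    def __init__(self, model: str, base_url: str, api_key: str = "", **default_kwargs):
        self.model = model
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.default_kwargs = default_kwargs  # e.g. temperature, max_tokens

    def build_request(self, prompt: str, **kwargs) -> dict:
        # Per-call kwargs override the defaults set at construction time.
        body = {**self.default_kwargs, **kwargs}
        body.update(model=self.model, messages=[{"role": "user", "content": prompt}])
        return body

    def __call__(self, prompt: str, **kwargs) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=json.dumps(self.build_request(prompt, **kwargs)).encode(),
            headers={"Content-Type": "application/json",
                     "Authorization": f"Bearer {self.api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping providers then becomes a matter of configuration, e.g. `OpenAICompatibleLM("llama3", "http://localhost:11434/v1")` for a local Ollama server.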
@S1M0N38 you would still need the thin wrapper though, to pass in optional arguments with kwargs.
@S1M0N38 @CShorten just came across LiteLLM today — and it seems like a home run for inference (not fine-tuning). Am I missing anything?
@S1M0N38 you would still need the thin wrapper though, to pass in optional arguments with kwargs.
@CShorten what are the optional kwargs that differ from provider to provider that are needed by DSPy? (e.g. I think temperature is one of those, needed to control the amount of prompt "exploration", isn't it?) Here, for example, are the input params that LiteLLM supports for different providers.
... just came across LiteLLM today — and it seems like a home run for inference (not fine-tuning). Am I missing anything?
@CyrusOfEden I believe you're correct, but upon examining the code in dsp/modules/[gpt3|cohere|ollama].py, it appears that the only requests being made are HTTP requests to the inference endpoint, namely /completion, chat/completion, /api/generate, api/chat, etc. These are all inference requests for text. Could you elaborate on the fine-tuning you mentioned?
I'm not entirely familiar with the inner workings and requirements of this framework, so everything I've mentioned may not be feasible. Therefore, please take my statements with a grain of salt. In my opinion, for a project like this, it's best to focus on core concepts rather than implementing numerous features. The idea is to defer those to other libraries or leave them to the user to implement, guided by high-quality documentation.
@S1M0N38 I think the way multiple generations are sampled -- for example, Cohere has num_generations but the google.generativeai API has no such option. Probably little nuances like this, but the chart you shared is great; I trust your judgment on this.
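As a toy illustration of the kind of kwarg translation such a wrapper would do, one could map a provider-neutral "number of generations" setting onto each provider's own parameter name. The mapping below is illustrative, not exhaustive, and not LiteLLM's implementation:

```python
# Sketch: translate a neutral "n" kwarg to provider-specific parameter names.
PROVIDER_N_PARAM = {
    "openai": "n",
    "cohere": "num_generations",
}


def translate_kwargs(provider: str, **neutral) -> dict:
    out = dict(neutral)
    if "n" in out and provider in PROVIDER_N_PARAM:
        # Rename the neutral key to whatever this provider expects.
        out[PROVIDER_N_PARAM[provider]] = out.pop("n")
    return out
```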
@S1M0N38 yup, LiteLLM would be good for inference — and currently LMs don't have a .finetune method but we want that later.
I'm new to this library, I'd love to see more support for production deployment, my prioritized wish list would be:
@bencrouse those are definitely on the roadmap — I think the focus right now is reliability / API / typed outputs / everything Omar originally mentioned and then afterwards we want to do some thinking through what it means to productionize this (async/streaming/deployment/etc.)
+1 for supporting all OpenAI-compatible local LLM servers for inference, not just Ollama. I think this will increase adoption, because a lot of LM "application developers" who are not ML experts use tools like LM Studio, GPT4All, etc.
Hi I'm the litellm maintainer - what's missing in litellm to start using it in this repo ? @CyrusOfEden @S1M0N38
Happy to help with any issues / feature requests - even minor ones
Hi I'm the litellm maintainer - what's missing in litellm to start using it in this repo ? @CyrusOfEden @S1M0N38
Happy to help with any issues / feature requests - even minor ones
Tbh I think you're good for now, great to have your support 🔥
@ishaan-jaff how does tool use work with models that don't necessarily support it?
Would be really cool if I could use LiteLLM for tool use for whatever model -- is there a table for tool use support somewhere?
Separately, is LiteLLM able to integrate something like Outlines to support tool use for models that don't natively support it?
Interesting question, I suspect digging into how dspy.ReAct implements the dspy.Retrieve tool could be a good start to understanding how to interface all tools @CyrusOfEden.
Maybe this is the argument for why these tools should be integrated deeper into DSPy than externally used as calls to the forward pass (or there is some kind of API contract with the ReAct for passing in arbitrary functions as well).
Yeah I’m in favor of minimal abstractions around tools. I think a tool being just a function whose docstring and arguments and name can be used by ReAct can achieve this
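A minimal sketch of that idea: derive everything ReAct would need (name, description, argument names) from the function itself via introspection. The helper and the example tool below are hypothetical, not part of DSPy:

```python
# Sketch: a "tool" is just a function; its metadata comes from introspection.
import inspect
from typing import List


def describe_tool(fn) -> dict:
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "args": list(sig.parameters),
    }


def search_wikipedia(query: str, k: int = 3) -> List[str]:
    """Search Wikipedia and return the top-k passage titles."""
    return []  # placeholder body; only the metadata matters for this sketch
```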
The coupling between DSPy and "tools", in the ReAct sense, should be as light as possible in my opinion. There could be some annotations that make optimisation easier, however; for instance, to enable backtracking.
More generally, I feel like some of the coupling with other technologies like RMs may be a little too strong at the moment from a software engineering perspective. I understand the reason behind it, as it unlocks a lot of cool things like retrieveEnsemble, but it does feel a little specific at the moment.
My vague intuition is that it would be great if DSPy could have more generic and well-defined boundaries when it comes to the backend components (hopefully backed by classes and types, and less so by magic strings/attributes). This comment might betray a personal dislike of "duck" typing, but regardless, a more defined internal operating mechanics/schema could make future development of passes and features a lot less burdensome.
Yeah I’m in favor of minimal abstractions around tools. I think a tool being just a function whose docstring and arguments and name can be used by ReAct can achieve this
LiteLLM on it — really liking what I'm finding in this repo — they have a util for converting from a Python function to a JSON schema tool def [0]
@ishaan-jaff how does tool use work with models that don't necessarily support it?
We switch on JSON mode if the provider supports it, and we add the function / tools to the prompt and then get a response. You would have to explicitly enable this: litellm.add_function_to_prompt = True
Does this answer your question?
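For illustration, the fallback described above might look roughly like this when done by hand. The function below is a hypothetical sketch, not LiteLLM's implementation:

```python
# Sketch: when a provider has no native tool support, serialize the tool
# definitions into the prompt text itself.
import json
from typing import List


def add_tools_to_prompt(prompt: str, tools: List[dict]) -> str:
    header = "You can call these tools by replying with JSON:\n"
    specs = "\n".join(json.dumps(t) for t in tools)
    return f"{header}{specs}\n\nUser: {prompt}"
```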
@CShorten, @okhat Currently, in WeaviateRM, the Weaviate client is still from Weaviate v3. Are there any plans in the current roadmap to update it to Weaviate Python Client (v4)? If not, should we add it?
Hey @ovshake! Yes, planning on updating that soon, but please feel free to take it on if interested; we would really appreciate it!
On a side note related to the WeaviateRM, @okhat. I am preparing a quick set of demos on the discussion above about how to use Weaviate as an external tool, and how to then interface the functions with docstrings for ReAct. Want to get this nugget in there for the docs.
Thanks @CShorten , I will definitely take a stab at it and open a PR :)
@CyrusOfEden @okhat I started looking into a LiteLLM integration as a starting place for a nicer universal integration. We might be blocked from getting an OpenAI version running until we upgrade to > OpenAI v1.0.0 (#403 is merged). I'm still going to try to see if I can get an Ollama version to work.
Hello, could we also add to the roadmap a feature that allows loading a compiled/optimized module with its assertions and inferences? If that's already supported, please share a notebook. I can propose a notebook if I get the code snippet for this. Thank you!
Question that's come up in a few different places.
With the LiteLLM integration, would we sunset all provider specific classes (OpenAI, Cohere etc), and direct everyone to use the LiteLLM interface?
@ishaan-jaff can you add me on Discord? cyrusofeden
Wanna chat about this DSPy integration
Hi, Omar asked me to tag you @CyrusOfEden . Here is a notebook where I use LangChain's gpt model and am able to use it with native DSPy predict function: [Google Colab]. Please let me know if you have questions
If the trainset and metric function were tied to the module (via the predict function for each module), the teleprompters could optimize each and every layer (module) in a multi-module app. Off the top, this does not seem like a difficult thing to implement.
I realize assertions and suggestions are a step in this direction, but I don't think they are a replacement for optimizing each layer individually.
Thanks @dharrawal, I wrote some thoughts on Discord:
I might be missing some of the point you're trying to make, but in the general case, it's not possible for people to specify a metric on each module. The goal of a good optimizer (like in RL problems also) is to figure out good intermediate supervision. There are many ways to achieve that, some simple and others more complex, but the consistent thing is that optimizers will prefer intermediate steps that maximize the eventual metric at the end. If you want to optimize each layer separately with a metric, you can just compile each module alone, but that's rarely the needed use case.
^ Big topic I think -- I would love to at least be able to control the max_bootstrapped_examples per module in optimization. For example, some tasks like reranking or summarizing 20 documents require more input tokens for few-shot examples than say query generation or yes/no / low cardinality output routers, like vector search or text-to-sql as a random example.
I think it's also related to the concept of taking a single component out, optimizing it separately on the incoming inputs and outputs, and plugging it back into the broader network with the compiled signature. But then I'm not sure how the new round of compilation impacts those examples -- although I suppose they are stored as predictor.demos(), so the compiler probably does have some kind of interface in place.
Worked around this by setting ._compiled = True, for those interested.
sent a discord request @CyrusOfEden
@okhat @CyrusOfEden New to DSPy, looking to contribute, but the bounty board is empty and all the discussion seems to be happening here; thought I'd tag you directly. How might I actually contribute here?
@mgbvox agreed with the static typing aspects!
Just merged a massive PR on this from the amazing @thomasahle : #451
Even better would be just using enums and other nested pydantic objects. This works but it should ideally just implicitly call the typed versions when annotations are available, and use function calling to generate structured outputs.
```python
from enum import Enum

import dspy
from dspy.functional import TypedPredictor


class EmotionType(Enum):
    sadness = "sadness"
    joy = "joy"
    love = "love"
    anger = "anger"
    fear = "fear"
    surprise = "surprise"


class Emotion(dspy.Signature):
    sentence = dspy.InputField()
    sentiment: EmotionType = dspy.OutputField()


sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"
classify = TypedPredictor(Emotion)
classify(sentence=sentence)
# Prediction(
#     sentiment=<EmotionType.fear: 'fear'>
# )
```
@chris-boson You can already do this now, right? I think Cyrus backend work will allow types to go through function calling, or schema APIs or whichever way the underlying model prefers.
Just my personal opinion, but I would much prefer to see this module adopt an async-await programming model. All of the LLM calls are I/O-bound, so the current thread model doesn't make much sense and is harder to debug. It's also much easier to go async -> sync (using asyncio.run()) than the other way around. This would also make it much simpler to throttle the number of parallel calls or load balance, since there is no need to share a counter/completion state between threads. Such refactors are often hard to do once a project has matured, so I hope you will consider it.
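A small sketch of the throttling point: an asyncio.Semaphore caps concurrent calls with no shared state between threads. fake_lm_call here is an assumed stand-in for a real HTTP request to an LM provider:

```python
# Sketch: throttled concurrent LM calls with asyncio, no thread-shared counters.
import asyncio
from typing import List


async def fake_lm_call(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulated network I/O
    return prompt.upper()


async def run_all(prompts: List[str], max_concurrency: int = 4) -> List[str]:
    sem = asyncio.Semaphore(max_concurrency)

    async def one(p: str) -> str:
        async with sem:  # at most max_concurrency calls in flight
            return await fake_lm_call(p)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(one(p) for p in prompts))
```

A synchronous entry point then reduces to `asyncio.run(run_all(prompts))`.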
Async is definitely in popular demand and would be nice to have, but I don’t understand the claim about threading. Threads work great right now for throughput.
DSPy has a small number (maybe 5-6) of extremely powerful concepts that have grown organically over the past year as open source.
Internally, it's time for a major refactor that will simplify things greatly and make sure everything works more smoothly. I have received a lot of interest from the community to contribute to this, so we just need to make sure the goals and milestones are clearly defined.
Potential leaders in this space include @CyrusOfEden @stalkermustang (and from the internal DSPy side possibly @krypticmouse and @arnavsinghvi11, but I haven't checked with them), and there have been a lot of in-depth shots at this from @thomasahle, so I'm curious if he's also interested broadly in this.
Starting this issue just to collect the necessary tasks and prioritize them in the right dependency order.
Off the top of my head, I think we have to have:
Cleaner LM abstraction that requires a lot less work to maintain and is clearer about the boundaries. The amazing @CyrusOfEden has already defined this on paper. This will include cleaner "Backend" abstraction, which is a bit of a wrapper around LMs that does template adaptation and other things that Cyrus has in mind.
Cleaner Signature abstraction. I think the push by @thomasahle here is perfectly on the right track. Types, and immutable signatures. We just need to make more decisions about how far down Pydantic/Types we go, and how far down, say, SGLang we go, or having our own logic, etc. I do like outsourcing parsing and type logic, but we need to make sure it doesn't break existing features.
Cleaner Modules. This is actually easy but needs to be streamlined. Predict and CoT need to be Modules. And they need to store instructions (not leave that to the signature). They need to handle multiple outputs more consistently. This can be any of us really, esp @arnavsinghvi11, me, @krypticmouse, @thomasahle if any of the folks is interested.
Cleaner Optimizers. Well, we can leave this to me and it's a final easy step once signatures and modules are good.
More guidance on common patterns in the docs. We now finally have docs for all the key components and we have plenty of individual examples, but we do not have enough guidance on the common e2e workflows. This also partly includes clear guidance on what people should do with local LMs: Ollama for CPU; TGI, SGLang, or vLLM(?) for GPUs. What about quantization, etc.
Thoughts? Other things we need?
Dumping list of stuff: