MohammedAlhajji opened this issue 2 weeks ago
Thanks a lot @MohammedAlhajji ! How are you thinking of streaming? How would you want to stream if you have multiple output fields in your DSPy signature?
Sorry for the late response, work has been hectic. My thesis is that streaming requirements come from a desire to improve perceived latency for the user. Streaming for other use cases seems like it'd yield (pun intended) more headache than benefit. We can create a streaming flag in `dspy.LM` that takes a boolean or the string "all".
1- `streaming=False` is the default and what we have now.
2- If `streaming=True`, then we return the object returned by litellm as is. It's the developer's problem to extract what's needed out of it.
Maybe we can also create a utility function that lets the developer pick which field gets streamed to the user. It would just check for the field name and stream whatever comes after it until it finds another field name, at which point it ends the stream.
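A minimal sketch of what that utility could look like, assuming the completion marks each output field with a `Field Name:` header on its own line (the exact delimiter depends on the prompt format DSPy uses); `stream_field` and its arguments are hypothetical names:

```python
from typing import Iterator

def stream_field(chunks: Iterator[str], field: str, all_fields: list[str]) -> Iterator[str]:
    """Hypothetical utility: yield only the text belonging to `field`.

    Assumes each output field is introduced by a `Field:` header on its
    own line; adjust the delimiters to match your actual prompt format.
    """
    headers = {f"{name}:" for name in all_fields}
    target = f"{field}:"
    inside = False
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Split off complete lines; keep the trailing partial line buffered.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            stripped = line.strip()
            if stripped.startswith(target):
                inside = True
                rest = stripped[len(target):].lstrip()
                if rest:
                    yield rest
            elif any(stripped.startswith(h) for h in headers):
                if inside:
                    return  # the next field started, so end the stream
            elif inside:
                yield line + "\n"
    if inside and buffer:
        yield buffer  # flush whatever remains of the final field
```

Then something like `for tok in stream_field(raw_chunks, "Answer", ["Reasoning", "Answer"]): print(tok, end="")` would surface only the answer.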
Thanks a lot @MohammedAlhajji ! This is very helpful.
I would prefer that we go beyond just streaming in `dspy.LM`, since that is easy to accomplish by just calling litellm directly. In other words, `dspy.Predict` would remain unable to stream if we did this.
Instead, here's a proposal and you let me know if it fulfills your use cases:
What if we support streaming in `dspy.Predict`, but only when there's a single output field of type `str`?
For example, you can do this:
```python
dspy.Predict('context, question -> answer: str', stream=True)
```

and then you get the `answer` streamed in some way. However, you cannot do these:

```python
dspy.Predict('context, question -> answer: float', stream=True)
dspy.Predict('context, question -> reasoning: str, answer: str', stream=True)
dspy.ChainOfThought('context, question -> answer: str', stream=True)
```
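For the supported case, here is one purely hypothetical shape the consuming side could take (neither `stream=True` on `dspy.Predict` nor the generator return exists in DSPy today; this is only to make "streamed in some way" concrete):

```python
import dspy

# Hypothetical: with stream=True and a single str output field, the call
# could return a generator of answer tokens instead of a finished Prediction.
predict = dspy.Predict('context, question -> answer: str', stream=True)

for token in predict(context="Paris is the capital of France.",
                     question="What is the capital of France?"):
    print(token, end="", flush=True)
```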
I restrict this to string because other fields need validation (e.g. Pydantic).
Actually, if we don't have to worry about retries or bad outputs, it's technically possible to also support streaming for multiple string fields. It's also possible to support streaming any output fields, but only stream non-`str` fields once the field value is fully completed. But we shouldn't start there.
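For intuition, here is a minimal sketch of that deferred-validation idea, assuming we already know each field's declared type; the `emit_field` helper is hypothetical:

```python
from typing import Any, Iterable, Iterator
from pydantic import TypeAdapter

def emit_field(field_type: type, parts: Iterable[str]) -> Iterator[Any]:
    """Hypothetical: stream str fields token by token, but buffer any
    non-str field and validate it only once its value is complete."""
    if field_type is str:
        # Safe to surface immediately; no validation needed.
        yield from parts
    else:
        # e.g. float or a Pydantic model: a partial value can't be
        # validated, so collect the whole thing first.
        complete = "".join(parts)
        yield TypeAdapter(field_type).validate_python(complete)
```

This is also why starting with a single `str` field sidesteps the validation question entirely.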
Concretely, here's the question to everyone reading this:
Q: If we support streaming ONLY for use cases like `dspy.Predict('.... any inputs here ... -> one_field_only: str', stream=True)`, will that be good enough?
Just to throw it out there, the best case would be to support streaming for certain fields and not others. For example, I want to use reasoning, but I don't actually care about the output of the reasoning, while I do care about the streamed `answer` for perceived latency, as @MohammedAlhajji mentions.
More realistically, it would be super useful to just stream the entire output of multi-field completions and let me parse the fields as they come in (even better, dspy does the parsing). I understand this could lead to bad outputs, but perhaps a flag like `dangerouslyStream` could remediate this.
Supporting streaming for predicted strings would be a good start, but supporting streaming for multi-field completions would be powerful for my use case.
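As a sketch of that multi-field case: assuming the same `Field:` line headers as before, a routing generator could tag each streamed piece with the field it belongs to, so the caller forwards some fields and drops others (all names here are hypothetical):

```python
from typing import Iterator, Tuple

def route_fields(chunks: Iterator[str], field_names: list[str]) -> Iterator[Tuple[str, str]]:
    """Hypothetical: yield (field_name, text) pairs as a multi-field
    completion streams in, so the caller can filter per field."""
    headers = {f"{name}:": name for name in field_names}
    current = None
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        *lines, buffer = buffer.split("\n")  # keep the partial last line buffered
        for line in lines:
            header = next((h for h in headers if line.strip().startswith(h)), None)
            if header:
                current = headers[header]
                rest = line.strip()[len(header):].lstrip()
                if rest:
                    yield current, rest
            elif current:
                yield current, line + "\n"
    if current and buffer:
        yield current, buffer

# Usage: surface only `answer`, silently consume `reasoning`.
# for field, text in route_fields(raw_chunks, ["reasoning", "answer"]):
#     if field == "answer":
#         print(text, end="", flush=True)
```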
One field (e.g. `answer`) would be sufficient for my current cases - there is strong demand for it from users.
In my current setup, I write everything in DSPy, then I extract the prompt from the dspy module. Then, I use that prompt with litellm to stream the output to the user (if the module is chain of thought, I only stream the last output key). I really don't like the hackiness of this and would love to have streaming within dspy.
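For readers who want the same workaround, here is roughly what it looks like. The extracted prompt strings are placeholders (how you pull them out of the module depends on your DSPy version); only the litellm streaming call itself is real API:

```python
import litellm

# Placeholders standing in for the prompt extracted from the DSPy module.
extracted_system_prompt = "..."  # the instructions DSPy built for the task
extracted_user_prompt = "..."    # the formatted inputs for this call

response = litellm.completion(
    model="openai/gpt-4o-mini",  # any litellm-supported model string
    messages=[
        {"role": "system", "content": extracted_system_prompt},
        {"role": "user", "content": extracted_user_prompt},
    ],
    stream=True,
)

# litellm's streaming chunks mirror the OpenAI format.
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)  # or filter through a field router first
```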
I know streaming support is on the roadmap, and we've moved to LiteLLM, which makes the interface with the LLM simpler. Is there any current work on streaming support? I am willing to contribute if it's not being worked on yet. It doesn't seem like too many changes, but maybe I'm not fully aware of all the details needed to implement it.