stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

[WIP] Major refactor roadmap #390

Open okhat opened 8 months ago

okhat commented 8 months ago

DSPy has a small number (maybe 5-6) of extremely powerful concepts that have grown organically over the past year as open source.

Internally, it's time for a major refactor that will simplify things greatly and make sure everything works more smoothly. I have received a lot of interest from the community to contribute to this, so we just need to make sure the goals and milestones are clearly defined.

Potential leaders in this space include @CyrusOfEden and @stalkermustang (and from the internal DSPy side possibly @krypticmouse and @arnavsinghvi11, but I haven't checked with them), and there have been a lot of in-depth shots at this from @thomasahle, so I'm curious if he's also interested broadly in this.

Starting this issue just to collect the necessary tasks and prioritize them in the right dependency order.

Off the top of my head, I think we have to have:

  1. Cleaner LM abstraction that requires a lot less work to maintain and is clearer about the boundaries. The amazing @CyrusOfEden has already defined this on paper. This will include a cleaner "Backend" abstraction, which is a bit of a wrapper around LMs that does template adaptation and other things that Cyrus has in mind.

  2. Cleaner Signature abstraction. I think the push by @thomasahle here is perfectly on the right track: types, and immutable signatures. We just need to make more decisions about how far down the Pydantic/types route we go, and how far down, say, SGLang we go, versus having our own logic, etc. I do like outsourcing parsing logic and type logic, but we need to make sure it doesn't break existing features. (A rough sketch of what this could look like follows the list below.)

  3. Cleaner Modules. This is actually easy but needs to be streamlined. Predict and CoT need to be Modules, and they need to store instructions (not leave that to the signature). They also need to handle multiple outputs more consistently. This can be any of us really, especially @arnavsinghvi11, me, @krypticmouse, or @thomasahle, if any of the folks is interested.

  4. Cleaner Optimizers. Well, we can leave this to me and it's a final easy step once signatures and modules are good.

  5. More guidance on common patterns in the docs. We now finally have docs for all the key components and we have plenty of individual examples, but we do not have enough guidance on the common e2e workflows. This also partly includes clear guidance on what people should do with local LMs: ollama for CPU; TGI, SGLang, or vLLM(?) for GPUs? What about quantization, etc.?
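
For item 2, a rough sketch of what typed, immutable signatures could look like, assuming the class-based Signature API with annotated fields from the typed-signature push; the instructions= argument in the commented-out last line is purely hypothetical and only illustrates item 3 (modules owning their instructions):

import dspy

class Summarize(dspy.Signature):
    """Summarize the sentence in one short line."""
    sentence: str = dspy.InputField()
    summary: str = dspy.OutputField()

# The signature class stays immutable; building a module from it should not
# mutate it.
summarize = dspy.Predict(Summarize)

# Hypothetical (item 3): the module, not the signature, stores its
# instructions, so prompt tweaks never touch the signature class above.
# summarize = dspy.Predict(Summarize, instructions="Be terse.")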

Thoughts? Other things we need?

  1. Production concerns? Streaming. Async.
  2. Docstrings.
  3. Various individual issues.

Dumping list of stuff:

  1. Assertions now has an open issue to collect things needing improvement
  2. Don't fail silently when a kwarg is forgotten or passed to Predict with an incorrect name (a small sketch of the failure mode follows below).
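
A small sketch of the failure mode in item 2: the misspelled keyword shows the kind of call that currently slips through, and the guard is a hypothetical illustration of the desired behaviour (it assumes the signature exposes its inputs as signature.input_fields, which may not be the exact attribute name):

import dspy

qa = dspy.Predict('question -> answer')

# Misspelled keyword: a call like this can currently go through without an
# error, leaving the real 'question' input empty in the prompt.
# pred = qa(quesiton="What is the capital of France?")

# Hypothetical guard sketching the desired behaviour:
def checked_call(predictor, **kwargs):
    expected = set(predictor.signature.input_fields)
    unknown = set(kwargs) - expected
    if unknown:
        raise ValueError(f"Unknown input field(s) {unknown}; expected {expected}")
    return predictor(**kwargs)
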
CyrusOfEden commented 7 months ago

@okhat it's more for deployment:

when you're using threading to compile / run once, it doesn't make all that big a difference

in production, async uses way less memory

chris-boson commented 7 months ago

> @chris-boson You can already do this now, right? I think Cyrus's backend work will allow types to go through function calling, or schema APIs, or whichever way the underlying model prefers.

@thomasahle Yes, it works, just a bit cumbersome to explicitly call the typed versions. Good to know we're working on integrating function calling!

Generally, having type annotations / Pydantic flow through the entire stack would make it significantly more useful when interacting with APIs or other traditional software systems where structured output is important. Also, types constrain the problem space and can be checked with something like mypy to uncover many issues early. I think that would mesh very well with the idea of "programming" LLMs. The direction instructor is going in looks very promising. (A small sketch of this kind of typed flow follows.)
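
A small sketch of that kind of typed flow, assuming the TypedPredictor interface from the typed-predictor work (it may live under dspy.functional depending on version); the Invoice model and field names are made up for illustration:

import dspy
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    vendor: str
    total_usd: float = Field(description="Total amount in US dollars")

class ParseInvoice(dspy.Signature):
    """Extract structured invoice data from raw text."""
    text: str = dspy.InputField()
    invoice: Invoice = dspy.OutputField()

parser = dspy.TypedPredictor(ParseInvoice)

# The output is a Pydantic object, so downstream code (and mypy) can rely on
# result.invoice.total_usd being a float instead of re-parsing free text.
# result = parser(text=raw_invoice_text)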

AndreasMadsen commented 7 months ago

@okhat I think you are assuming a single LLM server, in which case each thread makes one call to the server at a time, and then you can easily synchronize the throttle. However, if you are doing requests to different LLM servers, then that won't work. That becomes relevant for both load balancing and when different models are used in the same pipeline.

Consider this example. In the run_sync case, summary and reasoning are computed one after the other, when they could be computed simultaneously. Running them simultaneously is hard to do with the thread model but easy with the async-await model (I removed the with statements for simplicity, but you could keep them while still using async-await). You also don't need to worry about thread-safety, etc., since there is just a single thread.

Of course, summary + reasoning and sentiment are still separated. So to fully saturate the inference server, the number of parallel tasks (run_async) would need to be greater than the throttle threshold, and the throttle needs to be implemented in the Client, not the Evaluator, but that would also be the case in the thread model. Additionally, this would make it possible to have different throttles for different clients.

import asyncio

import dspy

# bart, llama, and t5 are LM clients and `sentence` is the input text, all
# assumed to be defined elsewhere.

summary = dspy.Predict('sentence -> summary')
reasoning = dspy.Predict('sentence -> reasoning')
sentiment = dspy.Predict('summary, reasoning -> sentiment')

def run_sync():
  # Each step blocks until the previous one has finished.
  with dspy.context(lm=bart):
    s = summary(sentence=sentence)
  with dspy.context(lm=llama):
    r = reasoning(sentence=sentence)
  with dspy.context(lm=t5):
    return sentiment(summary=s, reasoning=r)

async def run_async():
  # Proposed API: summary and reasoning run concurrently, each with its own LM
  # passed per call instead of via a context manager.
  s, r = await asyncio.gather(
    summary(sentence=sentence, lm=bart),
    reasoning(sentence=sentence, lm=llama)
  )
  with dspy.context(lm=t5):
    return await sentiment(summary=s, reasoning=r)

As a side note, threads are also hard to debug because you can get intermingled print statements. That is never the case with async-await.

CyrusOfEden commented 7 months ago

@AndreasMadsen I agree that async is the way to go to make DSPy more useful in production settings, and more elegant. In the meantime, might I recommend this way forward? I understand it's not 100% what you're looking for, but maybe it unblocks you.

Here's an example of what you could do:

import asyncio

import dspy
from asgiref.sync import sync_to_async

program_a = ...  # any DSPy program, compiled or not
program_b = ...

# Wrap the synchronous DSPy programs so they can be awaited; thread_sensitive=False
# lets them run in a thread pool and therefore overlap.
async_program_a = sync_to_async(thread_sensitive=False)(program_a)
async_program_b = sync_to_async(thread_sensitive=False)(program_b)

# Inside an async function / running event loop:
with dspy.context(lm=x):
    a, b = await asyncio.gather(
        async_program_a(*args, **kwargs),
        async_program_b(*args, **kwargs)
    )

I like (and share) your idea of passing the LM directly to the program, because right now the code above is (probably) not compatible with different LMs per concurrent program. I'll noodle on that further as part of the backend refactor :-)

drawal1 commented 7 months ago

Modules are an area that needs refactoring, for sure. The problem is that dspy modules conflate the hosting provider (Bedrock, Sagemaker, Azure, ...) with the model (Mixtral8x7B, Haiku, ...).

A related issue is that DSPy prompts are currently model-agnostic, but the best results do require a model-aware prompt. People have pointed this out on Discord or come up with various hacks to simulate system prompts etc.

Is this work in scope? I see 'LiteLLM' posts here that seem related, but not quite the same thing.

The simplest solution here would be for modules to be per-model and take a "hosting provider" object in the constructor, OR for modules to be per-hosting-provider and take a "model" object in the constructor. (A rough sketch of the first option follows.)
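
Everything in the sketch below is hypothetical: the class names (BedrockProvider, Mixtral8x7B) are made up purely to illustrate splitting "where the model is hosted" from "which model it is", and none of it is existing DSPy API:

class BedrockProvider:
    """Hosting provider: owns credentials, region, and the HTTP endpoint."""
    def __init__(self, region: str):
        self.region = region

    def complete(self, formatted_prompt: str) -> str:
        ...  # send the request to the provider's endpoint


class Mixtral8x7B:
    """Model: knows its own chat template and system-prompt handling."""
    def __init__(self, provider: BedrockProvider):
        self.provider = provider

    def __call__(self, prompt: str) -> str:
        formatted = f"[INST] {prompt} [/INST]"  # model-aware formatting
        return self.provider.complete(formatted)


lm = Mixtral8x7B(provider=BedrockProvider(region="us-east-1"))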

KCaverly commented 7 months ago

To catch this thread up, we've got a new backend infrastructure up and basically ready to review/merge. This should offer the following:

drawal1 commented 6 months ago

@KCaverly - Arize Phoenix depends on all LM classes being under the 'dsp' module. Something to consider testing before the merge.

den-run-ai commented 6 months ago

I am interested in integration with LLM-based optimization techniques such as OPRO:

https://github.com/google-deepmind/opro

krypticmouse commented 6 months ago

@den-run-ai I think COPRO is loosely based on OPRO, extended to work for multi-stage pipelines. (A rough usage sketch follows.)
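
For reference, a rough sketch of wiring up COPRO, assuming the dspy.teleprompt.COPRO interface; the program, metric, and trainset are placeholders, and argument names may differ slightly across versions:

import dspy
from dspy.teleprompt import COPRO

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought('question -> answer')

    def forward(self, question):
        return self.generate(question=question)

def exact_match(example, prediction, trace=None):
    # Placeholder metric for illustration.
    return example.answer == prediction.answer

trainset = []  # fill with dspy.Example(question=..., answer=...).with_inputs('question')

copro = COPRO(metric=exact_match, breadth=10, depth=3, init_temperature=1.4)
# COPRO proposes and refines instructions for each predictor in the program,
# which is roughly the multi-stage generalization of OPRO-style prompt search.
compiled_qa = copro.compile(
    QA(),
    trainset=trainset,
    eval_kwargs=dict(num_threads=8, display_progress=True),
)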

vilkinsons commented 4 months ago

> JSON mode for applicable models.

@KCaverly Is there any publicly available info about this? Would be happy to help test this capability in the context of Claude 3/GPT-4 family models. Thank you!

glesperance commented 4 months ago

@vilkinsons there's this branch https://github.com/stanfordnlp/dspy/tree/backend-refactor

glesperance commented 2 months ago

@KCaverly is this still in progress or has the goal moved away from this refactor?

hawktang commented 1 month ago

I saw the new dspy MultiOpenAI (https://dspy-docs.vercel.app/api/language_model_clients/MultiOpenAI).

It adds support for the LiteLLM proxy.

However, can we also add a class in sentence_vectorizer to give a quick solution for LiteLLM embeddings?

I have a PR that adds LiteLLMVectorizer to allow LiteLLM embeddings to be used (https://github.com/stanfordnlp/dspy/pull/1240).

I have changed the name to MultiOpenAIVectorizer to follow the new MultiOpenAI API in dspy. Can we merge this PR so that DSPy RMs can use any embedding service with the general OpenAI API format, including the LiteLLM proxy? That would let dspy support prediction and embedding at the same time through the general MultiOpenAI API. (A hypothetical usage sketch follows.)
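
A hypothetical usage sketch of the proposed vectorizer: MultiOpenAIVectorizer is the class from PR #1240, but the constructor arguments shown here are guesses at what an OpenAI-compatible embedding endpoint (such as a LiteLLM proxy) would need, not the actual signature from the PR:

# Hypothetical constructor arguments; check the PR for the real interface.
from dsp.modules.sentence_vectorizer import MultiOpenAIVectorizer

vectorizer = MultiOpenAIVectorizer(
    model="text-embedding-3-small",
    api_base="http://localhost:4000/v1",  # LiteLLM proxy endpoint
    api_key="EMPTY",
)

# The vectorizer would then be handed to whichever DSPy retrieval model (RM)
# accepts a custom vectorizer, so retrieval embeddings and predictions can both
# go through the same OpenAI-style API.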
