stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License
19.26k stars 1.47k forks

Running batch predictions using DSPy compiled prompts on a large dataset #1208

Open sarora-roivant opened 5 months ago

sarora-roivant commented 5 months ago

Hi,

I've implemented a clinical entity extraction pipeline using DSPy for processing patient notes. The pipeline extracts various entities (drugs, diseases, procedures, lab tests) and performs condition assessments. Currently, I'm facing challenges in scaling this pipeline to process a large dataset of approximately 500,000 notes efficiently.

Current Implementation:

  1. Data loading
  2. Note aggregation and preprocessing
  3. Entity extraction using DSPy signatures and predictors
  4. Condition assessment using custom DSPy modules
  5. Result processing and export

Challenges:

  1. Processing time: Currently, it takes about 5-6 minutes to process a single note.
  2. Lack of native batch processing: Each note is processed individually, leading to inefficient use of API calls and resources.
  3. Scaling difficulties: The current approach is not feasible for processing 500,000 notes in a reasonable timeframe.
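Independent of DSPy, the per-note LLM calls are I/O-bound, so they can be overlapped with a thread pool. A minimal stdlib sketch of challenge 2's remedy, where `process_note` is a stand-in for the real pipeline call:

```python
from concurrent.futures import ThreadPoolExecutor

def process_note(note):
    # Stand-in for the real DSPy pipeline invocation on one note;
    # here it just returns a placeholder result.
    return {"note": note, "entities": []}

notes = [f"note {i}" for i in range(100)]

# Threads overlap the network-bound API calls; max_workers should be
# tuned to the provider's rate limits.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(process_note, notes))
```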

Questions for Scaling:

  1. What is the recommended approach for using DSPy with large datasets (~500,000 notes)?
  2. Are there any best practices for compiling DSPy prompts for batch processing?
  3. How can we optimize the use of compiled DSPy prompts in a distributed computing environment?

Any guidance on efficiently scaling DSPy for large-scale entity extraction tasks would be greatly appreciated. I'm open to restructuring my pipeline or adopting new approaches to achieve better performance.

okhat commented 5 months ago

Use dspy.evaluate.Evaluate and pass num_threads. For the metric, just pass a dummy metric that always returns True or False.