stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License
19.26k stars 1.47k forks

Running batch predictions using DSPy compiled prompts on a large dataset #1208

Open sarora-roivant opened 5 months ago

sarora-roivant commented 5 months ago

Hi,

I've implemented a clinical entity extraction pipeline using DSPy for processing patient notes. The pipeline extracts various entities (drugs, diseases, procedures, lab tests) and performs condition assessments. Currently, I'm facing challenges in scaling this pipeline to process a large dataset of approximately 500,000 notes efficiently.

Current Implementation:

  1. Data loading
  2. Note aggregation and preprocessing
  3. Entity extraction using DSPy signatures and predictors
  4. Condition assessment using custom DSPy modules
  5. Result processing and export

Challenges:

  1. Processing time: Currently, it takes about 5-6 minutes to process a single note.
  2. Lack of native batch processing: Each note is processed individually, leading to inefficient use of API calls and resources.
  3. Scaling difficulties: The current approach is not feasible for processing 500,000 notes in a reasonable timeframe.
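Independent of DSPy, the per-note LLM calls are I/O-bound, so they can be overlapped with a thread pool. A minimal stdlib sketch of challenge 2's remedy, where `process_note` is a stand-in for the real pipeline call:

```python
from concurrent.futures import ThreadPoolExecutor

def process_note(note):
    # Stand-in for the real DSPy pipeline invocation on one note;
    # here it just returns a placeholder result.
    return {"note": note, "entities": []}

notes = [f"note {i}" for i in range(100)]

# Threads overlap the network-bound API calls; max_workers should be
# tuned to the provider's rate limits.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(process_note, notes))
```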

Questions for Scaling:

  1. What is the recommended approach for using DSPy with large datasets (~500,000 notes)?
  2. Are there any best practices for compiling DSPy prompts for batch processing?
  3. How can we optimize the use of compiled DSPy prompts in a distributed computing environment?

Any guidance on efficiently scaling DSPy for large-scale entity extraction tasks would be greatly appreciated. I'm open to restructuring my pipeline or adopting new approaches to achieve better performance.

okhat commented 5 months ago

Use dspy.evaluate.Evaluate and pass num_threads. For the metric, just pass a dummy metric that always returns True or False.