Proposed repository structure

SkalskiP commented 10 months ago

Proposed Code Structure

Every prompting pipeline comes with a prompt_creator and a result_processor. You can instantiate those classes manually or call the pipeline function with a name argument.

from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple
import numpy as np
import supervision as sv

class BasePromptCreator(ABC):
    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass

class BaseResultProcessor(ABC):
    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass

    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass

class SamPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, sv.Detections]:
        pass

class SamResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass

class GroundingDinoPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, categories: List[str]) -> Tuple[np.ndarray, sv.Detections]:
        pass

class GroundingDinoResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass

PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}

def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)

    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")

    PromptCreatorClass, ResultProcessorClass = pipeline_classes

    # The same kwargs are forwarded to both constructors, so both classes must accept them.
    prompt_creator = PromptCreatorClass(**kwargs)
    result_processor = ResultProcessorClass(**kwargs)

    return prompt_creator, result_processor
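
As written, pipeline() forwards the same kwargs to both constructors, while the result processors above define no constructor that accepts device. One possible way around this, shown here as a minimal sketch and not as the project's actual implementation, is to pass each class only the keyword arguments its constructor declares. StubPromptCreator, StubResultProcessor, and accepted_kwargs are hypothetical names introduced for illustration.

```python
import inspect
from typing import Any, Dict, Tuple, Type


class StubPromptCreator:
    """Hypothetical stand-in for a prompt creator that needs a device."""

    def __init__(self, device: str) -> None:
        self.device = device


class StubResultProcessor:
    """Hypothetical stand-in for a result processor with no constructor arguments."""

    def __init__(self) -> None:
        pass


PIPELINES: Dict[str, Tuple[Type, Type]] = {
    "stub": (StubPromptCreator, StubResultProcessor),
}


def accepted_kwargs(cls: Type, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the kwargs that the constructor of cls declares."""
    params = inspect.signature(cls).parameters
    # A constructor that itself takes **kwargs accepts everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in params}


def pipeline(name: str, **kwargs: Any) -> Tuple[Any, Any]:
    """Build both pipeline objects, giving each only the kwargs it accepts."""
    classes = PIPELINES.get(name)
    if classes is None:
        raise ValueError(
            f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}."
        )
    creator_cls, processor_cls = classes
    creator = creator_cls(**accepted_kwargs(creator_cls, kwargs))
    processor = processor_cls(**accepted_kwargs(processor_cls, kwargs))
    return creator, processor


creator, processor = pipeline("stub", device="cuda")
```

The trade-off is that a misspelled keyword argument is silently dropped instead of raising, so a stricter variant might reject any kwarg that neither class accepts.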

Example Usage

LMM inference gets sandwiched between prompt_creator and result_processor calls.

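import cv2
from maestro import pipeline, prompt_gpt4_vision

# Placeholder path; load whichever image you want to prompt with.
image = cv2.imread('image.jpg')

prompt_creator, result_processor = pipeline('sam', device='cuda')

image_prompt, marks = prompt_creator.create(image=image)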
text_prompt = 'Find dog.'
api_key = '...'

response = prompt_gpt4_vision(
    text_prompt=text_prompt, 
    image_prompt=image_prompt, 
    api_key=api_key)

visualization = result_processor.visualize(
    text=response, 
    image=image, 
    marks=marks)
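
The result processor has to map the free-form response text back onto the marks drawn on the image. The proposal does not say how. A minimal sketch of that parsing step follows, under the assumption (not stated in the proposal) that the LMM cites marks as zero-based bracketed integers such as [2]. extract_mark_ids is a hypothetical helper, not part of maestro.

```python
import re
from typing import Dict, List

# Assumed citation format: a non-negative integer in square brackets, e.g. "[2]".
MARK_PATTERN = re.compile(r"\[(\d+)\]")


def extract_mark_ids(text: str, mark_count: int) -> List[int]:
    """Return the unique, in-range mark ids cited in text, in order of appearance."""
    seen: Dict[int, None] = {}  # dict keys preserve insertion order
    for match in MARK_PATTERN.finditer(text):
        mark_id = int(match.group(1))
        # Skip ids that do not correspond to any mark on the image.
        if 0 <= mark_id < mark_count:
            seen.setdefault(mark_id, None)
    return list(seen)


response = "The dog is [2], next to [0] and [2]. Ignore [9]."
mark_ids = extract_mark_ids(response, mark_count=3)
```

With three marks on the image, mark_ids is [2, 0]: the repeated [2] is reported once and the out-of-range [9] is ignored.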

PawelPeczek-Roboflow commented 10 months ago

Looks good as a baseline. I am just wondering whether a change along these lines would be more verbose:

maestro = build_maestro('sam', device='cuda').with("gpt-4")
result = maestro.prompt("Find a dog").with_image(image).visualize()

Naming conventions are to be agreed. I would just like to point out that using prompt_creator and result_processor with custom logic in between (logic that cannot be fully custom) may confuse less advanced users, especially since result_processor probably assumes some structure in the response that may not be guaranteed when the client uses their own logic instead of prompt_gpt4_vision().

For more advanced use cases, however, I would let .with("gpt-4") be replaced with .with(my_callable), where my_callable takes agreed parameters and clients can inject their own implementation.
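
A rough sketch of how such a fluent interface could look is given below. It is only an illustration under stated assumptions: with is a reserved word in Python, so the sketch uses the hypothetical name with_model, and the Maestro class, the MODELS registry, the run() method, and the (text_prompt, image_prompt) callable signature are all invented here. The stub model stands in for a real LMM call.

```python
from typing import Any, Callable, Dict, Optional, Union

# Hypothetical "agreed parameters" for an injected model: (text_prompt, image_prompt) -> response.
ModelFn = Callable[[str, Any], str]


def _stub_gpt4(text_prompt: str, image_prompt: Any) -> str:
    """Placeholder for a real LMM call; it only echoes the prompt."""
    return f"gpt-4 stub saw: {text_prompt}"


MODELS: Dict[str, ModelFn] = {"gpt-4": _stub_gpt4}


class Maestro:
    """Collects the pieces of a request step by step, then runs the model."""

    def __init__(self, pipeline_name: str, **kwargs: Any) -> None:
        self.pipeline_name = pipeline_name
        self.kwargs = kwargs
        self._model: Optional[ModelFn] = None
        self._text: Optional[str] = None
        self._image: Any = None

    def with_model(self, model: Union[str, ModelFn]) -> "Maestro":
        """Accept either a registered model name or a client-supplied callable."""
        if callable(model):
            self._model = model
        elif model in MODELS:
            self._model = MODELS[model]
        else:
            raise ValueError(f"Unknown model '{model}'. Choose from {list(MODELS)}.")
        return self

    def prompt(self, text: str) -> "Maestro":
        self._text = text
        return self

    def with_image(self, image: Any) -> "Maestro":
        self._image = image
        return self

    def run(self) -> str:
        """Call the configured model and return its raw text response."""
        if self._model is None or self._text is None:
            raise RuntimeError("Set a model and a prompt before calling run().")
        return self._model(self._text, self._image)


def build_maestro(pipeline_name: str, **kwargs: Any) -> Maestro:
    return Maestro(pipeline_name, **kwargs)


maestro = build_maestro("sam", device="cuda").with_model("gpt-4")
result = maestro.prompt("Find a dog").with_image(None).run()
```

Swapping "gpt-4" for any callable with the same two parameters injects a custom model without touching the rest of the chain. The sketch stops at the raw response and leaves out the prompt creation and visualization steps.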

yeldarby commented 10 months ago

This makes sense to me for set-of-marks-style prompts where you're annotating an image.

I think we may want to have some aspirational things that we may implement some day that we're keeping in mind as we design the API structure. Some thoughts on potential future directions of exploration:

Chaining - taking the output of one response, doing another transformation, and passing it back (eg "find the dog" -> it finds it -> we crop the photo to isolate the object of interest -> "describe this dog")
Few-shot - pulling similar images (and captions/annotations) from a vector DB & passing them along with your prompt to show by example what you want (or "spot the difference" style prompting against a reference image)
RAG - pulling relevant images from a vector DB to add additional context
Temporal / Video - to help with eg the sports broadcasting example
Tool use - using another model like a fine-tuned CNN to be able to add additional context
Integration with existing tools like LangChain (so you can e.g. use these prompting techniques as part of agent flows)
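
To make the chaining idea concrete, here is a small sketch of the "find the dog" -> crop -> "describe this dog" flow. Everything in it is an assumption for illustration: the image is a toy list of pixel rows, and locate and describe are hypothetical callables standing in for two LMM-backed steps.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max bounds exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by box out of the image."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def find_then_describe(
    image: Image,
    locate: Callable[[str, Image], Box],
    describe: Callable[[str, Image], str],
) -> str:
    """Chain two prompts: locate the object, crop to it, then describe the crop."""
    box = locate("find the dog", image)
    region = crop(image, box)
    return describe("describe this dog", region)


def stub_locate(prompt: str, image: Image) -> Box:
    """Stand-in for a localization step; always returns the same box."""
    return (1, 1, 3, 3)


def stub_describe(prompt: str, image: Image) -> str:
    """Stand-in for a captioning step; reports the size of what it received."""
    return f"{len(image)}x{len(image[0])} crop"


image = [[row * 4 + col for col in range(4)] for row in range(4)]
description = find_then_describe(image, stub_locate, stub_describe)
```

Because each step is just a callable, the same shape would allow a second model, a vector DB lookup, or a tool call to be slotted in between prompts.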

SkalskiP commented 10 months ago

Cool! I'll keep that in mind. We had a call with @PawelPeczek-Roboflow and agreed on the PromptCreator and ResultProcessor structure. Those can encapsulate a lot of the logic you just described. We just need to make sure the top layer allows various arguments to be passed freely. But because we are still not sure what we want to support, we'll add the high-level API at the very end.

SkalskiP commented 1 month ago

We are changing the profile of the project, which makes these old ideas obsolete.

roboflow / maestro

Proposed repository structure #6
