Support Multi Modal Input data types

Ankush-lastmile commented 10 months ago

Similar to how the aiconfig SDK supports typed output data (for example, OutputDataWithStringValue in schema.py), we need to add support for a similar structure for input data.

This can be as simple as something like:

from typing import Literal

from pydantic import BaseModel


class AttachmentInputDataWithStringValue(BaseModel):
    """
    This represents the input data that is stored as a string, but we use
    both the `kind` field here and the `mime_type` to convert
    the string into the output format we want.
    """

    kind: Literal["file_uri", "base64"]
    value: str

Once that is done, we should clean up any existing one-off implementations of input data types. As of this writing, the AutomaticSpeechRecognition model parser defines this ad hoc and enforces this type. Address the TODOs and clean up call sites and usages as well.

To clarify, it tries to load the input data and throws on incompatibility. With the introduction of types into schema, this validation will be done at load() time. Instead of loading, simply check for existence of input data.
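
As a rough illustration of that split, here is a minimal sketch of validating once at load time and leaving the parser with only an existence check. It uses a stdlib dataclass as a stand-in for the proposed schema type so that it runs without dependencies; the names AttachmentInputData, load_attachment and has_input_data are hypothetical and are not part of the aiconfig SDK.

```python
from dataclasses import dataclass
from typing import Any, Literal, get_args

# Allowed values for `kind`, mirroring the proposal above.
AttachmentKind = Literal["file_uri", "base64"]


@dataclass(frozen=True)
class AttachmentInputData:
    """Hypothetical stand-in for the proposed schema type (not the real SDK class)."""

    kind: AttachmentKind
    value: str

    def __post_init__(self) -> None:
        # Dataclasses do not enforce type hints, so check the fields explicitly.
        if self.kind not in get_args(AttachmentKind):
            raise ValueError(f"unsupported kind: {self.kind!r}")
        if not isinstance(self.value, str):
            raise TypeError("value must be a string")


def load_attachment(raw: dict[str, Any]) -> AttachmentInputData:
    """Validate the raw attachment once, at load time (hypothetical helper)."""
    return AttachmentInputData(kind=raw["kind"], value=raw["value"])


def has_input_data(attachments: list[AttachmentInputData]) -> bool:
    """All a model parser would still need to check: that some input exists."""
    return len(attachments) > 0
```

With validation moved into the type, a malformed attachment fails when the config is loaded rather than inside an individual model parser.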

edit: ASR model parser does not use this

rossdanlm commented 10 months ago

Note: This is the exact same thing as OutputDataWithStringValue, so we should probably just combine them into a single DataWithStringValue that can be used for both input and output.

Ankush-lastmile commented 10 months ago

There are now multiple model parsers that support non-text inputs: HuggingFace Image to Text and Automatic Speech Recognition, each in both RemoteInference and Transformers variants.

Both of these model parsers support file path inputs. Binary data inputs should be possible, but supporting them is still a TODO.

Steps from here:

1. Figure out how to validate binary data input
2. Update the 4 model parsers to support the binary data input (or validate it works out of the box, address TODOs)
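
One possible starting point for the validation step, assuming binary input arrives as a base64 string (the "base64" kind proposed earlier in this issue): decode it strictly with the standard library and reject malformed or empty payloads. The helper name decode_base64_input is hypothetical, and this sketch does not check that the bytes match the declared mime type.

```python
import base64
import binascii


def decode_base64_input(value: str) -> bytes:
    """Strictly decode a base64 string, raising ValueError if it is malformed or empty."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        data = base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"input is not valid base64: {exc}") from exc
    if not data:
        raise ValueError("decoded input is empty")
    return data
```

A model parser could call a helper like this before handing the bytes to the underlying pipeline, so that bad input fails with a clear error.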

Additional stretch goals would be adding support for GPT-4V and Gemini Vision (Pro).

lastmile-ai / aiconfig

Support Multi Modal Input data types #829