microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License
20.53k stars 2.97k forks

Add support for a "detail" parameter when passing an ImageContent into ChatHistory. #4759

Open vivanenko opened 5 months ago

vivanenko commented 5 months ago

ImageContent should have a "Detail" parameter which accepts the values "high", "low", or "auto".

The detail parameter in the model offers three choices: low, high, or auto, to adjust the way the model interprets and processes images. The default setting is auto, where the model decides between low or high based on the size of the image input.

(Screenshot: 2024-01-26 at 17:57:22)

https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision#detail-parameter-settings-in-image-processing-low-high-auto
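For reference, in the OpenAI chat completions REST API the detail flag sits alongside the image URL, roughly like this (field names per the OpenAI/Azure OpenAI vision documentation; the URL is a placeholder):

```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    {
      "type": "image_url",
      "image_url": { "url": "https://example.com/image.png", "detail": "high" }
    }
  ]
}
```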

Krzysztof318 commented 5 months ago

But will this param be applicable to other vision models, not only the OpenAI models? I think this param should be passed as metadata.

Krzysztof318 commented 4 months ago

I think that can be done with ImageContent metadata. I will add this to the OpenAI connector later on.

dmytrostruk commented 4 months ago

I think that can be done with ImageContent metadata. I will add this to the OpenAI connector later on.

@Krzysztof318 We need to think about whether we want to use metadata for these purposes or introduce a new type for OpenAI-specific image content (e.g. OpenAIImageContent) and keep it in the OpenAI connector.

We already have a similar scenario with ChatMessageContent and then OpenAIChatMessageContent, which contains OpenAI-specific properties like ToolCalls.

The right answer to the question "which approach to use, metadata or a derived type?" can be determined from the usage perspective.

With the metadata approach, this is how it will look for the user:

chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent("What’s in this image?"),
    new ImageContent(new Uri(ImageUri), metadata: new Dictionary<string, object?> { { "detail", "high" } })
});

With the derived type approach, it will be a little bit simpler and, most importantly, strongly typed:

chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent("What’s in this image?"),
    new OpenAIImageContent(new Uri(ImageUri)) { Detail = "high" }
});

ImageContent is currently sealed, and I believe it's a mistake. When I added AudioContent, initially I made it sealed as well, but then I reverted this change exactly for this scenario. I think we could do the same for ImageContent.
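A rough sketch of what the derived-type option could look like once ImageContent is unsealed; note that OpenAIImageContent and its Detail property are names proposed in this thread, not an existing Semantic Kernel API:

```csharp
// Sketch only: assumes ImageContent has been unsealed as proposed above.
// OpenAIImageContent and Detail are proposals from this thread, not shipped API.
public class OpenAIImageContent : ImageContent
{
    public OpenAIImageContent(Uri uri) : base(uri) { }

    /// <summary>Requested detail level: "low", "high", or "auto".</summary>
    public string Detail { get; set; } = "auto";
}
```

The OpenAI connector would then check for this derived type when building the request and fall back to the default behavior for a plain ImageContent.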

So, before any implementation, I would recommend evaluating both approaches and choosing the one that is best in terms of usage.

Krzysztof318 commented 4 months ago

Okay @dmytrostruk, I will hold off on implementing this. When you think about the solution, please keep in mind that one more property may be needed. For example, the Gemini REST API requires providing a MIME type with an image request. For now, I have used the metadata key "mime_type" to support the Gemini vision model (you can see the implementation in #4957). So a derived type would look more elegant.
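For context, the interim metadata approach described here looks roughly like the following, reusing the constructor shape from the metadata snippet earlier in this thread; treat it as a sketch of the convention, not a stable contract:

```csharp
// Interim workaround sketch, mirroring the metadata snippet earlier in this
// thread; the "mime_type" key is the convention from the Gemini connector
// work (#4957), not a formal API contract.
var image = new ImageContent(
    new Uri(imageUri),
    metadata: new Dictionary<string, object?> { { "mime_type", "image/png" } });
```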

Krzysztof318 commented 4 months ago

@dmytrostruk Have you thought about supporting both a derived and a generic solution (metadata), similar to how it is done for ExecutionSettings? (We can pass OpenAIPromptExecutionSettings or the base class PromptExecutionSettings with extra JSON data.)

dmytrostruk commented 4 months ago

@dmytrostruk Have you thought about supporting both a derived and a generic solution (metadata), similar to how it is done for ExecutionSettings? (We can pass OpenAIPromptExecutionSettings or the base class PromptExecutionSettings with extra JSON data.)

This could be one of the possible solutions as well.
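For comparison, the ExecutionSettings pattern referenced here already lets callers pick either the strongly-typed or the generic form; OpenAIPromptExecutionSettings and PromptExecutionSettings.ExtensionData are existing Semantic Kernel types, while the "detail" key below is purely hypothetical, shown only to illustrate the pattern:

```csharp
// Strongly-typed, connector-specific settings (existing Semantic Kernel API):
var typed = new OpenAIPromptExecutionSettings { Temperature = 0.7 };

// Generic base-class settings with a provider-specific property bag;
// the "detail" key is hypothetical, used here only to illustrate the pattern.
var generic = new PromptExecutionSettings
{
    ExtensionData = new Dictionary<string, object> { ["detail"] = "high" }
};
```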

NickDrouin commented 1 month ago

Hi, with last week's GA release of GPT-4 with vision on Azure (gpt-4-turbo-2024-04-09), has this moved forward? I'm happy to contribute something if there is a clear path forward, but I feel a bit too far removed to make design calls in this area.

In the short term, is there a workaround to set/force the detail level?

stephentoub commented 1 month ago

With derived type approach, it will be a little bit simpler and most importantly - strongly-typed:

We need to be thinking about consumers not knowing which service they're using, otherwise the abstractions fall apart. If this is on an OpenAI-specific content type, then how do I specify that detail should be used, when possible, if the target service understands it?

dmytrostruk commented 1 month ago

We need to be thinking about consumers not knowing which service they're using, otherwise the abstractions fall apart. If this is on an OpenAI-specific content type, then how do I specify that detail should be used, when possible, if the target service understands it?

That's actually interesting, because when I invoke IChatCompletionService or Kernel, I usually pass an instance of OpenAIPromptExecutionSettings, because I need to configure temperature and other OpenAI-specific parameters. The abstraction still works, because in one place of the application I can call OpenAI and in another I can call Hugging Face.

But the case when the developer doesn't know which service will be used should be handled as well. Maybe a property bag will work better after all. For PromptExecutionSettings, we have ExtensionData, and I think it's (de)serialized correctly to include all necessary parameters. We can follow a similar approach for ImageContent. I also think this should be covered not only for ImageContent, but for other content classes as well, to support APIs for multiple AI providers at the same time.
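If ImageContent followed the same pattern, usage might look something like this; note that an ExtensionData property on content classes is a proposal in this thread, not shipped API, and the "detail" key is hypothetical:

```csharp
// Proposal sketch: ExtensionData on ImageContent does not exist today,
// and the "detail" key is hypothetical, shown only to illustrate the idea.
var image = new ImageContent(new Uri(imageUri))
{
    ExtensionData = new Dictionary<string, object> { ["detail"] = "high" }
};
```

Connectors that understand a given key would consume it, while other connectors would simply ignore it, keeping the abstraction intact.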

Tagging @RogerBarreto, since he is working on graduating the Content classes at the moment, and it's probably good timing to consider this requirement as well.

RogerBarreto commented 1 month ago

@dmytrostruk Thanks for tagging me. This is an important topic, and I also follow your suggestion of making a specialization of ImageContent on the connector side and unsealing it, as already proposed in #6319.

ImageContent Graduation

On the subject of abstract configurations

Thinking about how the Kernel can support the other service modality types brings us back to:

Add Kernel support for other Service modality types

Currently we have the PromptExecutionSettings abstraction, which is used for ChatCompletion and TextGeneration; in the same way, we can leverage that abstraction for ImageGeneration and other modalities.

For this use case we can have a specialized OpenAITextToImageExecutionSettings with a Detail property that tells the API the expected level of detail during generation.
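A minimal sketch of what such a specialized settings class could look like; OpenAITextToImageExecutionSettings with a Detail property is a proposal in this comment, not an existing Semantic Kernel API:

```csharp
// Proposal sketch, not shipped API: specialized, strongly-typed settings
// for image-related requests, following the PromptExecutionSettings pattern.
public class OpenAITextToImageExecutionSettings : PromptExecutionSettings
{
    /// <summary>Expected level of detail: "low", "high", or "auto".</summary>
    public string Detail { get; set; } = "auto";
}
```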

Kernel Multiple Modalities Discovery

OpenAITextToImageExecutionSettings can also be used from the Semantic Kernel service discovery (IServiceSelector) perspective to identify which service supports that settings type or the ITextToImageService interface and pre-select it for usage.