vivanenko opened this issue 5 months ago
But will this param be applicable to other vision models, not only the OpenAI models? I think this param should be passed as metadata.
I think that can be done with an imageContent metadata. I will add this to the openai connector later on.
@Krzysztof318 We need to think about whether we want to use metadata for these purposes or introduce a new type for OpenAI-specific image content (e.g. `OpenAIImageContent`) and keep it in the OpenAI connector. We already have a similar scenario with `ChatMessageContent` and its derived `OpenAIChatMessageContent`, which contains OpenAI-specific properties like `ToolCalls`.
The right answer to the question "which approach to use, metadata or a derived type?" should come from the usage perspective. With the metadata approach, this is how it would look for the user:
```csharp
chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent("What's in this image?"),
    new ImageContent(new Uri(ImageUri), metadata: new Dictionary<string, object?> { { "detail", "high" } })
});
```
With the derived type approach, it will be a bit simpler and, most importantly, strongly-typed:
```csharp
chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent("What's in this image?"),
    new OpenAIImageContent(new Uri(ImageUri)) { Detail = "high" }
});
```
`ImageContent` is currently `sealed`, and I believe that's a mistake. When I added `AudioContent`, I initially made it `sealed` as well, but then reverted that change exactly for this scenario. I think we could do the same for `ImageContent`.
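If `ImageContent` were unsealed, the connector-side specialization could look roughly like the following. This is a hypothetical sketch with simplified stand-in types, not the actual Semantic Kernel classes:

```csharp
using System;

// Simplified stand-in for Microsoft.SemanticKernel's ImageContent,
// shown here unsealed so a connector can derive from it.
public class ImageContent
{
    public Uri? Uri { get; }
    public ImageContent(Uri uri) => Uri = uri;
}

// Hypothetical OpenAI-connector specialization carrying the provider-specific setting.
public sealed class OpenAIImageContent : ImageContent
{
    // OpenAI's "detail" parameter: "low", "high", or "auto".
    public string Detail { get; init; } = "auto";

    public OpenAIImageContent(Uri uri) : base(uri) { }
}

public static class Demo
{
    public static void Main()
    {
        ImageContent content = new OpenAIImageContent(new Uri("https://example.com/cat.png")) { Detail = "high" };

        // The OpenAI connector can pattern-match to read its own properties,
        // while other connectors simply see the base ImageContent.
        if (content is OpenAIImageContent openAi)
            Console.WriteLine(openAi.Detail); // prints "high"
    }
}
```

The upside over metadata is compile-time safety: a typo like `"detial"` fails at build time instead of being silently ignored by the connector.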
So, before any implementation, I would recommend evaluating both approaches and choosing the one that works best from the usage perspective.
Okay @dmytrostruk, I will hold off on implementing this. When you think about a solution, please keep in mind that one more property may be needed. For example, the Gemini REST API requires providing a MIME type with an image request. For now I have used the metadata key `"mime_type"` to support the Gemini vision model (you can see the implementation in #4957). So a derived type would look more elegant.
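For reference, the metadata workaround described above can be sketched like this, with simplified stand-in types (illustrative only; see #4957 for the actual implementation):

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for ImageContent with a metadata bag, mirroring
// the metadata-based approach under discussion.
public class ImageContent
{
    public Uri Uri { get; }
    public IReadOnlyDictionary<string, object?>? Metadata { get; }

    public ImageContent(Uri uri, IReadOnlyDictionary<string, object?>? metadata = null)
    {
        Uri = uri;
        Metadata = metadata;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Gemini needs a MIME type with every image part, so it is smuggled through metadata.
        var image = new ImageContent(
            new Uri("https://example.com/cat.png"),
            new Dictionary<string, object?> { ["mime_type"] = "image/png" });

        // Connector-side: a stringly-typed lookup with no compile-time safety.
        if (image.Metadata != null && image.Metadata.TryGetValue("mime_type", out var mime))
            Console.WriteLine(mime); // prints "image/png"
    }
}
```

The stringly-typed `"mime_type"` key is exactly what makes a derived type with a real `MimeType` property look more elegant.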
@dmytrostruk Have you thought about supporting both the derived-type and the generic (metadata) solution, similar to how it's done for `ExecutionSettings`? (We can pass `OpenAIPromptExecutionSettings`, or the base class `PromptExecutionSettings` with extra JSON data.)
This could be one of the possible solutions as well.
Hi, with last week's GA release of GPT-4 with Vision on Azure (gpt-4-turbo-2024-04-09), has this moved forward? I'm happy to contribute something if there is a clear path forward, but I feel a bit too far removed to make design calls in this area.
In the short term, is there a way to work around this and set/force the detail level?
> With the derived type approach, it will be a bit simpler and, most importantly, strongly-typed:
We need to be thinking about consumers not knowing which service they're using, otherwise the abstractions fall apart. If this is on an OpenAI-specific content, then how do I specify detail to be used if possible if the target service understands it?
That's actually interesting, because when I invoke `IChatCompletionService` or `Kernel`, I usually pass an instance of `OpenAIPromptExecutionSettings`, because I need to configure `temperature` and other OpenAI-specific parameters. The abstraction still works, because in one place of the application I can call OpenAI and in another I can call Hugging Face.
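The `ExecutionSettings` pattern being referenced works roughly as follows. This is a minimal sketch with stand-in types (the real ones live in `Microsoft.SemanticKernel`), showing how a connector can accept either the strongly-typed or the property-bag flavor:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the base settings type every connector understands.
public class PromptExecutionSettings
{
    // Loosely-typed parameters for callers that don't know the target service.
    public IDictionary<string, object>? ExtensionData { get; set; }
}

// Stand-in for the OpenAI-specific settings with strongly-typed properties.
public sealed class OpenAIPromptExecutionSettings : PromptExecutionSettings
{
    public double Temperature { get; set; } = 1.0;
}

public static class Demo
{
    // Sketch of how a connector can resolve a parameter from either flavor.
    public static double ResolveTemperature(PromptExecutionSettings settings)
    {
        if (settings is OpenAIPromptExecutionSettings openAi)
            return openAi.Temperature; // strongly-typed path

        if (settings.ExtensionData != null &&
            settings.ExtensionData.TryGetValue("temperature", out var value))
            return Convert.ToDouble(value); // property-bag path

        return 1.0; // connector default
    }

    public static void Main()
    {
        var typed = new OpenAIPromptExecutionSettings { Temperature = 0.2 };
        var generic = new PromptExecutionSettings
        {
            ExtensionData = new Dictionary<string, object> { ["temperature"] = 0.2 }
        };
        Console.WriteLine(ResolveTemperature(typed));
        Console.WriteLine(ResolveTemperature(generic));
    }
}
```

Both calls resolve to the same value, which is what keeps the abstraction intact for service-agnostic callers.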
But the case where the developer doesn't know which service will be used should be handled as well. Maybe a property bag will work better after all. For `PromptExecutionSettings`, we have `ExtensionData`, and I think it's (de)serialized correctly to include all necessary parameters. We can follow a similar approach for `ImageContent`. I also think this should be covered not only for `ImageContent`, but for the other `Content` classes as well, to support APIs of multiple AI providers at the same time.
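Applied to `ImageContent`, that could look like the following hybrid: a property bag on the base type for service-agnostic callers, plus a strongly-typed view layered over the same bag. This is an illustrative sketch, not the shipped API:

```csharp
using System;
using System.Collections.Generic;

// Sketch: base content type carries a property bag,
// analogous to PromptExecutionSettings.ExtensionData.
public class ImageContent
{
    public Uri? Uri { get; set; }
    public IDictionary<string, object?> ExtensionData { get; } = new Dictionary<string, object?>();
}

// Sketch: strongly-typed accessors over the same bag, so both styles interoperate.
public class OpenAIImageContent : ImageContent
{
    public string? Detail
    {
        get => ExtensionData.TryGetValue("detail", out var v) ? v as string : null;
        set => ExtensionData["detail"] = value;
    }
}

public static class Demo
{
    public static void Main()
    {
        // A service-agnostic caller sets the value without referencing any OpenAI type:
        var image = new ImageContent { Uri = new Uri("https://example.com/cat.png") };
        image.ExtensionData["detail"] = "high";

        // The connector reads it from the bag regardless of which style the caller used.
        Console.WriteLine(image.ExtensionData["detail"]); // prints "high"
    }
}
```

Because the typed property and the bag share storage, the connector only ever needs to read the bag, and typed callers still get IntelliSense and compile-time checks.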
Tagging @RogerBarreto, since he is working on graduating the `Content` classes at the moment, and it's probably good timing to consider this requirement as well.
@dmytrostruk Thanks for tagging me. This is an important topic, and I also follow your suggestion of making a specialization of `ImageContent` on the connector side and unsealing it, as already proposed in #6319.
[Residue of a diagram/snippet: a proposed `UbberService<TInput, TOutput>` with switch logic across service modality types.]

Currently we have the `PromptExecutionSettings` abstraction, which is used for `ChatCompletion` and `TextGeneration`; we can leverage that same abstraction for `ImageGeneration` and other modalities.
For this use case we can have a specialized `OpenAITextToImageExecutionSettings` with a `Detail` property to tell the API what level of detail is expected during generation. `OpenAITextToImageExecutionSettings` can also be used from the Semantic Kernel service discovery (`IServiceSelector`) perspective, to identify whether a service supports that type or the `ITextToImageService` interface and pre-select it for usage.
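A hypothetical sketch of that specialized settings type, following the `FromExecutionSettings` upgrade convention used elsewhere in the connectors (the type and member names here are assumptions drawn from the proposal above, with stand-in base types):

```csharp
using System.Collections.Generic;

// Stand-in for the base settings abstraction.
public class PromptExecutionSettings
{
    public IDictionary<string, object>? ExtensionData { get; set; }
}

// Hypothetical settings type from the proposal above.
public sealed class OpenAITextToImageExecutionSettings : PromptExecutionSettings
{
    // Expected level of detail: "low", "high", or "auto".
    public string Detail { get; set; } = "auto";

    // Connector convention: upgrade base settings to the specialized type,
    // falling back to the ExtensionData bag for service-agnostic callers.
    public static OpenAITextToImageExecutionSettings FromExecutionSettings(PromptExecutionSettings? settings)
    {
        if (settings is OpenAITextToImageExecutionSettings openAi)
            return openAi;

        var result = new OpenAITextToImageExecutionSettings();
        if (settings?.ExtensionData != null &&
            settings.ExtensionData.TryGetValue("detail", out var detail) &&
            detail is string s)
        {
            result.Detail = s;
        }
        return result;
    }
}
```

With this shape, a caller holding only the base `PromptExecutionSettings` can still influence `Detail`, while the connector works with a single strongly-typed object internally.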
`ImageContent` should have a `Detail` parameter which accepts the values `"high"`, `"low"`, or `"auto"`.
> The detail parameter in the model offers three choices: low, high, or auto, to adjust the way the model interprets and processes images. The default setting is auto, where the model decides between low or high based on the size of the image input.
https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision#detail-parameter-settings-in-image-processing-low-high-auto
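For reference, this is where the value ends up on the wire: the chat completions API accepts `detail` inside each `image_url` content part, per the docs linked above (the URL below is a placeholder):

```json
{
  "role": "user",
  "content": [
    { "type": "text", "text": "What's in this image?" },
    {
      "type": "image_url",
      "image_url": { "url": "https://example.com/cat.png", "detail": "high" }
    }
  ]
}
```

Whatever surface Semantic Kernel exposes (metadata, derived type, or property bag), the connector ultimately has to produce this field in the request payload.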