Extending Content Author to support more finegrained tagging

faberf commented 2 months ago

Using the DescriptorAsContentTransformer, descriptors such as FileMetadata can be fed back into the pipeline as content, for instance to build a prompt for later captioning. So far, complex struct descriptors have been ignored, since they do not map onto content easily. To support the use case of including EXIF metadata in prompts for image captioning, I have made the following main contribution:

The DescriptorAsContentTransformer now uses the ContentAuthorAttributes as a rudimentary tagging system to load all the values of an AnyMapStructDescriptor into the content of a retrievable while maintaining the keys.

For instance, if the name of the operator is "exif_content_transformer" and the names of the subfields are "location" and "date", then the transformer will turn the descriptor into two text content elements, tag both with "exif_content_transformer", and additionally tag the location content with "exif_content_transformer.location" and the date content with "exif_content_transformer.date".
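To make the tagging scheme above concrete, here is a minimal, self-contained Kotlin sketch. It is not the code from this PR: `TaggedText` and `tagFields` are hypothetical names, a plain `Map` stands in for the AnyMapStructDescriptor, and skipping null values is my own assumption.

```kotlin
// Hypothetical stand-in for a text content element; not the vitrivr-engine type.
data class TaggedText(val text: String, val tags: Set<String>)

/**
 * Turns each sub-field of a map-like descriptor into one text element.
 * Every element carries the operator name plus an "operator.field" tag,
 * so the original key stays recoverable from the tags.
 */
fun tagFields(operatorName: String, fields: Map<String, Any?>): List<TaggedText> =
    fields.mapNotNull { (key, value) ->
        // Assumption: absent (null) values produce no content element.
        value?.let { TaggedText(it.toString(), setOf(operatorName, "$operatorName.$key")) }
    }
```

Calling `tagFields("exif_content_transformer", ...)` on a map with the keys "location" and "date" yields two elements, each tagged with the operator name and its own "exif_content_transformer.<key>" tag.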

We should discuss whether this is the intended behaviour of our tagging system. If so, let us maybe rename ContentAuthorAttribute to ContentTagAttribute and change CONTENT_AUTHORS_KEY from contentSources (which was inconsistent anyway) to tagWhiteList or something similar. We might also want to make this "operator.tag" namespace approach more rigorous.
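As one possible starting point for that discussion, a more rigorous "operator.tag" namespace could look roughly like the sketch below. Everything in it is hypothetical: `ContentTag`, `parse` and `isAllowed` do not exist in vitrivr-engine, and the rule that a whitelist entry naming an operator admits all of its sub-tags is only an assumption about how a tagWhiteList might behave.

```kotlin
// Hypothetical value type for an "operator.tag" name; not part of vitrivr-engine.
data class ContentTag(val operator: String, val field: String? = null) {
    init {
        require(operator.isNotEmpty() && '.' !in operator) { "operator must be non-empty and dot-free" }
        require(field == null || field.isNotEmpty()) { "field must be non-empty when present" }
    }

    // Renders the tag back into its "operator" or "operator.field" form.
    override fun toString(): String = if (field == null) operator else "$operator.$field"

    companion object {
        /** Splits at the first dot only, so field names may themselves contain dots. */
        fun parse(raw: String): ContentTag {
            val parts = raw.split('.', limit = 2)
            return ContentTag(parts[0], parts.getOrNull(1))
        }
    }
}

/** Assumed rule: a tag passes if the whitelist names it exactly or names its operator. */
fun isAllowed(tag: ContentTag, whiteList: Set<String>): Boolean =
    tag.toString() in whiteList || tag.operator in whiteList
```

With a dedicated type, malformed names such as an empty operator are rejected when the tag is constructed instead of silently failing to match later.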

Additionally:

The ExifMetadata Extractor can now extract more complex JSON objects as strings by simply keeping the raw JSON.

faberf commented 1 month ago

@lucaro Do you have any feedback on this?

lucaro commented 1 month ago

I have not gotten around to looking at it yet; I will do so as soon as I'm able.

faberf commented 1 month ago

Can somebody help me out with this test? Why am I getting failed tests on functionality I didn't touch?

faberf commented 1 month ago

Ah ok, upon rerunning the tests they passed... spooky

vitrivr / vitrivr-engine

Extending Content Author to support more finegrained tagging #114