vitrivr / vitrivr-engine

vitrivr's next-generation retrieval engine. It is capable of extracting and retrieving a wide range of multimedia objects such as audio, video, images and 3D models.
https://vitrivr.org
MIT License

Redesign of Persistence & Pipeline Design #55

ppanopticon closed this 6 months ago

ppanopticon commented 7 months ago

This branch is used to track changes related to the new persistence model (see #52), in which Operators no longer persist information themselves. Instead, a specialized PersistingSink takes care of persistence.
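To illustrate the idea, here is a minimal sketch of the separation between extracting operators and a terminal persisting sink. All names (`Descriptor`, `PersistingSink`, the record fields) are hypothetical and chosen for illustration; they are not the engine's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical descriptor emitted downstream by an extraction operator. */
record Descriptor(String retrievableId, String field, float[] values) {}

/**
 * Sketch of the new model: operators only transform and emit descriptors,
 * while a dedicated sink at the end of the pipeline handles persistence.
 */
class PersistingSink implements Consumer<Descriptor> {
    final List<Descriptor> written = new ArrayList<>();

    @Override
    public void accept(Descriptor d) {
        // In the real engine this would delegate to a database writer;
        // here we simply record the descriptor to illustrate the flow.
        written.add(d);
    }
}
```

The point of the design is that extraction logic stays free of storage concerns; swapping the storage backend only means swapping the sink.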

Changes required for adjustments to the general design:

ppanopticon commented 7 months ago

@lucaro and @sauterl: In light of our planned get-together I invited you as reviewers so that we can discuss the design changes.

ppanopticon commented 6 months ago

Since the two issues of persistence and pipeline design cannot be disentangled completely, this PR now addresses both (#51 & #52). The following changes have been made to how extraction pipelines can be generated:

ppanopticon commented 6 months ago
{
  "schema": "V3C1",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName": "disk",
    "local": {
      "enumerator": {
        "path": "/Volumes/V3C1/V3C1/videos",
        "depth": "1"
      },
      "thumbs": {
        "path": "/Users/rgasser/Downloads/vitrivr-engine/images",
        "maxSideResolution": "350",
        "mimeType": "JPG"
      },
      "filter": {
        "type": "SOURCE:VIDEO" 
      }
    }
  },
  "operators": {
    "enumerator": { "type": "ENUMERATOR", "factory": "FileSystemEnumerator", "mediaTypes": ["VIDEO"]},
    "decoder": { "type": "DECODER", "factory": "VideoDecoder"  },
    "selector": { "type": "TRANSFORMER", "factory": "LastContentAggregator" },
    "avgColor": { "type": "EXTRACTOR", "fieldName": "averagecolor"},
    "file_metadata": { "type": "EXTRACTOR", "fieldName": "file" },
    "time_metadata": { "type": "EXTRACTOR", "fieldName": "time" },
    "video_metadata": { "type": "EXTRACTOR", "fieldName": "video" },
    "thumbs": { "type": "EXPORTER", "exporterName": "thumbnail" },
    "filter": { "type": "TRANSFORMER", "factory": "TypeFilterTransformer"}
  },
  "operations": {
    "enumerator": { "operator": "enumerator" },
    "decoder": { "operator": "decoder", "inputs": [ "enumerator" ] },
    "selector": { "operator": "selector", "inputs": [ "decoder" ] },
    "averagecolor": { "operator": "avgColor","inputs": ["selector"]},
    "thumbnails": {  "operator": "thumbs", "inputs": ["selector"] },
    "time_metadata": {  "operator": "time_metadata", "inputs": ["selector"] },
    "filter": {  "operator": "filter", "inputs": ["averagecolor", "thumbnails", "time_metadata"], "merge": "COMBINE" },
    "video_metadata": {  "operator": "video_metadata", "inputs": ["filter"] },
    "file_metadata": {  "operator": "file_metadata", "inputs": ["video_metadata"] }
  },
  "output": ["file_metadata"]
}
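The `operations` map in the configuration above wires the named operators into a directed acyclic graph: each entry lists the operations it consumes as `inputs`. A valid execution order can be derived with a topological sort. The following sketch (class and method names are hypothetical, not the engine's actual API) uses Kahn's algorithm:

```java
import java.util.*;

/** Sketch: the "operations" map defines a DAG; each entry names its inputs. */
class OperationGraph {
    final Map<String, List<String>> inputs = new LinkedHashMap<>();

    void add(String name, String... ins) { inputs.put(name, List.of(ins)); }

    /** Returns a valid execution order via Kahn's algorithm. */
    List<String> order() {
        Map<String, Integer> indeg = new HashMap<>();
        Map<String, List<String>> out = new HashMap<>();
        for (var e : inputs.entrySet()) {
            indeg.putIfAbsent(e.getKey(), 0);
            for (String in : e.getValue()) {
                indeg.merge(e.getKey(), 1, Integer::sum);
                out.computeIfAbsent(in, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (var e : indeg.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            result.add(n);
            for (String m : out.getOrDefault(n, List.of()))
                if (indeg.merge(m, -1, Integer::sum) == 0) ready.add(m);
        }
        return result; // shorter than inputs.size() if the graph has a cycle
    }
}
```

Fed with the operations from the example configuration, the sort places `enumerator` first and `file_metadata` last, matching the declared `output`.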

To get a feeling for the main features:

This basic example works on my machine. It iterates a folder of videos, decodes them with a 500ms multiplex window, selects the last content element for each emitted element, extracts some features, filters for the "source" element (representing the file), extracts some file-related features and then emits the result to the sink (where it is persisted).
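The `filter` operation above declares `"merge": "COMBINE"` over its three input branches. A plausible reading of that semantics (this is an assumption for illustration; class and method names are hypothetical) is that the downstream operator only sees an item once every input branch has delivered its result for it:

```java
import java.util.*;

/**
 * Sketch of a COMBINE-style merge (assumed semantics): an item is only
 * emitted downstream once all input branches have delivered it.
 */
class CombineMerge {
    final int branches;
    final Map<String, Integer> seen = new HashMap<>();
    final List<String> emitted = new ArrayList<>();

    CombineMerge(int branches) { this.branches = branches; }

    /** Called when one branch finishes processing the item with the given id. */
    void deliver(String itemId) {
        if (seen.merge(itemId, 1, Integer::sum) == branches) emitted.add(itemId);
    }
}
```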

All generated Descriptors and Relationships are kept in memory until the persistence operation concludes.
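In other words, the sink acts as a buffer that accumulates everything and releases it in one step. A minimal sketch of that behaviour (names hypothetical, storage reduced to strings for brevity):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical in-memory buffer: descriptors and relationships accumulate
 * and are only released once a single flush persists them all.
 */
class PersistenceBuffer {
    final List<String> descriptors = new ArrayList<>();
    final List<String> relationships = new ArrayList<>();
    int persisted = 0;

    void stage(String descriptor) { descriptors.add(descriptor); }
    void relate(String relationship) { relationships.add(relationship); }

    /** Persist everything at once; buffers are cleared only afterwards. */
    void flush() {
        persisted += descriptors.size() + relationships.size();
        descriptors.clear();
        relationships.clear();
    }
}
```

The trade-off is the one the comment implies: memory usage grows with the amount of extracted data until the flush completes.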