Redesign of the extraction pipeline definition

sauterl commented 7 months ago

In line with #27, this introduces a new extraction pipeline definition language, similar to the query language. Essentially, there is now a distinction between the declaration of operators (the operators block) and their ordering as a pipeline (the operations block).

An example IngestionConfig:

{
  "schema": "sandbox",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName": "disk",
    "local": {
      "fsenumerator": {
        "path": "./sandbox/imgs",
        "depth": "1"
      },
      "thumbs": {
        "path": "./sandbox/thumbnails",
        "maxSideResolution": "350",
        "mimeType": "JPG"
      }
    }
  },
  "operators": {
    "fsenumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["IMAGE","VIDEO"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "ImageDecoder"
    },
    "pass": {
      "type": "SEGMENTER",
      "factory": "PassThroughSegmenter"
    },
    "allContent": {
      "type": "AGGREGATOR",
      "factory": "AllContentAggregator"
    },
    "avgColor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "thumbs": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "fileMeta": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    }
  },
  "operations": {
    "stage2": {"operator": "pass", "inputs": ["stage1"]},
    "stage0": {"operator": "fsenumerator"},
    "stage1": {"operator": "decoder", "inputs": ["stage0"]},
    "stage3": {"operator": "allContent", "inputs": ["stage2"]},
    "stage4": {"operator": "avgColor", "inputs": ["stage3"]},
    "stage5": {"operator": "thumbs", "inputs": ["stage4"]},
    "stage6": {"operator": "fileMeta", "inputs": ["stage5"]}
  }
}

This assumes that a schema sandbox exists, with an exporter thumbnail (used by the thumbs operator) and the fields file and averagecolor.

Functionality-wise, this is a replacement of the previous pipeline definition (see IndexConfig and related). However, many of the design changes drafted in #51 can be supported with minimal changes, particularly with regard to validation.
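Since the keys of the operations block are not listed in execution order, a processing order has to be derived from the inputs references. The following is a minimal Python sketch of that idea, not the parser used in vitrivr-engine; the helper name execution_order is hypothetical, and it assumes that each entry in inputs names another key of the operations block.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The "operations" block from the example above, in its original key order.
operations = {
    "stage2": {"operator": "pass", "inputs": ["stage1"]},
    "stage0": {"operator": "fsenumerator"},
    "stage1": {"operator": "decoder", "inputs": ["stage0"]},
    "stage3": {"operator": "allContent", "inputs": ["stage2"]},
    "stage4": {"operator": "avgColor", "inputs": ["stage3"]},
    "stage5": {"operator": "thumbs", "inputs": ["stage4"]},
    "stage6": {"operator": "fileMeta", "inputs": ["stage5"]},
}


def execution_order(operations):
    """Hypothetical helper: list operation names so each follows its inputs."""
    # TopologicalSorter takes a mapping of node -> predecessors,
    # which is exactly what the "inputs" lists provide.
    graph = {name: op.get("inputs", []) for name, op in operations.items()}
    # static_order() raises graphlib.CycleError if the references form a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(operations)
print(order)
```

For a purely linear pipeline such as this one, only a single valid order exists, so the result does not depend on how the keys are arranged in the JSON.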

lucaro commented 7 months ago

I don't understand this definition. Why isn't "operations" just a list? Also, in the previous version, one could specify branching operations, how would this be done here?

sauterl commented 7 months ago

> I don't understand this definition. Why isn't "operations" just a list?

In order to support branching and possibly merging in the future, in case these scenarios are considered.

> Also, in the previous version, one could specify branching operations, how would this be done here?

I accidentally overlooked this. According to the require statements in the predecessor version of the pipeline definition, the single point of branching is after a Segmenter, where a list of Aggregators is provided. I am therefore setting this PR to draft mode while I address branching.

lucaro commented 7 months ago

It is still not clear to me what the semantics of the "operations" block is supposed to be.

sauterl commented 7 months ago

@lucaro The operations block is supposed to be the named representation of the extraction operator tree.

Essentially, the operations block is the replacement of all the next* of the current definition, while the actual operation declaration has been moved to the operators block.

lucaro commented 7 months ago

But there is no defined order to them. There are some objects, each of which has only one operator, and all of them have arbitrary keys that are not referenced anywhere. How does this define any sort of structure?

sauterl commented 7 months ago

@lucaro The structure definition has been added, as the updated PR description reflects.

To reiterate: the operators are OperatorConfigs, which declare the operators, i.e. the nodes of the tree, while the operations are OperationConfigs, which describe the relationships between the operators, i.e. the edges of the tree.

A valid, branching-including, index definition might look as follows:

{
  "schema": "sandbox",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverFactory": "DiskResolver",
    "parameters": {
      "location": "./thumbnails/sandbox2"
    }
  },
  "enumerator": {
    "type": "ENUMERATOR",
    "factory": "FileSystemEnumerator",
    "parameters": {
      "path": "./sandbox/imgs",
      "mediaTypes": "IMAGE;VIDEO",
      "depth": "1"
    }
  },
  "decoder": {
    "factory": "ImageDecoder"
  },
  "operators": {
    "A": {
      "type": "SEGMENTER",
      "factory": "PassThroughSegmenter"
    },
    "B1": {
      "type": "AGGREGATOR",
      "factory": "AllContentAggregator"
    },
    "B2": {
      "type": "AGGREGATOR",
      "factory": "AllContentAggregator"
    },
    "C1": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "C2": {
      "type": "EXPORTER",
      "exporterName": "thumbnail",
      "parameters": {
        "maxSideResolution": "350",
        "mimeType": "JPG"
      }
    },
    "D1": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    }
  },
  "operations": {
    "stage1": {"operator": "A", "next": ["stage2-1","stage2-2"]},
    "stage2-1": {"operator": "B1", "next": ["stage3-1"]},
    "stage2-2": {"operator": "B2", "next": ["stage3-2"]},
    "stage3-1": {"operator": "C1", "next": ["stage4"]},
    "stage3-2": {"operator": "C2"},
    "stage4": {"operator": "D1"}
  }
}

As of now, the pipeline is validated and, until #51 is implemented, branching is only allowed after a SEGMENTER, to multiple AGGREGATORs.

EDIT: Fixed broken JSON syntax
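The branching rule stated above, together with the requirement that every reference resolves, can be illustrated with a small Python sketch. This is not the validation code of the PR; the function validate and its error messages are hypothetical, and it only assumes the operators/operations layout with next lists shown in the example.

```python
# Operator declarations, reduced to the "type" needed for the checks.
operators = {
    "A": {"type": "SEGMENTER"},
    "B1": {"type": "AGGREGATOR"},
    "B2": {"type": "AGGREGATOR"},
    "C1": {"type": "EXTRACTOR"},
    "C2": {"type": "EXPORTER"},
    "D1": {"type": "EXTRACTOR"},
}

operations = {
    "stage1": {"operator": "A", "next": ["stage2-1", "stage2-2"]},
    "stage2-1": {"operator": "B1", "next": ["stage3-1"]},
    "stage2-2": {"operator": "B2", "next": ["stage3-2"]},
    "stage3-1": {"operator": "C1", "next": ["stage4"]},
    "stage3-2": {"operator": "C2"},
    "stage4": {"operator": "D1"},
}


def validate(operators, operations):
    """Hypothetical check: return a list of human-readable rule violations."""
    errors = []
    for name, op in operations.items():
        ref = op.get("operator")
        if ref not in operators:
            errors.append(f"{name}: unknown operator '{ref}'")
            continue
        successors = op.get("next", [])
        # Every "next" entry must name an existing operation.
        for nxt in successors:
            if nxt not in operations:
                errors.append(f"{name}: unknown next operation '{nxt}'")
        # Branching: only after a SEGMENTER, and only to AGGREGATORs.
        if len(successors) > 1:
            if operators[ref]["type"] != "SEGMENTER":
                errors.append(f"{name}: branching is only allowed after a SEGMENTER")
            for nxt in successors:
                target = operations.get(nxt, {}).get("operator")
                if target in operators and operators[target]["type"] != "AGGREGATOR":
                    errors.append(f"{name}: branch '{nxt}' must lead to an AGGREGATOR")
    return errors


print(validate(operators, operations))  # no violations for the config above

# A mistyped reference (here "stage-4" instead of "stage4") is reported.
broken = dict(operations)
broken["stage3-1"] = {"operator": "C1", "next": ["stage-4"]}
print(validate(operators, broken))
```

Collecting all violations in a list, rather than failing on the first one, is a design choice of this sketch so that a config author sees every problem at once.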

lucaro commented 7 months ago

Since this PR appears to be based off the dev branch, I would suggest that it should also target the dev branch, so a review can cleanly distinguish between the contributions here and the changes inherited from the differences between main and dev.

sauterl commented 7 months ago

Quick heads-up: The status of this PR is still a draft, since I think it is reasonable to also work in the adjustments that occur when merging #55.

My suggestion: #55 -> dev -> this PR -> dev (with corresponding conflict resolution)

sauterl commented 7 months ago

@lucaro Ready for review based on the off-line discussion we had yesterday.

Please check out the README, as the new definition language is properly documented there.

sauterl commented 7 months ago

A working example (assuming a corresponding schema) for branched ingestion:

{
  "schema": "sandbox",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName": "disk",
    "local": {
      "fsenumerator": {
        "path": "./sandbox/imgs",
        "depth": "1"
      },
      "thumbs": {
        "path": "./sandbox/thumbnails",
        "maxSideResolution": "350",
        "mimeType": "JPG"
      }
    }
  },
  "operators": {
    "fsenumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["IMAGE","VIDEO"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "ImageDecoder"
    },
    "pass": {
      "type": "SEGMENTER",
      "factory": "PassThroughSegmenter"
    },
    "allContent": {
      "type": "AGGREGATOR",
      "factory": "AllContentAggregator"
    },
    "avgColor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "thumbs": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "fileMeta": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    }
  },
  "operations": {
    "stage2": {"operator": "pass", "inputs": ["stage1"]},
    "stage0": {"operator": "fsenumerator"},
    "stage1": {"operator": "decoder", "inputs": ["stage0"]},
    "stage3": {"operator": "allContent", "inputs": ["stage2"]},
    "stage4-1": {"operator": "avgColor", "inputs": ["stage3"]},
    "stage4-2": {"operator": "thumbs", "inputs": ["stage3"]},
    "stage4-3": {"operator": "fileMeta", "inputs": ["stage3"]}
  }
}
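To see where this configuration branches, the inputs references can be inverted into a mapping from each operation to its successors. The Python sketch below does this for the operations block above; the helper successors is hypothetical and not part of vitrivr-engine.

```python
from collections import defaultdict

# The "operations" block from the branched example above.
operations = {
    "stage2": {"operator": "pass", "inputs": ["stage1"]},
    "stage0": {"operator": "fsenumerator"},
    "stage1": {"operator": "decoder", "inputs": ["stage0"]},
    "stage3": {"operator": "allContent", "inputs": ["stage2"]},
    "stage4-1": {"operator": "avgColor", "inputs": ["stage3"]},
    "stage4-2": {"operator": "thumbs", "inputs": ["stage3"]},
    "stage4-3": {"operator": "fileMeta", "inputs": ["stage3"]},
}


def successors(operations):
    """Hypothetical helper: map each operation to the operations it feeds."""
    children = defaultdict(list)
    for name, op in operations.items():
        for parent in op.get("inputs", []):
            children[parent].append(name)
    return dict(children)


fan_out = successors(operations)
# An operation with more than one successor is a branch point.
branch_points = [name for name, kids in fan_out.items() if len(kids) > 1]
print(branch_points)
print(fan_out["stage3"])
```

In this configuration the only branch point is stage3, whose output feeds the three stage4-* operations.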

vitrivr / vitrivr-engine

Redesign of the extraction pipeline definition #53