Closed sauterl closed 7 months ago
I don't understand this definition. Why isn't "operations" just a list? Also, in the previous version, one could specify branching operations; how would this be done here?
I don't understand this definition. Why isn't "operations" just a list?
In order to support branching, and possibly merging in the future, should these scenarios be considered.
Also, in the previous version, one could specify branching operations, how would this be done here?
I accidentally overlooked this: according to the require statements in the predecessor version of the pipeline definition, the single point of branching is after a Segmenter, where a list of Aggregators is provided. Therefore, I set this PR into draft mode and will address branching.
It is still not clear to me what the semantics of the "operations" block is supposed to be.
@lucaro The operations block is supposed to be the named representation of the extraction operator tree. Essentially, the operations block is the replacement of all the next* properties of the current definition, while the actual operator declarations have been moved to the operators block.
But there is no defined order to them. There are some objects, each of which has only one operator, all of them have arbitrary keys that are not referenced anywhere. How does this define any sort of structure?
@lucaro The structure definition is added, as the updated PR description reflects.
To re-iterate, the operators are OperatorConfigs, declaring operators, i.e. they are the nodes of the tree, while the operations, being OperationConfigs, describe the relationships between the operators, i.e. they are the edges of the tree.
A valid, branching-including, index definition might look as follows:
{
"schema": "sandbox",
"context": {
"contentFactory": "InMemoryContentFactory",
"resolverFactory": "DiskResolver",
"parameters": {
"location": "./thumbnails/sandbox2"
}
},
"enumerator": {
"type": "ENUMERATOR",
"factory": "FileSystemEnumerator",
"parameters": {
"path": "./sandbox/imgs",
"mediaTypes": "IMAGE;VIDEO",
"depth": "1"
}
},
"decoder": {
"factory": "ImageDecoder"
},
"operators": {
"A": {
"type": "SEGMENTER",
"factory": "PassThroughSegmenter"
},
"B1": {
"type": "AGGREGATOR",
"factory": "AllContentAggregator"
},
"B2": {
"type": "AGGREGATOR",
"factory": "AllContentAggregator"
},
"C1": {
"type": "EXTRACTOR",
"fieldName": "averagecolor"
},
"C2": {
"type": "EXPORTER",
"exporterName": "thumbnail",
"parameters": {
"maxSideResolution": "350",
"mimeType": "JPG"
}
},
"D1": {
"type": "EXTRACTOR",
"fieldName": "file"
}
},
"operations": {
"stage1": {"operator": "A", "next": ["stage2-1","stage2-2"]},
"stage2-1": {"operator": "B1", "next": ["stage3-1"]},
"stage2-2": {"operator":"B2", "next": ["stage3-2"]},
"stage3-1": {"operator": "C1", "next": ["stage-4"]},
"stage3-2": {"operator": "C2"},
"stage4": {"operator": "D1"},
}
}
As of now, the pipeline is validated and, until #51 is implemented, branching is only allowed after a SEGMENTER, to multiple AGGREGATORs.
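The interim validation rule can be illustrated with a small Python sketch. This is hypothetical, not the actual implementation; the `validate_branching` function is invented, and the dictionaries mirror the branching part of the example above:

```python
# Hypothetical sketch of the interim validation rule, not the actual code:
# an operation may only branch (more than one "next" entry) if its operator
# is a SEGMENTER, and every branch must lead to an AGGREGATOR.
def validate_branching(operators, operations):
    for name, op in operations.items():
        successors = op.get("next", [])
        if len(successors) > 1:
            if operators[op["operator"]]["type"] != "SEGMENTER":
                raise ValueError(f"{name}: branching only allowed after a SEGMENTER")
            for succ in successors:
                if operators[operations[succ]["operator"]]["type"] != "AGGREGATOR":
                    raise ValueError(f"{name}: branches must lead to AGGREGATORs")

# Mirrors the branching portion of the example above.
operators = {
    "A": {"type": "SEGMENTER"},
    "B1": {"type": "AGGREGATOR"},
    "B2": {"type": "AGGREGATOR"},
}
operations = {
    "stage1": {"operator": "A", "next": ["stage2-1", "stage2-2"]},
    "stage2-1": {"operator": "B1"},
    "stage2-2": {"operator": "B2"},
}
validate_branching(operators, operations)  # valid: no exception raised
```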
EDIT: Fixed broken JSON syntax
Since this PR appears to be based off the dev branch, I would suggest that it also target the dev branch, so a review can cleanly distinguish between the contributions here and the changes inherited from the differences between main and dev.
Quick heads-up: the status of this PR is still a draft, since I think it is reasonable to also work in adjustments that occur when merging #55
My suggestion: #55 -> dev -> this PR -> dev (with corresponding conflict resolving)
@lucaro Ready for review, based on the offline discussion we had yesterday.
Please check out the README, as the new definition language is properly documented there.
A working example (assuming a corresponding schema) for branched ingestion:
{
"schema": "sandbox",
"context": {
"contentFactory": "InMemoryContentFactory",
"resolverName": "disk",
"local": {
"fsenumerator": {
"path": "./sandbox/imgs",
"depth": "1"
},
"thumbs": {
"path": "./sandbox/thumbnails",
"maxSideResolution": "350",
"mimeType": "JPG"
}
}
},
"operators": {
"fsenumerator": {
"type": "ENUMERATOR",
"factory": "FileSystemEnumerator",
"mediaTypes": ["IMAGE","VIDEO"]
},
"decoder": {
"type": "DECODER",
"factory": "ImageDecoder"
},
"pass": {
"type": "SEGMENTER",
"factory": "PassThroughSegmenter"
},
"allContent": {
"type": "AGGREGATOR",
"factory": "AllContentAggregator"
},
"avgColor": {
"type": "EXTRACTOR",
"fieldName": "averagecolor"
},
"thumbs": {
"type": "EXPORTER",
"exporterName": "thumbnail"
},
"fileMeta": {
"type": "EXTRACTOR",
"fieldName": "file"
}
},
"operations": {
"stage2": {"operator": "pass", "inputs": ["stage1"]},
"stage0": {"operator": "fsenumerator"},
"stage1": {"operator": "decoder", "inputs": ["stage0"]},
"stage3": {"operator": "allContent", "inputs": ["stage2"]},
"stage4-1": {"operator": "avgColor", "inputs": ["stage3"]},
"stage4-2": {"operator": "thumbs", "inputs": ["stage3"]},
"stage4-3": {"operator": "fileMeta", "inputs": ["stage3"]}
}
}
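Since the inputs references form a directed acyclic graph, a valid execution order of such an operations block can be recovered with a plain topological sort. A hypothetical Python sketch over the stages of the example above, not code from this PR:

```python
# Hypothetical sketch, not code from this PR: the "inputs" references form a
# DAG, so a topological sort recovers a valid execution order of the stages.
from graphlib import TopologicalSorter  # Python 3.9+

# The operations block of the example above (stage names as in the JSON).
operations = {
    "stage2": {"operator": "pass", "inputs": ["stage1"]},
    "stage0": {"operator": "fsenumerator"},
    "stage1": {"operator": "decoder", "inputs": ["stage0"]},
    "stage3": {"operator": "allContent", "inputs": ["stage2"]},
    "stage4-1": {"operator": "avgColor", "inputs": ["stage3"]},
    "stage4-2": {"operator": "thumbs", "inputs": ["stage3"]},
    "stage4-3": {"operator": "fileMeta", "inputs": ["stage3"]},
}

# graphlib expects node -> predecessors, which is exactly what "inputs" is.
order = list(TopologicalSorter(
    {name: op.get("inputs", []) for name, op in operations.items()}
).static_order())

# The enumerator stage has no inputs, so it always comes first.
assert order[0] == "stage0"
```

Note that the declaration order inside the JSON object ("stage2" before "stage0" above) is irrelevant; only the inputs references define the ordering.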
In line with #27, this introduces a new extraction pipeline definition language, similar to the query language. Essentially, there is now a differentiation between the declaration of Operators and their ordering as a pipeline (the operations). An example IngestionConfig: assuming the existence of a schema sandbox, with an exporter thumbs and fields file, averagecolor. This is, functionality-wise, a replacement of the previous pipeline definition (see IndexConfig and related). However, a lot of the design changes drafted in #51 can be supported with minimal changes, particularly with regard to the validation.