Extraction with multiple enumerators or multiple decoders

faberf commented 6 months ago

For the use case of extracting ASR features at a more coarse-grained level than clip features, it would be useful to be able to define a pipeline in which one of the following holds:

1. A video is decoded at a very fine-grained level, and the resulting segments are then grouped differently by differently configured transformers.
2. A video is decoded twice, once at a fine-grained level and once at a coarse-grained level.
3. At the very least, the video file is enumerated twice, leading to two separate video source retrievables.

My impression is that 1 may be more useful down the line. 2 and 3 should be quite easy to implement, but they are currently not fully supported (or at least I haven't figured out how to implement them).
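To make option 1 concrete, here is a minimal sketch of the idea of decoding once and regrouping the same fine-grained segments at two granularities. It is plain Python and not vitrivr-engine code: `Segment`, `regroup` and the window sizes are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical decoded chunk covering [start, end) in seconds."""
    start: float
    end: float


def regroup(segments: Iterable[Segment], window: float) -> Iterator[List[Segment]]:
    """Group consecutive fine-grained segments into windows of about `window` seconds."""
    group: List[Segment] = []
    for seg in segments:
        group.append(seg)
        # Close the group once it spans at least the requested window.
        if group[-1].end - group[0].start >= window:
            yield group
            group = []
    if group:  # emit the trailing, possibly shorter, group
        yield group


# One decoding pass at 1 s granularity (assumed value) ...
fine = [Segment(float(i), float(i + 1)) for i in range(10)]
# ... feeds two differently configured groupings.
clip_groups = list(regroup(fine, window=2.0))  # finer, clip-level grouping
asr_groups = list(regroup(fine, window=5.0))   # coarser grouping for ASR
```

The point of the sketch is only that both groupings consume the output of a single decoding pass, so the media file is never decoded twice.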

video-multiple-decoders.json: this pipeline decodes videos twice and correctly creates temporal metadata descriptors and segments of different granularity; however, it does not persist anything. When I remove "long-decoder-stage" from the input of "time-stage", the pipeline uses only a single decoder and ends up persisting everything properly.

Digging into this: in line 60 of IngestionPipelineBuilder (commit ce53093f76eb5055f997b3e64ec11ce24161547e), the enumerator is neither checked for multiple outputs (as other operators are in line 111) nor, if necessary, wrapped in a broadcast operator. A simple fix (checking and wrapping) does not work, because the decoder expects an Enumerator as input and a BroadcastOperator is not an Enumerator.
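As a rough illustration of why a single-output source needs some form of broadcasting before it can feed two decoders, here is a small Python sketch. It does not reproduce vitrivr-engine's operators: `enumerate_files` and `decode` are hypothetical stand-ins, and `itertools.tee` is used only as a generic way to fan out one stream.

```python
import itertools
from typing import Iterator, List


def enumerate_files(paths: List[str]) -> Iterator[str]:
    """Hypothetical enumerator: yields each source path exactly once."""
    for path in paths:
        yield path


def decode(source: Iterator[str], segment_seconds: int) -> List[str]:
    """Hypothetical decoder: labels each source with its segment length."""
    return [f"{path}@{segment_seconds}s" for path in source]


paths = ["a.mp4", "b.mp4"]

# Without broadcasting, the first decoder drains the single-pass enumerator.
shared = enumerate_files(paths)
fine_naive = decode(shared, 1)
coarse_naive = decode(shared, 10)  # receives nothing

# With broadcasting, each decoder gets its own view of the same enumeration.
fine_in, coarse_in = itertools.tee(enumerate_files(paths), 2)
fine = decode(fine_in, 1)
coarse = decode(coarse_in, 10)
```

In a statically typed builder the fan-out wrapper would additionally have to be accepted wherever the decoder expects its input type, which is the obstacle described above.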

Point 3, using multiple enumerators, doesn't seem to work either. video-multiple-enumerators.json: this pipeline fails with "Dangling operators are not supported".

As a minor side note: it would make sense if file metadata extraction already worked immediately after enumeration, but currently a decoding stage seems to be necessary. This is probably not a big issue in practice, though.

lucaro commented 6 months ago

Only option 1 sounds reasonable to me. Generally, having multiple enumerators does not make a lot of sense. Having multiple decoders only makes sense if you have a mixed collection with multiple media types. Decoding the same document multiple times will always be less efficient than decoding it once, so that is what should be done whenever possible.

faberf commented 6 months ago

Generally I agree. @ppanopticon mentioned today that points 2 and 3 are probably supported and would be a good interim solution; this issue was also intended as a reply to that. Nevertheless, silently failing and not persisting anything is strange behaviour.

faberf commented 6 months ago

Update: it turns out that with video-multiple-decoders.json I was incorrectly using COMBINE when I should have used MERGE. Switching to MERGE resolves this issue.
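For readers wondering why the choice of merge strategy could make the difference between persisting everything and persisting nothing, here is a toy Python sketch. It rests on an assumption that this thread does not confirm: that a COMBINE-like strategy forwards an item only once every input branch has produced it, while a MERGE-like strategy forwards every item from every branch. The functions `merge` and `combine` are hypothetical and are not the vitrivr-engine implementation; ordering is also simplified.

```python
from typing import Hashable, List, Set


def merge(*branches: List[Hashable]) -> List[Hashable]:
    """Assumed MERGE-like behaviour: forward every item from every branch."""
    out: List[Hashable] = []
    for branch in branches:
        out.extend(branch)
    return out


def combine(*branches: List[Hashable]) -> List[Hashable]:
    """Assumed COMBINE-like behaviour: forward only items seen on all branches."""
    common: Set[Hashable] = set(branches[0]).intersection(*branches[1:])
    return [item for item in branches[0] if item in common]


# Two decoders with different granularity produce disjoint sets of segments.
short_segments = ["short-0", "short-1", "short-2"]
long_segments = ["long-0"]

merged = merge(short_segments, long_segments)      # all four segments continue
combined = combine(short_segments, long_segments)  # nothing is common to both
```

Under that assumption, two decoder branches that never emit the same segment would leave a COMBINE-style stage with nothing to pass on, which would be consistent with the behaviour reported above.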

vitrivr / vitrivr-engine

Extraction with multiple enumerators or multiple decoders #77