Open ppanopticon opened 2 weeks ago
I agree with the points you raised and am not a fan of the append-only semantics for all operations. I think we need to fundamentally define what types of operators can perform what kinds of operations and then put mechanisms in place that ensure a certain level of consistency. Various functionalities are currently handled by each feature separately (e.g., filtering by content source), which can easily lead to unexpected behavior when a single feature implements such common functionality differently for some reason. For several types of operators, there is also no need to operate directly on the flow, as the risk of breaking something, in my view, greatly outweighs the flexibility, which is often not even desired at that point. This would all certainly benefit from some re-thinking. A very rough sketch of what such constrained operator categories could look like follows below.
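To make the idea a bit more concrete, here is a minimal sketch in plain Kotlin. None of these names or signatures are the actual vitrivr-engine API; they are placeholders meant only to illustrate giving each operator category a narrow contract instead of full, append-only access to the flow.

```kotlin
// Rough sketch only: placeholder types, not the actual vitrivr-engine API.

interface Retrievable
interface ContentElement
interface Descriptor

/** A transformer may derive new content from existing content, replacing it. */
interface Transformer {
    fun transform(input: List<ContentElement>): List<ContentElement>
}

/** An extractor may only derive descriptors; it never adds or removes content. */
interface Extractor {
    fun extract(retrievable: Retrievable, content: List<ContentElement>): List<Descriptor>
}

/** A segmenter may emit new retrievables that replace the incoming one. */
interface Segmenter {
    fun segment(incoming: Retrievable): List<Retrievable>
}
```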
Description
One of the recent pull requests to `dev` has seen the introduction of the `ContentAuthorAttribute`, with the idea that `ContentElement`s can be labeled and selectively processed based on the operator that created them. Furthermore, and as a side-effect thereof, `ContentElement`s are always appended to an `Ingested` to make sure that different execution paths in the pipeline have access to all the necessary information. In other words, `Retrievable`s have become strictly append-only objects that accumulate different content representations (and other data structures such as `Descriptor`s).

While I understand the desire for the first part, I consider the second part of this mechanism highly problematic for multiple reasons, not least because I immediately run into issues even for very simple examples. Here are my observations in a nutshell:
- The current approach is taxing on memory. Even for very simple pipelines (Decode > Aggregate > Extract), I run into out-of-memory issues within seconds. The reason is simple: the aggregation step no longer frees memory, since all versions of the content are kept around until the video has been fully extracted. This can of course be worked around by adding more memory (unreliable) or by using the `CachedContentFactory` (slow), but I think it is less than ideal that it is no longer possible to construct pipelines with a low memory footprint (see the first sketch below).
- The approach adds a lot of complexity (which is currently poorly documented). Again, even in this simple example, I am forced to somehow specify which of the many `ContentElement`s I actually want to use, even though this is self-evident from the pipeline setup. Without doing so, extraction takes place on all of the content (see the second sketch below).
- It is also unclear to me how this mechanism behaves in more complex scenarios where we do aggregation and/or segmentation of `Retrievable`s. What happens, for example, if we create new segments (i.e., new `Retrievable`s) that replace the incoming ones? We can of course emit the new `Retrievable`s for processing. But since the source `Retrievable`'s relationships cannot be changed, both segmentations are kept around and, by the current logic, are thus persisted (see the third sketch below).
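To illustrate the first point, here is a minimal, self-contained sketch of why append-only content accumulation blows up memory in a Decode > Aggregate > Extract pipeline. The classes are simplified stand-ins, not the actual engine types, and the sizes are scaled down so the example stays runnable.

```kotlin
// Simplified stand-ins, not the actual engine classes. Numbers are illustrative.

/** One decoded frame; a real 1080p RGB frame is roughly 6 MB. */
class FrameContent(val pixels: ByteArray = ByteArray(100 * 1024)) // scaled down to stay runnable

class Ingested {
    /** Append-only: every content version produced along the pipeline stays reachable. */
    val content = mutableListOf<FrameContent>()
}

fun main() {
    val ingested = Ingested()

    // Decode: one content element per frame of a one-minute segment at 25 fps.
    repeat(60 * 25) { ingested.content += FrameContent() }

    // Aggregate: derive a single representative element. With append-only semantics the
    // 1500 decoded frames stay referenced until the whole video has been extracted and
    // persisted; at real frame sizes that is on the order of 9 GB per minute of video.
    val representative = FrameContent()
    ingested.content += representative

    // With replace semantics the aggregation step could free the raw frames instead:
    // ingested.content.clear(); ingested.content += representative
    println("retained content elements: ${ingested.content.size}") // 1501
}
```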
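The second point in a nutshell, again with placeholder author names rather than the real attribute values:

```kotlin
// Placeholder names ("decoder", "aggregator"); not the real attribute values.

data class Content(val author: String, val data: String)

fun main() {
    // After Decode > Aggregate, the retrievable carries content from both operators.
    val accumulated = listOf(
        Content(author = "decoder", data = "frame-0"),
        Content(author = "decoder", data = "frame-1"),
        Content(author = "aggregator", data = "representative-frame")
    )

    // Without an explicit filter the extractor sees (and processes) everything ...
    val unfiltered = accumulated

    // ... so every extractor has to be told which author it is interested in, even though
    // the pipeline shape (Decode > Aggregate > Extract) already implies the answer.
    val filtered = accumulated.filter { it.author == "aggregator" }

    println("unfiltered: ${unfiltered.size}, filtered: ${filtered.size}") // 3 vs. 1
}
```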
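And a sketch of the third point, assuming a simplified, append-only relationship model (not necessarily the actual implementation):

```kotlin
// Assumes a simplified, append-only relationship model for illustration only.

class Segment(val id: String)

class SourceRetrievable {
    private val segments = mutableListOf<Segment>()
    /** Relationships can be read and extended, but never removed. */
    val partOf: List<Segment> get() = segments
    fun attach(segment: Segment) { segments += segment }
}

fun main() {
    val source = SourceRetrievable()

    // Original segmentation, e.g. per-shot retrievables produced early in the pipeline.
    listOf("shot-1", "shot-2").forEach { source.attach(Segment(it)) }

    // A later operator re-segments and emits a new retrievable that is meant to replace
    // the original ones.
    source.attach(Segment("scene-1"))

    // Since the source's relationships cannot be changed, both segmentations stay
    // attached and, by the current logic, both end up being persisted.
    println(source.partOf.map { it.id }) // [shot-1, shot-2, scene-1]
}
```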
Overall, I get the feeling that we have added a lot of complexity to cover a specific edge case. This complexity seems to have a negative impact on the cases we cover regularly, and to me it seems that there remain open questions as to how this mechanism should behave in different scenarios.

Therefore, before expanding upon this feature, I would like to stop, pause, and think about whether we are headed in the right direction. This issue should be used to document the discussion and come up with a design specification. Maybe this is something that needs to be discussed during one of our meetings.
@lucaro and @faberf: Since this is, in a way, your brainchild, I added you as assignees.