Pipeline that Starts with Previously Segmented Media

vitrivr / vitrivr-engine

vitrivr's next-generation retrieval engine

MIT License

Pipeline that Starts with Previously Segmented Media #63

Open faberf opened 3 months ago

faberf commented 3 months ago

Consider the use case where a large catalogue has already been segmented and some features have been extracted. Now an additional feature needs to be extracted and connected to the existing segments. Currently there is no practical way to do this (AFAIK), so the entire pipeline needs to be rerun.

I propose implementing an operator that retrieves segments that have been persisted, along with their source attributes. This operator would be the initial operator in the extraction pipeline for new features. I am not sure whether retrieval at indexing time is meant to work with the existing querying system, or whether problems will arise here. Also, current pipeline configs require the enumerators to come first, so work is needed there as well.
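A minimal sketch of what such an initial operator could look like, written in plain Python rather than against the vitrivr-engine API. The names `Segment`, `persisted_segments`, and `attach_feature` are hypothetical and exist only for illustration; the in-memory list stands in for whatever persistence layer actually holds the segments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator


@dataclass
class Segment:
    """Hypothetical stand-in for a persisted segment (not a vitrivr-engine type)."""
    segment_id: str
    source_path: str  # source attribute kept alongside the segment
    start_ms: int
    end_ms: int
    features: Dict[str, object] = field(default_factory=dict)


def persisted_segments(store: Iterable[Segment]) -> Iterator[Segment]:
    """Initial operator: emits already-persisted segments instead of enumerating files."""
    yield from store


def attach_feature(
    segments: Iterable[Segment],
    name: str,
    extractor: Callable[[Segment], object],
) -> Iterator[Segment]:
    """Downstream operator: adds one new feature to each existing segment."""
    for segment in segments:
        segment.features[name] = extractor(segment)
        yield segment


# Two segments from an earlier run, each with one feature already extracted.
store = [
    Segment("seg-1", "a.mp4", 0, 4000, {"colour": [0.1, 0.2]}),
    Segment("seg-2", "a.mp4", 4000, 9000, {"colour": [0.3, 0.4]}),
]

# A trivial "new feature": the segment duration in milliseconds.
updated = list(
    attach_feature(
        persisted_segments(store),
        "duration_ms",
        lambda s: s.end_ms - s.start_ms,
    )
)
```

The segment ids and the previously extracted `colour` feature are left untouched; only the new feature is added, so nothing has to be resegmented.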

sauterl commented 3 months ago

This is an interesting question, which I think we should address. I see multiple use cases that could be tackled in one go:

As described originally, the addition of a new feature for all retrievables
Updating an existing feature on all or some retrievables
With more verbose extraction logging, a mechanism for recovering from a partially successful extraction, e.g. resuming an extraction on all fields for some of the sources.

faberf commented 3 months ago

Another idea: Create a source which emits the segments persisted in a previous run as retrievables, together with special content elements that describe which content elements are missing. Then, implement a special decoder which takes enumerated files and these retrieved retrievables (together with the gaps) and attempts to fill all the gaps.
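The gap idea could be sketched roughly as follows. This is an illustration under assumptions, not a proposal for the actual types: persisted retrievables are modelled as a plain dictionary from id to extracted features, and `find_gaps` and `fill_gaps` are hypothetical helper names.

```python
from typing import Callable, Dict, Set

Features = Dict[str, object]


def find_gaps(persisted: Dict[str, Features], required: Set[str]) -> Dict[str, Set[str]]:
    """For each persisted retrievable, list the required features it lacks."""
    gaps = {}
    for retrievable_id, features in persisted.items():
        missing = required - set(features)
        if missing:
            gaps[retrievable_id] = missing
    return gaps


def fill_gaps(
    persisted: Dict[str, Features],
    gaps: Dict[str, Set[str]],
    extractors: Dict[str, Callable[[str], object]],
) -> None:
    """Run only the extractors needed to close each gap, in place."""
    for retrievable_id, missing in gaps.items():
        for name in sorted(missing):
            persisted[retrievable_id][name] = extractors[name](retrievable_id)


# seg-1 is complete; seg-2 lacks the "ocr" feature.
persisted = {
    "seg-1": {"colour": 1, "ocr": "hi"},
    "seg-2": {"colour": 2},
}
required = {"colour", "ocr"}

gaps = find_gaps(persisted, required)
fill_gaps(persisted, gaps, {"ocr": lambda rid: f"text-for-{rid}"})
```

Because the gaps are computed per retrievable, the same mechanism would cover both adding a new feature everywhere and resuming a partially successful extraction.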

faberf commented 1 month ago

I have an idea for solving this issue which also addresses the problem of restarting failed ingestions.

Include an option to configure the version of a pipeline config in the schema.
In the backend, there is a one-to-many mapping from source metadata to versioned pipelines.
The semantics are: source S has been fully processed by pipeline P1 version V1, pipeline P2 version V2, and so on.
Augment the enumerator to skip files that match given metadata.
Augment the sink to tag the source as completed (relative to the given pipeline and version).
All the tagging and checking logic should be reusable, to make it easy to develop new enumerators and sinks.
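As a rough sketch of the reusable tagging and checking logic from this list: `CompletionRegistry` and `enumerate_pending` are hypothetical names, and the in-memory dictionary is an assumption standing in for the backend mapping.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

PipelineVersion = Tuple[str, str]  # (pipeline name, version)


class CompletionRegistry:
    """One-to-many mapping from a source to the versioned pipelines that completed it."""

    def __init__(self) -> None:
        self._completed: Dict[str, Set[PipelineVersion]] = {}

    def is_completed(self, source: str, pipeline: str, version: str) -> bool:
        return (pipeline, version) in self._completed.get(source, set())

    def mark_completed(self, source: str, pipeline: str, version: str) -> None:
        """Called by the sink once a source has been fully processed."""
        self._completed.setdefault(source, set()).add((pipeline, version))


def enumerate_pending(
    sources: Iterable[str],
    registry: CompletionRegistry,
    pipeline: str,
    version: str,
) -> Iterator[str]:
    """Enumerator wrapper: skips sources already completed for this pipeline version."""
    for source in sources:
        if not registry.is_completed(source, pipeline, version):
            yield source


registry = CompletionRegistry()
registry.mark_completed("a.mp4", "P1", "V1")

pending_v1 = list(enumerate_pending(["a.mp4", "b.mp4"], registry, "P1", "V1"))
pending_v2 = list(enumerate_pending(["a.mp4", "b.mp4"], registry, "P1", "V2"))
```

In this sketch, a failed ingestion can be restarted under the same version and only the unfinished sources are enumerated again, whereas a new version makes every source pending.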

@ppanopticon @lucaro What changes would you make to this concept?

EDIT: I just realized this actually does not address the issue, as everything would be resegmented upon version update.

lucaro commented 1 month ago

I guess there are fundamentally only two (types of) mechanisms needed: an enumerator that checks for every source whether it is already known and, if so, emits the relevant retrievable with its existing id without persisting it anew; and one (or possibly multiple) segmenters that look up the existing segment boundaries for an existing retrievable and emit the same retrievables with the same ids and content again. Any versioning of pipelines you might want is, in my view, completely independent of these mechanisms.
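The two mechanisms might look roughly like this. It is a sketch under assumptions: `Catalogue`, `Retrievable`, `enumerate_source`, and `lookup_segments` are hypothetical names, the catalogue is an in-memory stand-in for the persistence layer, and re-decoding the content for each segment is left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Retrievable:
    id: str
    kind: str                # "source" or "segment"
    needs_persisting: bool   # False when the id already exists in the catalogue


@dataclass
class Catalogue:
    """Hypothetical view of what an earlier run persisted."""
    source_ids: Dict[str, str] = field(default_factory=dict)  # path -> source id
    boundaries: Dict[str, List[Tuple[str, int, int]]] = field(default_factory=dict)


def enumerate_source(path: str, catalogue: Catalogue) -> Retrievable:
    """Enumerator: reuses the existing id for a known source, creates one otherwise."""
    known_id = catalogue.source_ids.get(path)
    if known_id is not None:
        return Retrievable(known_id, "source", needs_persisting=False)
    return Retrievable(str(uuid.uuid4()), "source", needs_persisting=True)


def lookup_segments(
    source: Retrievable, catalogue: Catalogue
) -> List[Tuple[Retrievable, int, int]]:
    """Segmenter: re-emits stored segments with their original ids and boundaries."""
    return [
        (Retrievable(seg_id, "segment", needs_persisting=False), start_ms, end_ms)
        for seg_id, start_ms, end_ms in catalogue.boundaries.get(source.id, [])
    ]


catalogue = Catalogue(
    source_ids={"a.mp4": "src-1"},
    boundaries={"src-1": [("seg-1", 0, 4000), ("seg-2", 4000, 9000)]},
)

known = enumerate_source("a.mp4", catalogue)
unknown = enumerate_source("b.mp4", catalogue)
segments = lookup_segments(known, catalogue)
```

Nothing in this sketch refers to a pipeline version, which matches the point that versioning is independent of these two mechanisms.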