vitrivr / vitrivr-engine

vitrivr's next-generation retrieval engine
MIT License

Pipeline that Starts with Previously Segmented Media #63

Open faberf opened 3 months ago

faberf commented 3 months ago

Consider the use case where a large catalogue has already been segmented and some features have been extracted. Now an additional feature needs to be extracted and attached to the existing segments. Currently, there is no practical way to do this (AFAIK), so the entire pipeline has to be rerun.

I propose implementing an operator that retrieves persisted segments along with their source attributes. This operator would be the first operator in the extraction pipeline for new features. I am not sure whether retrieval at indexing time is meant to work with the existing querying system, or whether problems will arise here. Also, current pipeline configs require the enumerators to come first, so work is needed there as well.
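To make the proposal concrete, here is a minimal sketch of such an operator. It is written in Python for brevity, and every name in it (`Segment`, `PersistedSegmentSource`, `extract_missing`, the dictionary used as a store) is hypothetical and not part of the vitrivr-engine API. It only illustrates the idea of starting a pipeline from persisted segments and attaching one new feature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


# Hypothetical stand-in for a persisted segment; not a vitrivr-engine type.
@dataclass
class Segment:
    segment_id: str
    source_path: str  # source attribute persisted alongside the segment
    start_s: float
    end_s: float
    features: Dict[str, List[float]] = field(default_factory=dict)


class PersistedSegmentSource:
    """Hypothetical initial operator: emits segments that are already stored."""

    def __init__(self, store: Dict[str, Segment]):
        self._store = store

    def __iter__(self) -> Iterator[Segment]:
        # Iterate in a stable order so that reruns are reproducible.
        for segment_id in sorted(self._store):
            yield self._store[segment_id]


def extract_missing(
    source: PersistedSegmentSource,
    name: str,
    extractor: Callable[[Segment], List[float]],
) -> List[str]:
    """Attach feature `name` to segments lacking it; return the updated ids."""
    updated = []
    for segment in source:
        if name in segment.features:
            continue  # feature already extracted in an earlier run
        segment.features[name] = extractor(segment)
        updated.append(segment.segment_id)
    return updated


# Usage: an in-memory dict stands in for the real persistence layer.
store = {
    "s1": Segment("s1", "a.mp4", 0.0, 5.0, {"clip": [0.1]}),
    "s2": Segment("s2", "a.mp4", 5.0, 9.0, {"clip": [0.2], "duration": [4.0]}),
}
updated = extract_missing(
    PersistedSegmentSource(store), "duration", lambda s: [s.end_s - s.start_s]
)
```

In this sketch only `s1` is touched, because `s2` already carries the `duration` feature. Whether the real engine could serve such a source from its existing query path is exactly the open question raised above.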

sauterl commented 3 months ago

This is an interesting question, which I think we should address. I see multiple use cases that could be tackled in one go:

faberf commented 3 months ago

Another idea: create a source that emits segments persisted in a previous run as retrievables, together with special content elements that describe which content elements are missing. Then implement a special decoder that takes the enumerated files and these retrieved retrievables (together with the gaps) and attempts to fill all the gaps.
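A rough sketch of this gap-filling idea, again in Python with purely hypothetical names (`Retrievable`, `fill_gaps`, and the `decode` callback are illustrations, not engine API). A missing content element is modelled as a `None` value, which is an assumption made only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


# Hypothetical retrievable: content maps a content kind to its value,
# with None marking a gap that still needs to be filled.
@dataclass
class Retrievable:
    retrievable_id: str
    source_path: str
    content: Dict[str, Optional[str]] = field(default_factory=dict)

    def gaps(self) -> List[str]:
        """Return the content kinds that are still missing."""
        return sorted(kind for kind, value in self.content.items() if value is None)


def fill_gaps(
    retrievables: List[Retrievable],
    enumerated_files: Set[str],
    decode: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Fill gaps from enumerated files; return (id, kind) pairs left unresolved."""
    unresolved = []
    for retrievable in retrievables:
        for kind in retrievable.gaps():
            if retrievable.source_path in enumerated_files:
                retrievable.content[kind] = decode(retrievable.source_path, kind)
            else:
                # The source file was not enumerated, so the gap stays open.
                unresolved.append((retrievable.retrievable_id, kind))
    return unresolved


# Usage: only a.mp4 is available in this run, so r2 keeps its gap.
retrieved = [
    Retrievable("r1", "a.mp4", {"image": "img", "audio": None}),
    Retrievable("r2", "b.mp4", {"image": None}),
]
unresolved = fill_gaps(retrieved, {"a.mp4"}, lambda path, kind: f"{kind}:{path}")
```

Reporting unresolved gaps instead of raising keeps a partial rerun usable when some source files are no longer reachable.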

faberf commented 1 month ago

I have an idea for solving this issue which also addresses the problem of restarting failed ingestions.

@ppanopticon @lucaro What changes would you make to this concept?

EDIT: I just realized this actually does not address the issue, as everything would be resegmented upon version update.

lucaro commented 1 month ago

I guess there are fundamentally only two (types of) mechanisms needed. The first is an enumerator that checks for every source whether it is already known and, if so, emits the relevant retrievable with its existing id without persisting it anew. The second is one (or possibly multiple) segmenters that look up the existing segment boundaries for an existing retrievable and emit the same retrievables with the same ids and content again. Any versioning you might want to do of pipelines is, in my view, completely independent of these mechanisms.
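The two mechanisms could look roughly like the following Python sketch. The class names, the dictionaries standing in for the persistence layer, and the `to_persist` list are all hypothetical and only illustrate the id-reuse behaviour described here; they are not vitrivr-engine API.

```python
import uuid
from typing import Dict, List, Tuple

# (segment id, start in seconds, end in seconds); a hypothetical layout.
Boundary = Tuple[str, float, float]


class ReusingEnumerator:
    """Emits the existing id for known sources and a fresh id otherwise."""

    def __init__(self, known_sources: Dict[str, str]):
        self.known_sources = known_sources  # source path -> existing id
        self.to_persist: List[str] = []     # ids that still need persisting

    def emit(self, path: str) -> str:
        existing = self.known_sources.get(path)
        if existing is not None:
            return existing  # already known: reuse the id, do not persist anew
        new_id = str(uuid.uuid4())
        self.known_sources[path] = new_id
        self.to_persist.append(new_id)
        return new_id


class ReusingSegmenter:
    """Looks up stored boundaries instead of segmenting the media again."""

    def __init__(self, boundaries: Dict[str, List[Boundary]]):
        self.boundaries = boundaries  # source id -> stored segments

    def segment(self, source_id: str) -> List[Boundary]:
        # Return a copy so callers cannot mutate the stored boundaries.
        return list(self.boundaries.get(source_id, []))


# Usage: a.mp4 is known from a previous run, b.mp4 is new.
enumerator = ReusingEnumerator({"a.mp4": "src-1"})
segmenter = ReusingSegmenter({"src-1": [("seg-1", 0.0, 5.0), ("seg-2", 5.0, 9.0)]})

known_id = enumerator.emit("a.mp4")
new_id = enumerator.emit("b.mp4")
segments = segmenter.segment(known_id)
```

Because both operators only consult what is already stored, neither of them needs to know anything about pipeline versions, which matches the point that versioning is a separate concern.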