Open faberf opened 7 months ago
This is an interesting question, which I think we should address. I see multiple use cases that could be tackled in one go:
Another idea: create a source that emits previously persisted segments as retrievables, together with special content elements that describe which content elements are missing. Then implement a special decoder that takes the enumerated files and these retrieved retrievables (together with the gaps) and attempts to fill all the gaps.
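To make the gap idea concrete, here is a rough Python sketch. It is not engine code: `Gap`, `Retrievable`, `persisted_source`, and `gap_filling_decoder` are hypothetical names, and the in-memory list and dicts stand in for the real persistence layer and file enumeration.

```python
from dataclasses import dataclass, field


@dataclass
class Gap:
    """Placeholder content element naming a content kind that is missing."""
    kind: str


@dataclass
class Retrievable:
    id: str
    source_path: str
    content: dict = field(default_factory=dict)  # kind -> value or Gap


def persisted_source(store, required_kinds):
    """Emit persisted segments, marking each missing content kind with a Gap."""
    for seg in store:
        content = dict(seg.content)  # copy so the store is not mutated
        for kind in required_kinds:
            content.setdefault(kind, Gap(kind))
        yield Retrievable(seg.id, seg.source_path, content)


def gap_filling_decoder(retrievables, files, decoders):
    """Fill each Gap by decoding the enumerated file the segment came from."""
    for r in retrievables:
        for kind, value in list(r.content.items()):
            if isinstance(value, Gap) and r.source_path in files:
                # Hypothetical decoder signature: (file data, segment id) -> content
                r.content[kind] = decoders[kind](files[r.source_path], r.id)
        yield r
```

Segments whose source file is not among the enumerated files keep their `Gap` markers, so a later run could still pick them up.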
I have an idea for solving this issue which also addresses the problem of restarting failed ingestions.
@ppanopticon @lucaro What changes would you make to this concept?
EDIT: I just realized this actually does not address the issue, as everything would be resegmented upon version update.
I guess there are fundamentally only two (types of) mechanisms needed: (1) an enumerator that checks for every source whether it is already known and, if so, emits the relevant retrievable with its existing id without persisting it anew, and (2) one or more segmenters that look up the existing segment boundaries for an existing retrievable and emit the same retrievables with the same ids and content again. Any versioning of pipelines you might want to do is, in my view, completely independent of these mechanisms.
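A minimal Python sketch of these two mechanisms, just to pin down the behaviour. The class names, the `persisted` list, and the dict-based lookups are all assumptions for illustration and do not correspond to existing engine classes.

```python
import uuid


class KnownSourceEnumerator:
    """Reuses the stored id for known sources; only new sources are persisted."""

    def __init__(self, known_sources):
        self.known = known_sources  # source path -> existing retrievable id
        self.persisted = []         # paths persisted by this instance

    def enumerate(self, paths):
        for path in paths:
            if path in self.known:
                # Already known: emit with the existing id, do not persist again.
                yield self.known[path], path
            else:
                new_id = str(uuid.uuid4())
                self.known[path] = new_id
                self.persisted.append(path)
                yield new_id, path


class BoundaryLookupSegmenter:
    """Re-emits segments using stored boundaries instead of segmenting anew."""

    def __init__(self, boundaries):
        # source id -> list of (segment id, start index, end index)
        self.boundaries = boundaries

    def segment(self, source_id, frames):
        for seg_id, start, end in self.boundaries.get(source_id, []):
            yield seg_id, frames[start:end]
```

Because ids and boundaries come from lookups, running the same sources through twice yields the same retrievables, which is what lets new extractors attach to existing segments.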
Consider the use case where a large catalogue has already been segmented and some features have been extracted. Now an additional feature needs to be extracted and connected to the existing segments. As far as I know, there is currently no practical way to do this, and the entire pipeline needs to be rerun.
I propose implementing an operator that retrieves segments that have been persisted, along with their source attributes. This operator would be the initial operator in the extraction pipeline for new features. I am not sure whether retrieval at indexing time is meant to work with the existing querying system or whether problems will arise here. Also, current pipeline configs require the enumerators to come first, so work is needed there as well.
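Roughly what I have in mind, as a Python sketch under heavy assumptions: `extract_new_feature` is a hypothetical function, and plain dicts stand in for the segment and feature stores.

```python
def extract_new_feature(segment_store, feature_store, feature_name, extractor):
    """Run `extractor` on persisted segments that lack `feature_name`.

    segment_store: segment id -> persisted segment content
    feature_store: (segment id, feature name) -> feature value
    Returns the number of features added in this run.
    """
    added = 0
    for seg_id, segment in segment_store.items():
        key = (seg_id, feature_name)
        if key in feature_store:
            continue  # feature already extracted for this segment, skip it
        feature_store[key] = extractor(segment)
        added += 1
    return added
```

Since segments that already carry the feature are skipped, the same call could also resume an ingestion that failed partway through.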