ppanopticon closed this 7 months ago
I'd argue that once a segmenter starts work on a new input, it should always generate a retrievable and send it downstream, regardless of whether it has any associated content. Downstream operators are free to ignore it if it lacks the information they need. Since the segmenter is the first stage that emits retrievables, it would be easiest to create these document-level instances there rather than in the enumerator. That does, however, mean that all the relevant information also needs to be propagated to the segmenter. That should be doable via the source construct.
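To make the proposal concrete, here is a minimal sketch in Python of a segmenter that always emits a document-level retrievable first, even for an input without content. `Source`, `Retrievable`, `segment` and the `parent_id` link are hypothetical names chosen for illustration; they are not the project's actual types.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional
from uuid import UUID, uuid4


@dataclass
class Source:
    """Hypothetical stand-in for the source construct; carries what the enumerator knows."""
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Retrievable:
    """Hypothetical retrievable; parent_id links a segment to its document-level instance."""
    id: UUID
    source: Source
    content: list
    parent_id: Optional[UUID] = None


def segment(inputs: Iterable[tuple]) -> Iterator[Retrievable]:
    """Emit one document-level retrievable per source, then one per content chunk."""
    for source, chunks in inputs:
        document = Retrievable(uuid4(), source, [])
        yield document  # emitted even when the source has no content
        for chunk in chunks:
            yield Retrievable(uuid4(), source, [chunk], parent_id=document.id)
```

Because the document-level instance holds a reference to its source, any file information attached there travels downstream with it, and operators that do not need it can simply skip retrievables with empty content.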
This has been done with `FileSourceMetadata`. If additional fields are needed, open a new ticket.
Task Description
For retrievables that represent a file, we need the ability to extract, store and retrieve file information (i.e., file name, absolute path, created, modified) as a `struct`. This probably has to be implemented at the enumerator level. However, there are some architectural considerations that need clarification:
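As an illustration of the extraction step only, the following Python sketch reads the four fields named in the task description from the file system. `FileInfo` and `extract_file_info` are hypothetical names, and the handling of the creation time is an assumption: not every platform exposes one, so the sketch leaves it empty where it is unavailable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical container for the file information listed in the task description."""
    file_name: str
    absolute_path: str
    created: Optional[datetime]  # None where no creation time is available
    modified: datetime


def extract_file_info(path) -> FileInfo:
    """Read name, absolute path and timestamps for a single file."""
    resolved = Path(path).resolve()
    stat = resolved.stat()
    # st_birthtime is not present on every platform; treat a missing value as unknown.
    birth = getattr(stat, "st_birthtime", None)
    created = datetime.fromtimestamp(birth, tz=timezone.utc) if birth is not None else None
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return FileInfo(resolved.name, str(resolved), created, modified)
```

Storing and retrieving the resulting record is left out here, since that depends on the architectural questions raised in this ticket.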
Dependencies
None
Boundary Conditions
None