vitrivr / vitrivr-engine

vitrivr's next-generation retrieval engine. It is capable of extracting and retrieving a wider range of multimedia objects such as audio, video, images or 3D models.
https://vitrivr.org
MIT License

Technical Metadata Extraction for Files #14

Closed: ppanopticon closed 7 months ago

ppanopticon commented 1 year ago

Task Description

For retrievables that represent a file, we need the ability to extract, store and retrieve file information (i.e., file name, absolute path, creation timestamp, modification timestamp).
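To make the requested fields concrete, here is a minimal, language-neutral sketch in Python of reading them from the file system. The names `FileInfo` and `extract_file_info` are hypothetical and are not part of vitrivr-engine; the sketch only illustrates which values are meant. Note that a true creation time is not available on every platform, so the fallback below is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record holding the fields named in the task description."""
    name: str
    absolute_path: str
    created: datetime
    modified: datetime


def extract_file_info(path) -> FileInfo:
    """Read file name, absolute path, created and modified timestamps."""
    p = Path(path).resolve()
    st = p.stat()
    # st_birthtime only exists on some platforms (e.g. macOS/BSD). Elsewhere we
    # fall back to st_ctime, which on Linux is the last metadata change and
    # therefore only an approximation of the creation time.
    created_ts = getattr(st, "st_birthtime", st.st_ctime)
    return FileInfo(
        name=p.name,
        absolute_path=str(p),
        created=datetime.fromtimestamp(created_ts, tz=timezone.utc),
        modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    )
```

Storing both timestamps in UTC avoids ambiguity when the extracted metadata is later retrieved on a machine in a different time zone.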

This probably needs to be implemented at the enumerator level. However, there are some architectural considerations that need clarification:

Dependencies

None

Boundary Conditions

None

lucaro commented 1 year ago

I'd argue that once a segmenter starts work on a new input, it should always generate a retrievable and send it downstream, regardless of whether it has any associated content. The downstream operators are free to ignore it if it lacks the information they need. Since the segmenter stage is the first one that emits retrievables, it would be easiest to create these document-level instances there rather than in the enumerator. That does, however, mean that all the relevant information also needs to be propagated there. That should be doable via the source construct.
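A rough sketch of that idea, in Python for brevity: the segmenter emits one content-free, document-level retrievable per input before any segments, and the file information travels with the source. All names here (`Source`, `Retrievable`, `segment`, `content_only`) are hypothetical and do not reflect the actual vitrivr-engine classes.

```python
import uuid
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class Source:
    """Hypothetical source construct that carries file information downstream."""
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Retrievable:
    """Hypothetical retrievable; content is None for document-level instances."""
    id: str
    source: Source
    content: Optional[bytes] = None


def segment(source: Source, chunks: Iterable[bytes]) -> Iterator[Retrievable]:
    """Emit one document-level retrievable first, then one per content chunk."""
    yield Retrievable(id=str(uuid.uuid4()), source=source)
    for chunk in chunks:
        yield Retrievable(id=str(uuid.uuid4()), source=source, content=chunk)


def content_only(stream: Iterable[Retrievable]) -> Iterator[Retrievable]:
    """Downstream operator that ignores retrievables without content."""
    for retrievable in stream:
        if retrievable.content is not None:
            yield retrievable
```

With this shape, an operator that only needs file information can consume the first retrievable, while content-based operators filter it out, as `content_only` does.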

ppanopticon commented 7 months ago

This has been done with FileSourceMetadata. If additional fields are needed, open a new ticket.