This change introduces the original_path and source_file_id fields to track file provenance: original_path records where an ingested file originally lived, and source_file_id links a derived file to the file it was produced from. It also includes the related schema updates and workflow adjustments.
Technical Details:
Schema Modifications:
The FileCreate and FileResponse schemas in src/api/v1/schemas.py now include an optional original_path field, which stores the file's path before ingestion and preserves that context for investigations. A new source_file_id field has also been added to link derived files back to their source file, enabling provenance tracking.
A FileResponseCompact schema has been added and embedded in the FileResponse schema to provide a compact representation of the source file. This keeps the response smaller than it would be with a full source file record.
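The schema shape described above could look roughly like the sketch below. It uses standard-library dataclasses only to stay dependency-free; the real schemas in src/api/v1/schemas.py may be defined with a different library. The filename field, the integer id type, and the source_file attribute name are assumptions made for illustration and are not taken from the change itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileResponseCompact:
    # Assumed minimal fields that identify the source file.
    id: int
    filename: str


@dataclass
class FileCreate:
    filename: str  # assumed pre-existing field
    # New optional provenance fields; None when the information is unknown.
    original_path: Optional[str] = None
    source_file_id: Optional[int] = None


@dataclass
class FileResponse:
    id: int
    filename: str  # assumed pre-existing field
    original_path: Optional[str] = None
    source_file_id: Optional[int] = None
    # Compact view of the source file (attribute name is an assumption).
    source_file: Optional[FileResponseCompact] = None
```

Because both new fields default to None, existing callers that do not supply provenance information would keep working under this sketch.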
Ingestion Logic Update: The process_successful_task function in src/mediator/mediator.py now extracts original_path and source_file_id from file_data during file ingestion and stores them on the newly created file record in the database.
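A minimal sketch of that extraction step, assuming file_data is a plain dictionary. The helper name build_file_record and the filename key are hypothetical; the actual process_successful_task function may be structured differently.

```python
from typing import Any, Dict


def build_file_record(file_data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: collect the values stored on the new file record."""
    return {
        "filename": file_data["filename"],  # assumed required key
        # Both provenance fields are optional, so a missing key becomes None.
        "original_path": file_data.get("original_path"),
        "source_file_id": file_data.get("source_file_id"),
    }
```

Using dict.get for the two new keys means producers that do not yet emit them would still be ingested, with the provenance fields left empty.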
Workflow Update: The run_workflow function in src/api/v1/workflows.py now includes the id of the input file in the data passed to the workflow. This likely supports the new source_file_id functionality: a workflow that knows the id of its input can record it on the files it produces, which maintains the lineage between source and derived data.
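One way the input file id could flow through a workflow is sketched below. This is an illustration of the likely intent, not the actual run_workflow code: both function names, the path key, and the dictionary layout are assumptions.

```python
from typing import Any, Dict


def build_workflow_input(input_file: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical: data handed to the workflow, now including the file id."""
    return {
        "id": input_file["id"],
        "path": input_file["path"],  # assumed existing key
    }


def describe_derived_file(workflow_input: Dict[str, Any], filename: str) -> Dict[str, Any]:
    """Hypothetical: file_data for an output file, linked back to its input."""
    return {
        "filename": filename,
        # The input file's id becomes the derived file's source_file_id.
        "source_file_id": workflow_input["id"],
    }
```

For example, a workflow started on file 42 would tag each output with source_file_id 42, so ingestion can link the outputs to their source.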