knowledgesystems / pipelines-scrum

Repository for tracking uncategorizable issues related to backend pipelines work
0 stars 0 forks source link

Investigating Minio version technology #983

Closed averyniceday closed 1 year ago

averyniceday commented 1 year ago

Done Condition (What do we need? Why do we need it? Keep this is small as possible!)

Explore additional features like adding description to versions, releasing tagging.

Technical Description (How are we going to achieve the above)

Potential Issues

Dependencies

Technical Requirements

Outside People/Teams

Changes

sheridancbio commented 1 year ago

This work was completed during a work session with Arfath yesterday. Follow up work will be in a new card which will include capturing a preliminary design around a plan to store data and maintain history in a Minio bucket, and to maintain a data directory in Dremio using an iceberg table (for mutability).

For creation of a version-tracked bucket we saw that our on-premises Minio server was currently incapable because the version tracking feature requires a clustered Minio deployment (not a single worker node setup). That is also now a follow up task [to convert the on-prem Minio server to a cluster deployment]. Instead we used the publicly available demo deployment of Minio here: https://play.min.io which did offered versioned buckets.

Several versions of a text file were put into the bucket, establishing the version history. Each new version of the object results in a new UUID identifier. Documentation explains that storage is not organized by diffs but by full atomic clones of the data. So we should consider separating large constant sections from smaller volatile sections (perhaps meta-data) if version depth is substantial and pervasive.

Version history can be queried through the CLI, web UI, or API. This was through an option to the bucket level list_object functionality (or the "ls" functionality in mc). Concise version tables are produced with dates, identifiers, and sizes. Delete operations on objects results in the addition of a DeleteMarker version of the object as the latest version of the object. Prior history remains available. Commands are also available to purge/rewrite history. Versions must be referred to by the version ID UUID, not the symbolic label "v3" or similar. If deletion of the latest version is performed, the next most recent version becomes the latest version. This can be used to "undelete" when a DeleteMarker version is deleted.

Python client from Minio was used to perform add, delete, list operations, fairly easily and straightforward. Responses were JSON formatted. List of versions for an object were obtained through the bucket operation list_objects.

Finally, each version of an object contains an attached user-accessible tag dictionary object. The API, cli and web UI allow easy editing of this dictionary object with arbitrary key-value pairs. However, after extended exploration, we were unable to find any built in functionality for searching for objects or version based on a specified key-value pair to be matched against these tag dictionary objects.