treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

GC: Changes to run id proposal + GC metadata and data location #4469

Closed N-o-Z closed 1 year ago

N-o-Z commented 2 years ago

The following suggestion is also required for the GC+ implementation.

  1. run_id to be created in reverse lexicographical order, so that newer runs sort first
  2. Move all run information into a single prefix and out of the logs path
  3. Use run_id as part of the runs path instead of a timestamp
  4. Record the start time and end time of each run and add them to the summary
  5. Optional: Add run_id to the summary
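
A minimal sketch of how item 1 could work, in Python. This issue does not specify how the run_id is generated, so everything here is an assumption: the name `make_run_id`, the millisecond timestamp input, the `MAX_MILLIS` bound, and the random suffix are all hypothetical. The idea is to subtract the timestamp from a fixed maximum and zero-pad the result, so that plain string ordering lists newer runs first.

```python
import secrets
from typing import Optional

# Hypothetical upper bound: the largest value that fits in 13 decimal digits.
MAX_MILLIS = 10**13 - 1


def make_run_id(now_millis: int, suffix: Optional[str] = None) -> str:
    """Return an ID whose plain string ordering lists newer runs first."""
    if not 0 <= now_millis <= MAX_MILLIS:
        raise ValueError("timestamp out of range")
    # A later timestamp gives a smaller inverted value, which sorts earlier.
    inverted = MAX_MILLIS - now_millis
    if suffix is None:
        # Random suffix (8 hex characters) to separate runs started in the same millisecond.
        suffix = secrets.token_hex(4)
    # Fixed-width zero padding keeps string order equal to numeric order.
    return f"{inverted:013d}-{suffix}"
```

With IDs shaped like this, listing the runs prefix in the object store's default (ascending) order would return the most recent run first.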

GC Run Path Structure

The GC run path will be the following: _lakefs/gc/<run_id>/

Under the path, include the following files:

summary.json
expired_addresses/.parquet
commits/.parquet
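
A rough sketch of what writing summary.json with the run times could look like, in Python. The field names `run_id`, `start_time`, and `end_time` and the function `build_summary` are hypothetical. This issue only says that start time, end time, and optionally run_id should be added to the summary. It does not define the summary schema.

```python
import json
from datetime import datetime, timezone


def build_summary(run_id: str, start: datetime, end: datetime) -> str:
    """Serialize a run summary as JSON (field names are assumptions)."""
    if end < start:
        raise ValueError("end time precedes start time")
    summary = {
        "run_id": run_id,  # optional per item 5 of the suggestion
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
    }
    return json.dumps(summary, indent=2)


# Usage example with fixed UTC timestamps.
example = build_summary(
    "example-run",
    datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2022, 1, 1, 0, 5, tzinfo=timezone.utc),
)
```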

For GC+ the following files will be added:

uncommitted/*.parquet
metadata/< metaranges and ranges >
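
The layout above can be sketched as a small path helper, in Python. The function name `run_layout` and the `gc_plus` flag are hypothetical. The prefixes are taken from the paths listed in this suggestion, and only directory-level prefixes are built, since the individual file names are not specified here.

```python
GC_PREFIX = "_lakefs/gc"


def run_layout(run_id: str, gc_plus: bool = False) -> dict:
    """Map each artifact of a GC run to its location under _lakefs/gc/<run_id>/."""
    if not run_id or "/" in run_id:
        raise ValueError("run_id must be a non-empty string without '/'")
    base = f"{GC_PREFIX}/{run_id}/"
    layout = {
        "summary": base + "summary.json",
        "expired_addresses": base + "expired_addresses/",
        "commits": base + "commits/",
    }
    if gc_plus:
        # Additional locations proposed for GC+ only.
        layout["uncommitted"] = base + "uncommitted/"
        layout["metadata"] = base + "metadata/"
    return layout
```

Keeping everything for a run under one prefix means a single list or delete on `_lakefs/gc/<run_id>/` covers all of that run's information.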

Benefits / rationale

Open Questions

  1. In the original proposal, lakeFS creates the run_id, but the GC client currently creates the timestamps (for the log path). This seems like an inconsistency that needs to be resolved.
N-o-Z commented 2 years ago

@treeverse/ecosystem Please review the following suggestion and add your comments.

N-o-Z commented 2 years ago

@treeverse/versioning-engine FYI

arielshaqed commented 1 year ago

@N-o-Z still relevant? Please reply and close or unassign.

N-o-Z commented 1 year ago

No longer relevant as it was decided to implement it differently