treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0
4.35k stars 343 forks source link

Design "pack files" for Graveler committed data #5613

Open arielshaqed opened 1 year ago

arielshaqed commented 1 year ago

Why

This is an issue to design a particular performance improvement that applies to many cross-commit operations. It should be able to help speed up all of these on large repositories that perform many lakeFS operations:

I have seen the first 2 as user and customer issues, and I have seen the third in testing.

Packs

European Gray Wolf Pack, linked to image page and license

Periodically create "pack files" for lakeFS immutable data. Packs are smaller and held in fewer objects than the corresponding commits. We will need two kinds of pack files:

The power of packs

Packs are strictly optional. Code can never assume it has a pack for every commit that it needs to traverse. Packs can speed things up!

GC

Finding expired objects in GC is data intensive: every physical path in every range in every metarange in every commit must be annotated with whether or not it is expired, and then physical addresses must be reduced. So all ranges are scanned!

A path pack allows much faster computation: fewer objects need to be fetched from the backing store, and we can use algorithms on the commits graph to speed up listing the objects in the path pack which appear only in expired commits.

Blame

Path packs provide obvious opportunities to speed up logging by path.

Find merge base

When traversing the commits graph, as soon as a a graph pack is hit it can be traversed more rapidly than DynamoDB: just read the "parents" column from the pack.

How to hold commits

This is an interesting part of discovery: what's a good way to hold the commits at which a given <lakeFS path, physical path> appeared?

A good first phase might be to hold a bitmap of whether each keypair appears in each commit.

But we can probably do much better as a second phase! Hold a for each keypair the commits at which it appeared and the commits at which it vanished. Typically a keypair appears on a single commit (and new changes in lakeFS uncommitted data will guarantee this in future), and vanishes on few commits. So this list can be short. Now we can perform algorithms on ranges in graphs to determine when the object on a path changed :-)

github-actions[bot] commented 10 months ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.