Why
This is an issue to design a particular performance improvement that applies to many cross-commit operations. It should be able to help speed up all of these on large repositories that perform many lakeFS operations:
Computing expired addresses in GC
Path-specific lakectl log and "blame"
Finding merge bases (requires only the graph pack, the smaller part of this design)
I have seen the first two as user and customer issues, and the third in testing.
Packs
Periodically create "pack files" for lakeFS immutable data. Packs are smaller and held in fewer objects than the corresponding commits. We will need two kinds of pack files:
Graph packs: A Parquet file that holds part of the commits DAG. Having the commits DAG allows rapid graph traversal.
Path packs: A Parquet file (so actually many objects on the backing store) that holds the contents of a large number of Graveler metaranges. Keys are TBD: either <lakeFS key,physical path> or <lakeFS key,some commit where this physical path appeared>, so every physical path that a key ever held can appear. Values include all object metadata as well as an identification of all commits where that key holds that physical path. Sketches of some ways to do this appear below.
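As a purely illustrative sketch of what rows in the two pack kinds might carry — the names GraphPackRow, PathPackRow, and every column name here are hypothetical, not an actual lakeFS schema:

```python
from dataclasses import dataclass

# Hypothetical row shapes for the two pack kinds. In practice these would
# be columns of a Parquet file; plain dataclasses stand in for them here.

@dataclass
class GraphPackRow:
    commit_id: str
    parents: list[str]       # enables DAG traversal without a KV lookup per commit

@dataclass
class PathPackRow:
    lakefs_key: str          # logical path in the repository
    physical_path: str       # address on the backing object store
    metadata: dict           # object metadata (size, checksum, ...)
    commits: list[str]       # commits where this key holds this physical path

row = PathPackRow("data/a.parquet", "s3://bucket/obj1", {"size": 1024}, ["c1", "c2"])
```

The key point of the path-pack shape is the `commits` column: one row answers "in which commits did this key point at this physical path" without scanning metaranges.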
The power of packs
Packs are strictly optional: code can never assume it has a pack for every commit that it needs to traverse. Packs can only ever speed things up!
GC
Finding expired objects in GC is data-intensive: every physical path in every range in every metarange in every commit must be annotated with whether or not it is expired, and then the physical addresses must be reduced (deduplicated across commits). So all ranges are scanned!
A path pack allows much faster computation: fewer objects need to be fetched from the backing store, and we can use algorithms on the commits graph to speed up listing the objects in the path pack which appear only in expired commits.
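A minimal sketch of that computation, assuming the path pack can be read as (physical path, commits-where-present) pairs; `expired_addresses` and this pair layout are hypothetical, not lakeFS's actual GC code:

```python
def expired_addresses(path_pack, expired_commits):
    """Return physical addresses that appear only in expired commits.

    path_pack: iterable of (physical_path, commits_where_present) pairs.
    expired_commits: set of commit ids considered expired.
    Hypothetical sketch -- the real computation would run over Parquet files.
    """
    seen = set()
    live = set()
    for physical_path, commits in path_pack:
        seen.add(physical_path)
        # one non-expired commit is enough to keep the address alive
        if any(c not in expired_commits for c in commits):
            live.add(physical_path)
    return seen - live

pack = [
    ("s3://b/obj1", {"c1"}),        # present only in expired c1 -> collectable
    ("s3://b/obj2", {"c1", "c2"}),  # still referenced by live c2 -> keep
]
print(expired_addresses(pack, {"c1"}))  # -> {'s3://b/obj1'}
```

Because each physical address carries its full commit set in one row, the reduce step becomes a single pass over the pack instead of a scan of every range in every metarange.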
Blame
Path packs provide obvious opportunities to speed up logging by path.
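For instance, a path-scoped log could become one pack lookup plus a walk of the commit order, instead of opening a metarange per commit. `log_path` and the pack layout here are hypothetical illustrations:

```python
def log_path(path_pack, key, commit_order):
    """Yield (commit, physical_path) history for one lakeFS key.

    path_pack: dict mapping key -> list of (physical_path, commits) entries.
    commit_order: commit ids from newest to oldest along the log.
    Hypothetical sketch of using a path pack for per-path log/blame.
    """
    present = {}  # commit -> physical path at that commit
    for physical_path, commits in path_pack.get(key, []):
        for c in commits:
            present[c] = physical_path
    for c in commit_order:
        if c in present:
            yield c, present[c]

pack = {"data/a": [("s3://b/o1", ["c1"]), ("s3://b/o2", ["c2", "c3"])]}
history = list(log_path(pack, "data/a", ["c3", "c2", "c1"]))
# -> [('c3', 's3://b/o2'), ('c2', 's3://b/o2'), ('c1', 's3://b/o1')]
```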
Find merge base
When traversing the commits graph, as soon as a graph pack is hit it can be traversed more rapidly than via DynamoDB: just read the "parents" column from the pack.
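The traversal itself is then a plain BFS over an in-memory parents mapping (standing in for the pack's "parents" column). This `merge_base` is a simplified common-ancestor sketch, not lakeFS's actual algorithm:

```python
from collections import deque

def ancestors(parents, commit):
    """All ancestors of commit, inclusive, by BFS over the parents mapping."""
    seen, queue = {commit}, deque([commit])
    while queue:
        for p in parents.get(queue.popleft(), []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def merge_base(parents, a, b):
    """Common ancestors of a and b that no other common ancestor descends
    from -- a simplified sketch of merge-base finding over a graph pack."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}

# A small diamond: r is the merge base of the two branch tips a and b.
parents = {"m": ["a", "b"], "a": ["r"], "b": ["r"], "r": []}
print(merge_base(parents, "a", "b"))  # -> {'r'}
```

The speedup comes entirely from where `parents` lives: reading it as one columnar pack instead of one KV round-trip per visited commit.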
How to hold commits
This is an interesting part of discovery: what's a good way to hold the commits at which a given <lakeFS path, physical path> appeared?
A good first phase might be to hold a bitmap of whether each keypair appears in each commit.
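With commits numbered in pack order, that bitmap is just one bit per commit per keypair. A tiny stdlib sketch (the layout and helper names are hypothetical):

```python
# First-phase sketch: commits are numbered in pack order, and each
# <lakeFS key, physical path> pair carries an integer used as a bitmap.
commit_index = {"c0": 0, "c1": 1, "c2": 2}

def set_present(bitmap, commit):
    return bitmap | (1 << commit_index[commit])

def is_present(bitmap, commit):
    return bool(bitmap & (1 << commit_index[commit]))

bm = 0
bm = set_present(bm, "c0")
bm = set_present(bm, "c2")
print(is_present(bm, "c1"))  # -> False
```

In a real pack a compressed bitmap (e.g. roaring-style) would replace the raw integer, since most keypairs appear in a small fraction of commits.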
But we can probably do much better as a second phase! Hold, for each keypair, the commits at which it appeared and the commits at which it vanished. Typically a keypair appears on a single commit (and new changes in lakeFS uncommitted data will guarantee this in future), and vanishes on few commits. So this list can be short. Now we can perform algorithms on ranges in graphs to determine when the object on a path changed :-)
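A sketch of that second-phase check, under the stated assumption that a keypair appears on a single commit and any vanish is a descendant of that appearance; `is_live_at` and its layout are hypothetical:

```python
from collections import deque

def ancestors_of(parents, commit):
    """All ancestors of commit, inclusive, by BFS over the parents mapping."""
    seen, queue = {commit}, deque([commit])
    while queue:
        for p in parents.get(queue.popleft(), []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def is_live_at(appeared, vanished, commit, parents):
    """Whether a keypair still holds its physical path at `commit`.

    appeared / vanished: small sets of commits where the pair was added or
    removed. Simplified sketch: assumes a single appearance, with vanishes
    only on commits descending from it.
    """
    anc = ancestors_of(parents, commit)
    return bool(appeared & anc) and not (vanished & anc)

# Pair appears at c1 and vanishes at c3; branch b1 forked before the vanish.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "b1": ["c1"]}
print(is_live_at({"c1"}, {"c3"}, "b1", parents))  # -> True
print(is_live_at({"c1"}, {"c3"}, "c3", parents))  # -> False
```

Each liveness question is then set intersections over the short appeared/vanished lists plus one graph reachability query, instead of opening any metarange.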