Closed NunoAlexandre closed 3 years ago
@kim @FintanH I appreciate your help formulating the requirements for this potential feature. The second point is too abstract, I would like to have something more specific to put it up against.
I think this is a good idea. It is much simpler than you seem to think; sorry for presuming deep familiarity with how libgit2 works. So, some background:
The terms refdb and odb (or "Object Database") are primarily libgit2 concepts, not git -- one of the motivations for libgit2 was to allow GitHub to plug in web-scale storage below the standard git protocol. Conceptually, the refdb is simply a directory tree refs/ in GIT_DIR, where the leaves (files) contain the SHA1 of a commit denoting a "branch head". Similarly, the odb is simply a directory tree objects/ in GIT_DIR which contains files whose name is their SHA1 hash, and which contain git objects (blobs, trees, commits, tags).
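To make the "directory tree whose leaves hold a SHA1" picture concrete, here is a minimal sketch that walks refs/ below a given GIT_DIR and collects the loose refs. It is for illustration only: the function names are made up, it ignores packed refs entirely, and it stores file contents as-is (a symbolic ref would show up as its "ref: ..." text). Real code should go through libgit2, as discussed in this thread.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect loose refs below `git_dir/rel`.
/// Keys are ref names such as "refs/heads/master"; values are the
/// trimmed file contents (normally a hex object id).
/// Hypothetical helper, not part of libgit2 or librad.
fn collect_loose_refs(
    git_dir: &Path,
    rel: &str,
    out: &mut BTreeMap<String, String>,
) -> io::Result<()> {
    for entry in fs::read_dir(git_dir.join(rel))? {
        let entry = entry?;
        let name = format!("{}/{}", rel, entry.file_name().to_string_lossy());
        if entry.file_type()?.is_dir() {
            // Inner nodes of the tree are directories: descend.
            collect_loose_refs(git_dir, &name, out)?;
        } else {
            // Leaves are files holding the target of the ref.
            let content = fs::read_to_string(entry.path())?;
            out.insert(name, content.trim().to_string());
        }
    }
    Ok(())
}

/// Entry point: all loose refs found under `git_dir/refs`.
fn loose_refs(git_dir: &Path) -> io::Result<BTreeMap<String, String>> {
    let mut out = BTreeMap::new();
    collect_loose_refs(git_dir, "refs", &mut out)?;
    Ok(out)
}
```

Calling `loose_refs(Path::new(".git"))` in a freshly cloned repository would typically return only a handful of entries, because most refs start out packed.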
Since both would eventually slow down git operations on large repositories (due to having to traverse the filesystem all the time), git employs packing techniques. If you run git gc, you'll notice that all the "loose objects" in objects/ are gone, and replaced with "packfiles". Lesser known is that it will also collect all your refs into a packed (i.e. single-file) representation.
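For reference, the packed representation of refs is a plain text file, packed-refs, in GIT_DIR: one "<object id> <refname>" pair per line, comment lines starting with "#", and an optional line starting with "^" that records the peeled target of the annotated tag on the preceding line. A minimal parsing sketch, assuming well-formed input (the type and function names are made up for this example):

```rust
/// One entry of a `packed-refs` file (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct PackedRef {
    name: String,
    target: String,
    /// For annotated tags: the object the tag ultimately points to.
    peeled: Option<String>,
}

/// Parse the textual content of a `packed-refs` file.
/// Malformed lines are skipped silently, which real code should not do.
fn parse_packed_refs(content: &str) -> Vec<PackedRef> {
    let mut refs: Vec<PackedRef> = Vec::new();
    for line in content.lines() {
        let line = line.trim_end();
        // Header and comment lines start with '#'.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // A '^' line carries the peeled id of the previous entry.
        if let Some(peeled) = line.strip_prefix('^') {
            if let Some(last) = refs.last_mut() {
                last.peeled = Some(peeled.to_string());
            }
            continue;
        }
        // Regular entry: "<object id> <refname>".
        if let Some((target, name)) = line.split_once(' ') {
            refs.push(PackedRef {
                name: name.to_string(),
                target: target.to_string(),
                peeled: None,
            });
        }
    }
    refs
}
```

Note that a repository can have a ref both packed and loose; in that case the loose file wins, which is one more reason to leave the lookup logic to libgit2.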
Regardless of whether the refs are packed or not, libgit2 will load them into memory all at once -- note that you also might have both packed and unpacked refs in your repo. So, we don't want to access the refs sidestepping libgit2, but it is fine from an efficiency perspective to not push down glob processing to C, but simply get an unconstrained iterator and do the filtering in Rust.
Eventually, we may run into performance or memory issues because there are just too many refs in our giant monorepo, at which point we may consider plugging in a "real" database, which libgit2 allows us to do (modulo Rust bindings). This backend then can implement whatever globbing we want (it just gets a string passed to it), and decide when to page in what.
So, what we want for now is simply a librad-internal wrapper around all things refs, so we use the libgit2 API only from one place. This thing can do globbing in Rust. If and when we get to the custom refdb part, we just take that globbing, and stick it into our backend, while the librad code continues to use the wrapper API.
One thing to add is that we don't have to stay compatible with git wildmatch -- we could even support PCRE if we wanted (although that would probably always be slow).
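A rough sketch of what such a wrapper could look like, assuming nothing about the existing librad code: `RefBackend`, `Refs` and `InMemory` are hypothetical names, and the in-memory backend only stands in for "whatever enumerates refs" (libgit2's unconstrained iterator today, possibly a custom refdb later). The filtering is applied lazily on top of the backend's iterator, so callers never touch the backend directly.

```rust
/// Hypothetical backend abstraction: anything that can enumerate
/// (refname, target) pairs.
trait RefBackend {
    fn iter_refs(&self) -> Box<dyn Iterator<Item = (String, String)> + '_>;
}

/// The single place the rest of the code would talk to.
struct Refs<B> {
    backend: B,
}

impl<B: RefBackend> Refs<B> {
    fn new(backend: B) -> Self {
        Refs { backend }
    }

    /// Lazily yield all refs whose name satisfies `matches`.
    /// The predicate could be a glob matcher, a prefix check, etc.
    fn filtered<'a, F>(&'a self, matches: F) -> impl Iterator<Item = (String, String)> + 'a
    where
        F: Fn(&str) -> bool + 'a,
    {
        self.backend
            .iter_refs()
            .filter(move |(name, _)| matches(name.as_str()))
    }
}

/// In-memory stand-in backend, used only for this example.
struct InMemory(Vec<(String, String)>);

impl RefBackend for InMemory {
    fn iter_refs(&self) -> Box<dyn Iterator<Item = (String, String)> + '_> {
        Box::new(self.0.iter().cloned())
    }
}
```

With this shape, swapping the backend changes one `impl RefBackend` and nothing at the call sites, for example `refs.filtered(|n| n.starts_with("refs/heads/"))`.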
> I think this is a good idea. It is much simpler than you seem to think, sorry for presuming deep familiarity with how libgit2 works. So, some background:
Appreciate that. Starting from first principles here!
> Conceptually, the refdb is simply a directory tree refs/ in GIT_DIR, where the leaves (files) contain the SHA1 of a commit denoting a "branch head". Similarly, the odb is simply a directory tree objects/ in GIT_DIR which contains files whose name is their SHA1 hash, and which contain git objects (blobs, trees, commits, tags).
Super clear, thanks for deconstructing it.
> Since both would eventually slow down git operations on large repositories (due to having to traverse the filesystem all the time), git employs packing techniques. If you run git gc, you'll notice that all the "loose objects" in objects/ are gone, and replaced with "packfiles". Lesser known is that it will also collect all your refs into a packed (i.e. single-file) representation.
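I will need to dive deeper into this. For now, the questions that arise are:
- When do or would we (link) package stuff up?
- Would this package then serve as a sort of (fast) cache?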
> Regardless of whether the refs are packed or not, libgit2 will load them into memory all at once -- note that you also might have both packed and unpacked refs in your repo. So, we don't want to access the refs sidestepping libgit2, but it is fine from an efficiency perspective to not push down glob processing to C, but simply get an unconstrained iterator and do the filtering in Rust.
I see. So, to avoid inconsistency issues, we don't want to duplicate the in-memory logic that libgit2 provides, but the glob filtering on that in-memory data can (and needs to) be done on top of it.
> Eventually, we may run into performance or memory issues because there are just too many refs in our giant monorepo, at which point we may consider plugging in a "real" database, which libgit2 allows us to do (modulo Rust bindings).
That makes sense.
> This backend then can implement whatever globbing we want (it just gets a string passed to it), and decide when to page in what.
&
> So, what we want for now is simply a librad-internal wrapper around all things refs, so we use the libgit2 API only from one place. This thing can do globbing in Rust. If and when we get to the custom refdb part, we just take that globbing, and stick it into our backend, while the librad code continues to use the wrapper API.
If I am getting it right, this custom glob solution would wrap around libgit2. Say, we ask all_refs, and it would call libgit2 and filter on top of that.
Let me know your thoughts, and thanks so far!
> When do or would we (link) package stuff up?
Currently, we rely on git receive-pack to trigger GC, because libgit2 doesn't surface the compound git gc in a straightforward way. That is, unless you sometimes push something from a local working copy, no GC will be triggered. We should probably employ some repacking of the entire repo at random intervals, but I haven't yet made up my mind where this should be triggered. git gc decides for itself when repacking is actually required.
> Would this package then serve as a sort of (fast) cache?
Not directly, it is more of an optimisation of loading things from disk. Everything in git is designed with the assumption of interactive use -- there are no long-running processes to cache things, so every time one runs git, it has to access the filesystem. libgit2, however, employs ways to detect whether something has changed on-disk by another process, so the fact that it keeps things in memory is indeed a cache if the process is long-running. We don't want to re-implement this logic, so going through libgit2 is a good idea.
> Say, we ask all_refs, and it would call libgit2 and filter on top of that.
Yeah exactly -- we avoid calling libgit2 functions pertaining to refs at random places, but have our own "refs API" (which is partially already in place).
> When do or would we (link) package stuff up?
> Currently, we rely on git receive-pack to trigger GC, because libgit2 doesn't surface the compound git gc in a straightforward way. That is, unless you sometimes push something from a local working copy, no GC will be triggered. We should probably employ some repacking of the entire repo at random intervals, but I haven't yet made up my mind where this should be triggered. git gc decides for itself when repacking is actually required.
:ok_hand:
> Would this package then serve as a sort of (fast) cache?
> Not directly, it is more of an optimisation of loading things from disk. Everything in git is designed with the assumption of interactive use -- there are no long-running processes to cache things, so every time one runs git, it has to access the filesystem. libgit2, however, employs ways to detect whether something has changed on-disk by another process, so the fact that it keeps things in memory is indeed a cache if the process is long-running. We don't want to re-implement this logic, so going through libgit2 is a good idea.
Yes, that makes sense.
> Say, we ask all_refs, and it would call libgit2 and filter on top of that.
> Yeah exactly -- we avoid calling libgit2 functions pertaining to refs at random places, but have our own "refs API" (which is partially already in place).
Clear :100:
Thanks!
Going to close this since #286 fixed the issue. We can re-open if the issue rears its ugly head again.
I'd rather have this implemented, because: libgit2 treats * as ** if WM_PATHNAME is not set (and it is not set for refnames). This means that, even though #286 fixes the immediate issue, it will almost certainly pop up again once a pattern accidentally matches deeper down the hierarchy.
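To illustrate the difference being described, here is a small self-contained matcher in which `*` and `?` never cross a `/`, while `**` as a whole path component matches any number of components. This is a sketch of the intended semantics, not libgit2's or git's wildmatch implementation; the function names and the ref names in the usage note are made up, and matching is byte-based for brevity.

```rust
/// Match one path component against one pattern component.
/// `*` matches any run of bytes, `?` exactly one byte. Since this is
/// only ever called per component, neither can match a '/'.
fn match_component(pat: &[u8], text: &[u8]) -> bool {
    match pat.split_first() {
        None => text.is_empty(),
        Some((&b'*', rest)) => {
            // Try every possible length for the run matched by '*'.
            (0..=text.len()).any(|i| match_component(rest, &text[i..]))
        }
        Some((&b'?', rest)) => !text.is_empty() && match_component(rest, &text[1..]),
        Some((c, rest)) => text.first() == Some(c) && match_component(rest, &text[1..]),
    }
}

/// Match a list of path components against a list of pattern components.
/// A pattern component equal to "**" matches zero or more components.
fn match_components(pat: &[&str], name: &[&str]) -> bool {
    match pat.split_first() {
        None => name.is_empty(),
        Some((p, rest)) if *p == "**" => {
            (0..=name.len()).any(|i| match_components(rest, &name[i..]))
        }
        Some((p, rest)) => match name.split_first() {
            Some((n, tail)) => {
                match_component(p.as_bytes(), n.as_bytes()) && match_components(rest, tail)
            }
            None => false,
        },
    }
}

/// Hypothetical entry point: does `refname` match `pattern`?
fn ref_glob_match(pattern: &str, refname: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let name: Vec<&str> = refname.split('/').collect();
    match_components(&pat, &name)
}
```

With these semantics, the pattern refs/namespaces/*/rad/id matches an illustrative refs/namespaces/aaa/rad/id but not refs/namespaces/aaa/remotes/bbb/rad/id; only refs/namespaces/**/rad/id would reach the deeper ref.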
Closed via 1a7c3f48dc156c14166764b2e3f862b867cc2812 in next
Motivation
In https://github.com/radicle-dev/radicle-link/issues/250, we learned that our all_metadata, operating on the glob refs/namespaces/*/rad/id, has libgit2/git going out of the intended scope for the specified glob and including the remotes.
If the monorepo refs looks like this:
The identity hybwg5ah79w533mt8wmho4kgdkdanh5u5uri8eppcc1dkoyq4jpqxw from remotes from hwd1yrer8qg6otsca7gmxm7dzwgk49qgkqmzdjsc1bpup4x5xz1quobagkw is being included, when it shouldn't.
Consideration
Requirements
If we decide to move forward and build our custom globbing solution, we want to meet the following requirements:
- Do not load the entire set of refs into memory. :question: Do we want to lazy load them?
- Be generic enough so that it can be used at different levels of abstraction (where libgit2 is now used + with possibly refdb in the future)