Consider implementing a custom globbing

NunoAlexandre commented 4 years ago

Motivation

In https://github.com/radicle-dev/radicle-link/issues/250, we learned that our all_metadata, operating on the glob refs/namespaces/*/rad/id, has libgit2/git going out of the intended scope for the specified glob and including the remotes.

If the monorepo refs look like this:

refs
├── namespaces
│   ├── hwd1yrebpxctmiom46eyswho97qqf8tdd6yn3buqi4kqyc5es9mg4fepmmy
│   │   └── refs
│   │       └── rad
│   │           ├── id
│   │           ├── self
│   │           └── signed_refs
│   ├── hwd1yregdx577qxhw1g69osonee3cna4eq69z75dnxx4yjyppgp93qn8a4o
│   │   └── refs
│   │       └── rad
│   │           ├── id
│   │           ├── self
│   │           └── signed_refs
│   ├── hwd1yreng97ow5j644xzaxc9w5jomam5mwar6ywig3c64eozfcm3ez3a9ce
│   │   └── refs
│   │       └── rad
│   │           ├── id
│   │           ├── self
│   │           └── signed_refs
│   └── hwd1yrer8qg6otsca7gmxm7dzwgk49qgkqmzdjsc1bpup4x5xz1quobagkw
│       └── refs
│           ├── heads
│           │   ├── dev
│           │   └── master
│           ├── rad
│           │   ├── id
│           │   ├── ids
│           │   │   └── hwd1yreng97ow5j644xzaxc9w5jomam5mwar6ywig3c64eozfcm3ez3a9ce
│           │   ├── self
│           │   └── signed_refs
│           ├── remotes
│           │   └── hybwg5ah79w533mt8wmho4kgdkdanh5u5uri8eppcc1dkoyq4jpqxw
│           │       ├── heads
│           │       │   └── master
│           │       └── rad
│           │           ├── id
│           │           ├── self
│           │           └── signed_refs
│           └── tags
│               ├── v0.1.0
│               ├── v0.2.0
│               ├── v0.3.0
│               ├── v0.4.0
│               └── v0.5.0

The identity hybwg5ah79w533mt8wmho4kgdkdanh5u5uri8eppcc1dkoyq4jpqxw, found under the remotes of namespace hwd1yrer8qg6otsca7gmxm7dzwgk49qgkqmzdjsc1bpup4x5xz1quobagkw, is being included when it shouldn't be.

Consideration

It is however not crucial that we rely on libgit2 for this kind of thing -- it will load the entire set of refs into memory anyway, so we can also opt to implement our own globbing. If we do that, it would just be good to not sprinkle it all over the place, as we might eventually want to plug in a custom refdb, so being able to reuse that code one layer down would be neat.

Requirements

If we decide to move forward and build our custom globbing solution, we want to meet the following requirements:

Do not load the entire set of refs into memory :question: Do we want to lazy load them?
Be generic enough so that it can be used at different levels of abstraction (where libgit2 is now used + with possibly refdb in the future)

NunoAlexandre commented 4 years ago

@kim @FintanH I would appreciate your help formulating the requirements for this potential feature. The second point is too abstract; I would like to have something more specific to hold it up against.

kim commented 4 years ago

I think this is a good idea. It is much simpler than you seem to think, sorry for presuming deep familiarity with how libgit2 works. So, some background:

The terms refdb and odb (or "Object Database") are primarily libgit2 concepts, not git -- one of the motivations for libgit2 was to allow GitHub to plug in web-scale storage below the standard git protocol. Conceptually, the refdb is simply a directory tree refs/ in GIT_DIR, where the leaves (files) contain the SHA1 of a commit denoting a "branch head". Similarly, the odb is simply a directory tree objects/ in GIT_DIR which contains files whose name is their SHA1 hash, and which contain git objects (blobs, trees, commits, tags).

Since both would eventually slow down git operations on large repositories (due to having to traverse the filesystem all the time), git employs packing techniques. If you run git gc, you'll notice that all the "loose objects" in objects/ are gone, and replaced with "packfiles". Lesser known is that it will also collect all your refs into a packed (ie. single-file) representation.
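To make the "packed (single-file) representation" concrete, here is a minimal sketch of reading git's packed-refs text format. It is purely illustrative: the function name parse_packed_refs is made up for this example, and, as argued further down, real code should keep going through libgit2 rather than parsing this file itself.

```rust
/// Parse the text of a `packed-refs` file into (refname, object id) pairs.
///
/// Illustration only. Lines starting with `#` are header/comment lines, and
/// lines starting with `^` carry the peeled target of the annotated tag on
/// the preceding line; both kinds are skipped here.
fn parse_packed_refs(contents: &str) -> Vec<(String, String)> {
    contents
        .lines()
        .filter(|line| !line.starts_with('#') && !line.starts_with('^'))
        .filter_map(|line| {
            // A regular entry is "<object id> <refname>".
            let (oid, name) = line.split_once(' ')?;
            Some((name.to_string(), oid.to_string()))
        })
        .collect()
}
```

Given a file holding a header line, one branch, and one annotated tag followed by its `^` line, this yields two entries: the branch and the tag.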

Regardless of whether the refs are packed or not, libgit2 will load them into memory all at once -- note that you also might have both packed and unpacked refs in your repo. So, we don't want to access the refs sidestepping libgit2, but it is fine from an efficiency perspective to not push down glob processing to C, but simply get an unconstrained iterator and do the filtering in Rust.

Eventually, we may run into performance or memory issues because there are just too many refs in our giant monorepo, at which point we may consider plugging in a "real" database, which libgit2 allows us to do (modulo Rust bindings). This backend then can implement whatever globbing we want (it just gets a string passed to it), and decide when to page in what.

So, what we want for now is simply a librad-internal wrapper around all things refs, so we use the libgit2 API only from one place. This thing can do globbing in Rust. If and when we get to the custom refdb part, we just take that globbing, and stick it into our backend, while the librad code continues to use the wrapper API.
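A rough sketch of what such a wrapper could look like. Every name here (RefSource, InMemory, Refs, matches) is hypothetical and not taken from librad; the in-memory source stands in for the unconstrained libgit2 iterator so the example stays self-contained. The matcher is deliberately simple: a `*` component matches exactly one path component.

```rust
/// Hypothetical abstraction over "whatever hands us ref names": today an
/// unconstrained libgit2 iterator, later perhaps a custom refdb backend.
trait RefSource {
    fn all_refs(&self) -> Box<dyn Iterator<Item = String> + '_>;
}

/// In-memory stand-in for the real source, used only for this example.
struct InMemory(Vec<String>);

impl RefSource for InMemory {
    fn all_refs(&self) -> Box<dyn Iterator<Item = String> + '_> {
        Box::new(self.0.iter().cloned())
    }
}

/// The single place through which the rest of the code would query refs.
struct Refs<S> {
    source: S,
}

impl<S: RefSource> Refs<S> {
    /// Take every ref from the source and filter on the Rust side.
    fn glob<'a>(&'a self, pattern: &'a str) -> impl Iterator<Item = String> + 'a {
        self.source
            .all_refs()
            .filter(move |name| matches(pattern, name))
    }
}

/// Component-wise match: a `*` component matches exactly one path
/// component, any other component must be equal to its counterpart.
fn matches(pattern: &str, name: &str) -> bool {
    let mut pat = pattern.split('/');
    let mut parts = name.split('/');
    loop {
        match (pat.next(), parts.next()) {
            (None, None) => return true,
            (Some(p), Some(n)) if p == "*" || p == n => continue,
            _ => return false,
        }
    }
}
```

With this shape, a call such as refs.glob("refs/namespaces/*/refs/rad/id") only yields the rad/id ref directly under each namespace. Swapping in a custom refdb later would mean providing another RefSource (or moving matches into the backend) without touching the callers.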

kim commented 4 years ago

One thing to add is that we don't have to stay compatible with git wildmatch -- we could even support PCRE if we wanted (although that would probably always be slow).

NunoAlexandre commented 4 years ago

I think this is a good idea. It is much simpler than you seem to think, sorry for presuming deep familiarity with how libgit2 works. So, some background:

Appreciate that. Starting from first principles here!

Conceptually, the refdb is simply a directory tree refs/ in GIT_DIR, where the leaves (files) contain the SHA1 of a commit denoting a "branch head". Similarly, the odb is simply a directory tree objects/ in GIT_DIR which contains files whose name is their SHA1 hash, and which contain git objects (blobs, trees, commits, tags).

Super clear, thanks for deconstructing it.

Since both would eventually slow down git operations on large repositories (due to having to traverse the filesystem all the time), git employs packing techniques. If you run git gc, you'll notice that all the "loose objects" in objects/ are gone, and replaced with "packfiles". Lesser known is that it will also collect all your refs into a packed (ie. single-file) representation.

I will need to dive deeper into this. For now, the questions that arise are:

When do or would we (link) package stuff up? Would it be at a given size threshold, periodically, or something else?
Would this package then serve as a sort of (fast) cache? How does it handle changes in the actual raw monorepo (say, a project is updated)?

Regardless of whether the refs are packed or not, libgit2 will load them into memory all at once -- note that you also might have both packed and unpacked refs in your repo. So, we don't want to access the refs sidestepping libgit2, but it is fine from an efficiency perspective to not push down glob processing to C, but simply get an unconstrained iterator and do the filtering in Rust.

I see. So we don't want to duplicate this in-memory logic that libgit2 provides to avoid inconsistency issues, but the glob filtering on that in-memory data can (and needs to) be done on top of it.

Eventually, we may run into performance or memory issues because there are just too many refs in our giant monorepo, at which point we may consider plugging in a "real" database, which libgit2 allows us to do (modulo Rust bindings).

That makes sense.

This backend then can implement whatever globbing we want (it just gets a string passed to it), and decide when to page in what.

&

So, what we want for now is simply a librad-internal wrapper around all things refs, so we use the libgit2 API only from one place. This thing can do globbing in Rust. If and when we get to the custom refdb part, we just take that globbing, and stick it into our backend, while the librad code continues to use the wrapper API.

If I am getting it right, this custom glob solution would wrap around libgit2. Say, we ask all_refs, and it would call libgit2 and filter on top of that.

Let me have your thoughts and thanks so far!

kim commented 4 years ago

When do or would we (link) package stuff up?

Currently, we rely on git receive-pack to trigger GC, because libgit2 doesn't surface the compound git gc in a straightforward way. That is, unless you sometimes push something from a local working copy, no GC will be triggered. We should probably employ some repacking of the entire repo at random intervals, but I haven't yet made up my mind where this should be triggered. git gc decides for itself when repacking is actually required.

Would this package then serve as a sort of (fast) cache?

Not directly, it is more of an optimisation of loading things from disk. Everything in git is designed with the assumption of interactive use -- there are no long-running processes to cache things, so every time one runs git, it has to access the filesystem. libgit2, however, employs ways to detect whether something has changed on-disk by another process, so the fact that it keeps things in memory is indeed a cache if the process is long-running. We don't want to re-implement this logic, so going through libgit2 is a good idea.

Say, we ask all_refs, and it would call libgit2 and filter on top of that.

Yeah exactly -- we avoid calling libgit2 functions pertaining to refs at random places, but have our own "refs API" (which is partially already in place).

NunoAlexandre commented 4 years ago

When do or would we (link) package stuff up?

Currently, we rely on git receive-pack to trigger GC, because libgit2 doesn't surface the compound git gc in a straightforward way. That is, unless you sometimes push something from a local working copy, no GC will be triggered. We should probably employ some repacking of the entire repo at random intervals, but I haven't yet made up my mind where this should be triggered. git gc decides for itself when repacking is actually required.

:ok_hand:

Would this package then serve as a sort of (fast) cache?

Not directly, it is more of an optimisation of loading things from disk. Everything in git is designed with the assumption of interactive use -- there are no long-running processes to cache things, so every time one runs git, it has to access the filesystem. libgit2, however, employs ways to detect whether something has changed on-disk by another process, so the fact that it keeps things in memory is indeed a cache if the process is long-running. We don't want to re-implement this logic, so going through libgit2 is a good idea.

Yes, that makes sense.

Say, we ask all_refs, and it would call libgit2 and filter on top of that.

Yeah exactly -- we avoid calling libgit2 functions pertaining to refs at random places, but have our own "refs API" (which is partially already in place).

Clear :100:

Thanks!

FintanH commented 4 years ago

Going to close this since #286 fixed the issue. We can re-open if the issue rears its ugly head again.

kim commented 4 years ago

I'd rather have this implemented, because: libgit2 treats * as ** if WM_PATHNAME is not set (and it is not set for refnames). This means that, even though #286 fixes the immediate issue, it will almost certainly pop up again once a pattern accidentally matches deeper down the hierarchy.
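To illustrate the difference, here is a toy matcher supporting only `*`, with a flag that mimics the two behaviours. This is a sketch for discussion, not libgit2's wildmatch implementation; the function name and the ns1/peer1 ref names are made up.

```rust
/// Toy matcher supporting only `*`.
///
/// With `pathname == false`, `*` may also match `/`, which is the behaviour
/// described above for refnames. With `pathname == true`, `*` stops at `/`.
fn wildmatch(pat: &[u8], text: &[u8], pathname: bool) -> bool {
    match pat.split_first() {
        // Pattern exhausted: only a match if the text is exhausted too.
        None => text.is_empty(),
        Some((&b'*', rest)) => {
            for i in 0..=text.len() {
                // Let `*` cover text[..i] and try to match the remainder.
                if wildmatch(rest, &text[i..], pathname) {
                    return true;
                }
                // Growing `*` further would swallow text[i]; in pathname
                // mode it must not cross a '/'.
                if pathname && i < text.len() && text[i] == b'/' {
                    return false;
                }
            }
            false
        }
        // Literal byte: must equal the next byte of the text.
        Some((c, rest)) => text.first() == Some(c) && wildmatch(rest, &text[1..], pathname),
    }
}
```

With pathname set to false, the pattern refs/namespaces/*/rad/id matches both refs/namespaces/ns1/refs/rad/id and refs/namespaces/ns1/refs/remotes/peer1/rad/id, because `*` is free to swallow ns1/refs/remotes/peer1. With pathname set to true, a pattern has to spell out each level, so refs/namespaces/*/refs/rad/id matches the first ref but not the one under remotes.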

kim commented 3 years ago

Closed via 1a7c3f48dc156c14166764b2e3f862b867cc2812 in next
