Closed. Byron closed this 2 years ago.
An interesting read, even though the details of who signs what are still elusive to me.
You may want to look at the identities section in the spec [0] to understand how we reify the notion of a "repository" in a fully distributed setting.
In this document, the significant file is anchored at rad/signed_refs. You can think of this as an ls-refs with a signature made by the origin server -- because it is not necessarily the origin server where a peer fetches a set of refs from, the authentication must travel with the data in a peer-to-peer network. Incidentally, this is similar to [1], except that we don't employ nonces.
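For illustration only, the "authentication travels with the data" idea could be sketched as follows. All names here are mine, and a plain checksum stands in for the real public-key signature radicle-link would use -- the point is only that the signature covers a canonical serialization of the ref listing, so any receiving peer can verify it no matter which peer served the fetch.

```rust
use std::collections::BTreeMap;

// Stand-in for a real signature: a simple FNV-1a checksum. In the real
// protocol the origin would sign this payload with its device key.
fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

// Canonical serialization of a ref listing, similar in spirit to ls-refs:
// one "<oid> <refname>" line per ref, sorted by refname (BTreeMap order).
fn serialize(refs: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = Vec::new();
    for (name, oid) in refs {
        out.extend_from_slice(format!("{} {}\n", oid, name).as_bytes());
    }
    out
}

// "Sign" at the origin...
fn sign(refs: &BTreeMap<String, String>) -> u64 {
    checksum(&serialize(refs))
}

// ...and verify at any receiving peer, independent of who served the data.
fn verify(refs: &BTreeMap<String, String>, sig: u64) -> bool {
    checksum(&serialize(refs)) == sig
}

fn main() {
    let mut refs = BTreeMap::new();
    refs.insert("refs/heads/main".to_string(), "0123abcd".to_string());
    let sig = sign(&refs);
    assert!(verify(&refs, sig));
    // Tampering by an intermediate peer invalidates the signature.
    refs.insert("refs/heads/main".to_string(), "deadbeef".to_string());
    assert!(!verify(&refs, sig));
}
```

The canonical, sorted serialization is the important part: both signer and verifier must produce byte-identical payloads for the check to be meaningful.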
The git-ref crate and its loose file db make setting symbolic refs transactional (which in canonical git it is not).
Yes it's already helpful :)
The biggest issue to watch out for might be memory usage. Assuming worst-case scenarios with thousands of concurrent clones of the linux kernel repository, currently each one would use ~600MB just to keep the object list data. For this I could imagine a best-guess accounting for available memory along with a way of queuing up clones to not over-allocate such transient memory.
I would actually be quite keen on being able to utilise the packfile-uris feature, which I think is most useful when using multi-pack indices (ie. not compacted into one giant packfile). This would mean that we could heavily optimise clones and large fetches by simply sendfile(2)-ing pre-built packs. That may send more data than strictly required, but with much lower latency -- the receiving end can run git maintenance then.
PS: iff is not a typo, but short for "if and only if". Perhaps it should've been set in italics :)
You may want to look at the identities section in the spec [0] to understand how we reify the notion of a "repository" in a fully distributed setting.
I must if I ever want to understand these important details, and one day I will. It's on the reading list :).
Yes it's already helpful :)
🎉 Feels good to hear that, thanks 😊!
I would actually be quite keen on being able to utilise the packfile-uris feature, which I think is most useful when using multi-pack indices (ie. not compacted into one giant packfile). This would mean that we could heavily optimise clones and large fetches by simply sendfile(2)-ing pre-built packs. That may send more data than strictly required, but with much lower latency -- the receiving end can run git maintenance then.
That's interesting, as I can't see (thanks to my limited multi-pack knowledge) how these pack files would be kept small. Maybe this would mean that git packs each object island separately; only then could one avoid sending huge mono-repo pack files in sendfile mode.
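As a rough sketch of the serving side of that idea: the function name below is mine, and `io::copy` is a portable stand-in for the zero-copy `sendfile(2)` path a real server would take -- the pre-built pack is streamed to the client as-is, with no 'counting objects' work at all.

```rust
use std::fs::File;
use std::io::{self, Write};

// Stream a pre-built packfile straight to the client. No object graph
// traversal, no delta recompression -- just bytes off the disk. On Linux,
// a server could replace io::copy with sendfile(2) to avoid copying the
// data through userspace entirely.
fn serve_prebuilt_pack<W: Write>(pack_path: &str, client: &mut W) -> io::Result<u64> {
    let mut pack = File::open(pack_path)?;
    io::copy(&mut pack, client)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("prebuilt-demo.pack");
    std::fs::write(&path, b"PACK....demo")?;
    let mut out = Vec::new();
    serve_prebuilt_pack(path.to_str().unwrap(), &mut out)?;
    assert_eq!(out, b"PACK....demo".to_vec());
    Ok(())
}
```

With packfile-uris, the URI-referenced pack is fetched out of band like this, and the inline pack only carries whatever objects the pre-built packs don't cover.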
PS: iff is not a typo, but short for "if and only if". Perhaps it should've been set in italics :)
Hehe, alright. I have adjusted the PR if there is interest. Otherwise I am happy to close it, understanding that the target audience will get that.
An interesting read, even though the details of who signs what are still elusive to me. It's probably just a matter of knowing what the terms mean exactly and diving in more. Fortunately, I could follow the git-related topics well and have a few notes to share:
The git-ref crate and its loose file db make setting symbolic refs transactional (which in canonical git it is not). This might be particularly useful here. I plan to offer the same transaction-based API with symlink support for the reftable implementation. After having read this, I also understand even better why @kim keeps talking about it - it's definitely required to scale beyond a certain point. My guess is that reaching that point will take a while, as looking up refs with a mem-mapped packed-refs file is fast (binary search) and probably doesn't cost many IOPs on average. Right now, it tries to read a loose refs file, then looks up a possibly pre-loaded and mem-mapped packed-refs buffer. By default all changes go into loose ref files, and regular repacks will certainly help to keep this fast. This is probably where issues occur, as busy monorepos will constantly have changes in their loose refs, possibly causing permanent failures when trying to lock a lot of references for packing. In that regard I would assume reftable is much better.

I agree, there is no issue in using refs that way.
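The lock-then-rename idea behind such a transactional symref update can be sketched like this (a rough sketch only; the function name is mine, not gix-ref's actual API):

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Transactional symbolic-ref update via the classic lock-file dance:
// the new value is written to "<ref>.lock" and then atomically renamed
// into place, so readers never observe a half-written HEAD.
fn set_symbolic_ref(git_dir: &Path, name: &str, target: &str) -> io::Result<()> {
    let ref_path = git_dir.join(name);
    let lock_path = git_dir.join(format!("{}.lock", name));
    {
        let mut lock = fs::File::create(&lock_path)?;
        writeln!(lock, "ref: {}", target)?;
        lock.sync_all()?; // flush fully before the commit point
    }
    // The rename is the commit point: it either fully succeeds or not at all,
    // and a concurrent writer holding the lock file blocks competing updates.
    fs::rename(&lock_path, &ref_path)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("symref-demo");
    fs::create_dir_all(&dir)?;
    set_symbolic_ref(&dir, "HEAD", "refs/heads/main")?;
    assert_eq!(fs::read_to_string(dir.join("HEAD"))?.trim(), "ref: refs/heads/main");
    Ok(())
}
```

The failure mode mentioned above falls out of the same mechanism: packing loose refs must take many such locks at once, and on a busy monorepo some of them will always be contended.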
I find the "a giant monorepo on the server side" approach very intriguing, as it will require an efficient ODB implementation and properly implemented ODB GC from the start, also to assure consistency. Fortunately, with a 'collaborative set of server processes and a gatekeeper daemon' I see no issue offering a consistent view on the ODB and RefDB even when facing continuous writes alongside even more reads, with the server managing repacks automatically and smartly. Once gitoxide implements the various caches that exist on the server side, it should be possible to reduce CPU usage significantly during clones, which are currently dominated by the 'counting objects' phase.

The biggest issue to watch out for might be memory usage. Assuming worst-case scenarios with thousands of concurrent clones of the linux kernel repository, currently each one would use ~600MB just to keep the object list data. For this I could imagine a best-guess accounting for available memory along with a way of queuing up clones to not over-allocate such transient memory.
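That queuing idea could be sketched roughly like so. Names and numbers are illustrative only, not an actual gitoxide design: each clone reserves its estimated object-list footprint up front against a shared budget, and blocks until enough budget is free instead of over-allocating.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Best-guess memory accounting for concurrent clones: a counting budget
// built from Mutex + Condvar. Reservations that don't fit queue up until
// earlier clones release their transient memory.
struct MemoryBudget {
    available: Mutex<u64>,
    freed: Condvar,
}

impl MemoryBudget {
    fn new(total_bytes: u64) -> Arc<Self> {
        Arc::new(Self {
            available: Mutex::new(total_bytes),
            freed: Condvar::new(),
        })
    }

    // Block until `bytes` can be reserved without exceeding the budget.
    fn reserve(&self, bytes: u64) {
        let mut avail = self.available.lock().unwrap();
        while *avail < bytes {
            avail = self.freed.wait(avail).unwrap();
        }
        *avail -= bytes;
    }

    // Return a reservation and wake queued clones.
    fn release(&self, bytes: u64) {
        *self.available.lock().unwrap() += bytes;
        self.freed.notify_all();
    }
}

fn main() {
    const CLONE_ESTIMATE: u64 = 600 * 1024 * 1024; // ~600MB per linux.git clone
    let budget = MemoryBudget::new(2 * CLONE_ESTIMATE);
    budget.reserve(CLONE_ESTIMATE); // first clone proceeds
    budget.reserve(CLONE_ESTIMATE); // second clone proceeds; a third would queue
    budget.release(CLONE_ESTIMATE);
}
```

In a real server the estimate would itself be a best guess (e.g. derived from the ref advertisement or past clones), which is why some head-room and a queue, rather than hard admission control, seems like the right shape.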
Pack decoding and encoding performance, to my mind, is held back by zlib. Even though gitoxide already uses zlib-ng in its fast builds, it's still 'just' zlib. A possible optimization would be to introduce server-side-only versions of the packs that use a different compression algorithm, which could easily be 10x faster at similar compression ratios, both for compression and decompression.

Also note that currently gitoxide manages to achieve clone speeds of around 650MB/s on a MacBook Air (M1), so it's easy to imagine a server saturating its 10Gb NIC even with the current pack format.

CC: @joshtriplett
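If such a server-side-only pack format were attempted, one way to cut the seam might look like this (trait and names are mine, not gitoxide's API): the codec behind server-local pack storage becomes pluggable, while the wire format sent to clients stays zlib-based.

```rust
use std::io;

// Hypothetical seam for server-side-only pack compression. A stored pack
// would record which codec it uses; the server re-encodes to zlib only at
// the network boundary, where the standard pack format is required.
trait PackCodec {
    fn compress(&self, data: &[u8]) -> Vec<u8>;
    fn decompress(&self, data: &[u8]) -> io::Result<Vec<u8>>;
}

// Identity codec as a placeholder where zlib-ng, or a faster algorithm
// with similar ratios, would plug in.
struct Stored;

impl PackCodec for Stored {
    fn compress(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
    fn decompress(&self, data: &[u8]) -> io::Result<Vec<u8>> {
        Ok(data.to_vec())
    }
}

fn main() {
    let codec = Stored;
    let stored = codec.compress(b"object data");
    assert_eq!(codec.decompress(&stored).unwrap(), b"object data".to_vec());
}
```

The trade-off is the re-encode cost at the boundary, which is why this only pays off if the server-side codec's speed advantage dominates, as the 10x figure above suggests it might.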