radicle-dev / radicle-link

The second iteration of the Radicle code collaboration protocol.

Spell out 'iff' in rfc0001 #774

Closed. Byron closed this issue 2 years ago.

Byron commented 2 years ago

An interesting read, even though the details of who signs what are still elusive to me. It's probably just a matter of knowing what the terms mean exactly and diving in more. Fortunately, I could follow the git-related topics well and have a few notes to share:

The use of symrefs below the refs hierarchy is somewhat unorthodox. As symrefs were invented to replace actual filesystem symbolic links (which are not entirely portable), it seems unlikely they would eventually stop working. If they did, we could still revert to symlinks again, and accept that this may limit platform choice for users.

The git-ref crate and its loose-file db make setting symbolic refs transactional (which it is not in canonical git). This might be particularly useful here. I plan to offer the same transaction-based API, with symbolic-ref support, for the reftable implementation. After having read this, I also understand even better why @kim keeps talking about it - it's definitely required to scale beyond a certain point. My guess is that reaching that point will take a while, as looking up refs in a mem-mapped packed-refs file is fast (binary search) and probably doesn't cost many IOPS on average. Right now, it tries to read a loose ref file first, then looks the name up in a possibly pre-loaded and mem-mapped packed-refs buffer (see the sketch below). By default all changes go into loose ref files, and regular repacks will certainly help to keep this fast. This is probably where issues occur, as busy monorepos will constantly have changes in their loose refs, possibly causing persistent failures when trying to lock a lot of references for packing. In that regard I would assume reftable is much better.
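
For readers who haven't looked at that code path, here is a deliberately simplified sketch of that default lookup order (loose file first, then packed-refs). It skips symref chasing, locking, and the mem-mapped binary search, and simply scans packed-refs linearly:

```rust
use std::fs;
use std::path::Path;

/// Resolve a ref name to an object id the way git does by default:
/// try the loose file first, then fall back to the packed-refs file.
fn resolve_ref(git_dir: &Path, name: &str) -> Option<String> {
    // 1. Loose ref: e.g. `.git/refs/heads/main`, one hex oid per file.
    if let Ok(contents) = fs::read_to_string(git_dir.join(name)) {
        return Some(contents.trim().to_string());
    }
    // 2. Packed refs: lines of "<oid> <refname>"; '#' header and '^' peel
    //    lines are skipped. Real git mem-maps this and binary-searches it.
    let packed = fs::read_to_string(git_dir.join("packed-refs")).ok()?;
    for line in packed.lines() {
        if line.starts_with('#') || line.starts_with('^') {
            continue;
        }
        if let Some((oid, refname)) = line.split_once(' ') {
            if refname == name {
                return Some(oid.to_string());
            }
        }
    }
    None
}

fn main() {
    // Assumes the current directory is a repository's .git directory.
    if let Some(oid) = resolve_ref(Path::new("."), "refs/heads/main") {
        println!("refs/heads/main -> {oid}");
    }
}
```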

The refs/rad category is obviously also not entirely kosher, but since there are no hints in the git source code that refs/namespaces is treated specially, there is no reason to believe this would suddenly stop working. If it did, the only thing that would get more involved is the working copy branch mapping (which is managed).

I agree, there is no issue in using refs that way.

Lastly, with git being very much IO-bound, there are limits to (ab)using it as a giant monorepo.

I find the "giant monorepo on the server side" approach very intriguing, as it will require an efficient ODB implementation and properly implemented ODB GC from the start, also to ensure consistency. Fortunately, with a 'collaborative set of server processes and a gatekeeper daemon' I see no issue offering a consistent view of the ODB and RefDB even when facing continuous writes alongside even more reads, with the server managing repacks automatically and smartly. Once gitoxide implements the various caches that exist on the server side, it should be possible to reduce CPU usage significantly during clones, which are currently dominated by the 'counting objects' phase.

The biggest issue to watch out for might be memory usage. Assuming worst-case scenarios with thousands of concurrent clones of the linux kernel repository, currently each one would use ~600MB just to keep the object list data. For this I could imagine a best-guess accounting for available memory along with a way of queuing up clones to not over-allocate such transient memory.
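
A minimal sketch of such queuing, assuming a tokio-based server and treating the ~600MB figure as a fixed per-clone estimate (the budget, estimate and task structure are made up for illustration):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Transient memory allowed for in-flight clones, in MiB (hypothetical budget).
const CLONE_MEMORY_BUDGET_MIB: u32 = 8 * 1024;
/// Rough per-clone cost of the object-list data, in MiB (the ~600MB above).
const PER_CLONE_ESTIMATE_MIB: u32 = 600;

async fn serve_clone(budget: Arc<Semaphore>) {
    // A clone waits here until enough of the memory budget is free,
    // instead of over-allocating transient memory under load.
    let _permit = budget
        .acquire_many_owned(PER_CLONE_ESTIMATE_MIB)
        .await
        .expect("semaphore is never closed");

    // ... enumerate objects and stream the pack to the client ...
    // The permits are returned when `_permit` is dropped.
}

#[tokio::main]
async fn main() {
    let budget = Arc::new(Semaphore::new(CLONE_MEMORY_BUDGET_MIB as usize));
    // One task per incoming clone request (the requests themselves are elided).
    let handles: Vec<_> = (0..32)
        .map(|_| tokio::spawn(serve_clone(budget.clone())))
        .collect();
    for handle in handles {
        handle.await.unwrap();
    }
}
```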

Pack decoding and encoding performance, to my mind, is held back by zlib. Even though gitoxide already uses zlib-ng in its fast builds, it's still 'just' zlib. A possible optimization would be to introduce server-side-only versions of the packs that use a different compression algorithm, which could easily be 10x faster at similar compression ratios both for compression and decompression.
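
A toy way to sanity-check that claim, here with zstd as one candidate and assuming the flate2 and zstd crates (not anything gitoxide does today; real pack encoding compresses many small objects rather than one large buffer):

```rust
use std::io::Write;
use std::time::Instant;

use flate2::{write::ZlibEncoder, Compression};

fn main() -> std::io::Result<()> {
    // Compressible stand-in data; in reality this would be object payloads.
    let data: Vec<u8> = b"the quick brown fox jumps over the lazy dog\n"
        .iter()
        .cycle()
        .take(32 * 1024 * 1024)
        .copied()
        .collect();

    let start = Instant::now();
    let mut zlib = ZlibEncoder::new(Vec::new(), Compression::default());
    zlib.write_all(&data)?;
    let zlib_out = zlib.finish()?;
    println!("zlib: {} bytes in {:?}", zlib_out.len(), start.elapsed());

    let start = Instant::now();
    let zstd_out = zstd::encode_all(&data[..], 3)?;
    println!("zstd: {} bytes in {:?}", zstd_out.len(), start.elapsed());

    Ok(())
}
```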

Also note that gitoxide currently manages to achieve clone speeds of around 650 MB/s on a MacBook Air (M1), so it's easy to imagine a server saturating its 10Gb NIC even with the current pack format: 650 MB/s is roughly 5.2 Gb/s, so two such streams already exceed a 10Gb link.

CC: @joshtriplett

kim commented 2 years ago

An interesting read, even though the details of who signs what are still elusive to me.

You may want to look at the identities section in the spec [0] to understand how we reify the notion of a "repository" in a fully distributed setting.

In this document, the significant file is anchored at rad/signed_refs. You can think of this as a ls-refs with a signature made by the origin server -- because it is not necessarily the origin server where a peer fetches a set of refs from, the authentication must travel with the data in a peer-to-peer network. Incidentally, this is similar to [1], except that we don't employ nonces.
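
To make that concrete, here is a purely illustrative, library-style sketch -- the field names and the canonical encoding are not the actual rad/signed_refs format, and the signing primitive is deliberately left abstract:

```rust
/// One entry of an ls-refs-like listing: (refname, object id as hex).
type RefEntry = (String, String);

/// Whatever signs and verifies byte strings on behalf of a peer
/// (abstracted here; in practice the peer's device key).
trait Signer {
    fn sign(&self, msg: &[u8]) -> Vec<u8>;
    fn verify(&self, msg: &[u8], sig: &[u8]) -> bool;
}

/// A ref listing whose authenticity travels with the data, so it can be
/// fetched from any peer, not only from the origin.
struct SignedRefs {
    refs: Vec<RefEntry>,
    signature: Vec<u8>,
}

/// A canonical byte encoding of the listing -- signer and verifier must
/// agree on it exactly.
fn canonical(refs: &[RefEntry]) -> Vec<u8> {
    let mut out = Vec::new();
    for (name, oid) in refs {
        out.extend_from_slice(oid.as_bytes());
        out.push(b' ');
        out.extend_from_slice(name.as_bytes());
        out.push(b'\n');
    }
    out
}

impl SignedRefs {
    fn new(refs: Vec<RefEntry>, origin_key: &impl Signer) -> Self {
        let signature = origin_key.sign(&canonical(&refs));
        SignedRefs { refs, signature }
    }

    /// Verification only needs the origin's public key, not a trusted
    /// transport -- the relaying peer can be entirely untrusted.
    fn verify(&self, origin_key: &impl Signer) -> bool {
        origin_key.verify(&canonical(&self.refs), &self.signature)
    }
}
```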

The git-ref crate and its loose-file db make setting symbolic refs transactional (which it is not in canonical git).

Yes it's already helpful :)

The biggest issue to watch out for might be memory usage. Assuming worst-case scenarios with thousands of concurrent clones of the linux kernel repository, currently each one would use ~600MB just to keep the object list data. For this I could imagine a best-guess accounting for available memory along with a way of queuing up clones to not over-allocate such transient memory.

I would actually be quite keen on being able to utilise the packfile-uris feature, which I think is most useful when using multi-pack indices (i.e. not compacting into one giant packfile). This would mean that we could heavily optimise clones and large fetches by simply sendfile(2)-ing pre-built packs. That may send more data than strictly required, but with much lower latency -- the receiving end can run git maintenance afterwards.
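
As a rough, Linux-only illustration of the sendfile(2) part (assumes the libc crate; the pack path and the transport are stand-ins, and the packfile-uris negotiation around it is omitted):

```rust
use std::fs::File;
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

/// Stream a pre-built pack to a connected socket without copying its bytes
/// through userspace: the kernel moves file pages straight to the socket.
fn send_pack(sock: &TcpStream, pack: &File, len: usize) -> std::io::Result<()> {
    let mut offset: libc::off_t = 0;
    let mut remaining = len;
    while remaining > 0 {
        // sendfile advances `offset` and returns the number of bytes written.
        let n = unsafe {
            libc::sendfile(sock.as_raw_fd(), pack.as_raw_fd(), &mut offset, remaining)
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if n == 0 {
            break; // peer went away or the file ended early
        }
        remaining -= n as usize;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let pack = File::open("objects/pack/pack-1234.pack")?; // hypothetical pre-built pack
    let len = pack.metadata()?.len() as usize;
    let sock = TcpStream::connect("127.0.0.1:9418")?; // stand-in for the real transport
    send_pack(&sock, &pack, len)
}
```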

PS: iff is not a typo, but short for "if and only if". Perhaps it should've been set in italics :)

Byron commented 2 years ago

You may want to look at the identities section in the spec [0] to understand how we reify the notion of a "repository" in a fully distributed setting.

I must if I ever want to understand these important details, and one day I will. It's on the reading list :).

Yes it's already helpful :)

🎉 Feels good to hear that, thanks 😊!

I would actually be quite keen on being able to utilise the packfile-uris feature, which I think is most useful when using multi-pack indices (i.e. not compacting into one giant packfile). This would mean that we could heavily optimise clones and large fetches by simply sendfile(2)-ing pre-built packs. That may send more data than strictly required, but with much lower latency -- the receiving end can run git maintenance afterwards.

That's interesting, as I can't see (thanks to my limited multi-pack knowledge) how these pack files would be kept small. This would mean that git perhaps packs each object island separately; only then could one avoid sending huge mono-repo pack files in sendfile mode.

PS: iff is not a typo, but short for "if and only if". Perhaps it should've been set in italics :)

Hehe, alright. I have adjusted the PR in case there is interest. Otherwise I am happy to close it, understanding that the target audience will get it.