
nix package manager integration #51

Open jbenet opened 8 years ago

jbenet commented 8 years ago

ipfs and nix are well suited for each other. Discussed possibilities with @ehmry. Let's use this issue to figure out what to do. To start us off, some things that come to mind:

ehmry commented 8 years ago

Item two is pretty easy to get done. I have a simple skeleton that is waiting for ipget to drop into place; then there will be a shell utility that takes a traditional URL and spits out the fetchipfs { ... } code to cut and paste into a package specification.

Number three is a departure from how nix does things, but I think it can coexist fine with the build farm system they have now. Nix packages come in the form of a file or directory named xxx-name, where xxx is the hash of the package inputs and name is the package name; pmpc0i6i2kqvarjmbjwaxlb4h801jzfy-go-1.4.2 is an example. It is important to realise that this hash prefix does not identify the package content. It serves to create a unique name for a package based on the package inputs, its dependencies and build environment, so that different versions of a package with different dependencies can be installed side-by-side. There is another reason why it is not a content hash that I will get to later.

I think that item three should be a matter of mapping package content to these package names, identifying these inputs with outputs. I don't know any of the implementation details, but this is what the nix build farm does. What we can do is map package outputs to the package input names using IPNS (correct me if I'm wrong on any of this). I think the process would be like this:

Alice

Bob

You may notice here that the package is duplicated at both Alice and Bob, because it must exist at /nix/store and also be present at some point in an IPFS repo. Nix requires that a package always be located at /nix/store because of the strictly versioned dependencies. If IPFS gets very strong file system integration, then perhaps nix packages could be located under /ipns; /ipfs will not work because packages often must know their final location, and there is no way to update an ipfs object with its own location without implicitly changing that location. To me the most expedient solution to the duplication would be to archive and compress packages before adding them to IPFS, and then leave them unpinned.
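To make that workflow concrete, here is a minimal sketch in Rust with a plain HashMap standing in for an IPNS namespace; all names and the example CID are illustrative assumptions, not real APIs.

// Sketch: Alice publishes a map from input-hash package names to the CIDs of
// her built outputs; Bob resolves her namespace and substitutes instead of
// building. A HashMap stands in for IPNS here; nothing below is a real API.
use std::collections::HashMap;

type InputName = String; // the "xxx-name" component, e.g. the go-1.4.2 example above
type Cid = String;       // content address of the archived, compressed output

struct PublishedOutputs {
    by_input_name: HashMap<InputName, Cid>,
}

impl PublishedOutputs {
    fn lookup(&self, input_name: &str) -> Option<&Cid> {
        self.by_input_name.get(input_name)
    }
}

fn main() {
    // Alice's side: publish the mapping under her IPNS key (simulated).
    let alice = PublishedOutputs {
        by_input_name: HashMap::from([(
            "pmpc0i6i2kqvarjmbjwaxlb4h801jzfy-go-1.4.2".to_string(),
            "QmExampleCidOfArchivedOutput".to_string(), // hypothetical CID
        )]),
    };
    // Bob's side: resolve Alice's namespace and fetch instead of building.
    assert!(alice.lookup("pmpc0i6i2kqvarjmbjwaxlb4h801jzfy-go-1.4.2").is_some());
}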

davidar commented 8 years ago

:+1:

Also see https://botbot.me/freenode/ipfs/msg/53212515/

jbenet commented 8 years ago

btw, a basic ipget exists now: https://github.com/noffle/ipget thanks to @noffle

hackergrrl commented 8 years ago

I'm excited to see nix and ipfs get along well -- let me know if there are any ipget goodies you need!

CMCDragonkai commented 8 years ago

Just wanted to add:

We also need an IPFS source cache, not just a binary cache. It would be an immutable source mirror for all of our dependencies, allowing reproducible builds from source (which is sometimes required, as not all compiler flags are specified by default). I've had tons of problems building from source in Nix because upstream source URLs break.

ehmry commented 8 years ago

It probably looks like I've given up on this, but I've been working on an alternate content-addressed store below Nix (and Guix, I believe), and I will be attempting to build nixpkgs against it. If this works, then the next steps would be to move to a multihash scheme and then try to figure out if it's possible to push the store objects into the DHT without a second storage representation. Or maybe I've forgotten how the block chunking works, idk.

I can't give any estimates on time, there are a lot of things for me to do in between to make it work on my side.

rrnewton commented 7 years ago

@ehmry - was the alternate content-addressable store created because something about IPFS/IPNS didn't work out?

Also, about the trust model above -- is the idea that in practice we would trust a central Nix site to resolve a <hash>-package key into a content hash of build outputs?

This is discussed a bit in that #ipfs IRC log. "pierron" mentions that the build farm manifest includes signed content hashes. It sounds like the plan is to use the build farm as the central authority. One additional point is that since Nix isn't enforcing determinism currently, multiple observers producing (key, content-hash) pairs will come up with different hashes. That seems to also be an argument for a central authority (sample the nondeterminism once, consistently).

UPDATE: this point is addressed in more detail here: https://github.com/NixOS/nix/issues/859.

ehmry commented 7 years ago

@rrnewton My motivation for an alternative store was one without a database, where paths would be self-verifying and garbage collection would use something like Bloom filters, but that project is on hold.

vcunat commented 7 years ago

Better late than later. Yes, I'd certainly start primarily with one central authority, which is what we de-facto have now. But it's only a matter of mapping .drv hash + authority key -> output hashes, and then people can simply trust any authorities they want.

davidak commented 6 years ago

We have the second item implemented: https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support/fetchipfs

@mguentner implemented a mirror, but encountered serious performance/traffic issues and concluded that IPFS is "not usable for production" for this use case.

Discussion in: https://github.com/NixOS/nix/issues/859#issuecomment-386808776

dvc94ch commented 5 years ago

There is a fundamental problem with mapping the nix store to ipfs. Assume we mount a performant ipfs fuse implementation at /ipfs, we change nixStore to /ipfs, and store path == cid. Nix mounts a tmp filesystem at /ipfs/{randomCid} and builds the package with out=/ipfs/{randomCid}. After the build, nix computes h = hash(replace(readPath("/ipfs/{randomCid}"), randomCid, 0)) and then writes writePath("/ipfs/{cid(h)}", replace(readPath("/ipfs/{randomCid}"), randomCid, cid(h))). This is a problem for any multiblock data structure in ipfs containing self references. Because of this, the only solution would be to keep the nix and ipfs stores separate, causing every package to be present both packed and unpacked and decreasing the incentive not to leech, since why keep files around that you don't need?
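To make the self-reference problem concrete, here is a minimal illustrative sketch in plain Rust; the hash function, path names and zeroing step are stand-ins for Nix/IPFS behaviour, not real APIs.

// Sketch of the self-reference problem: the bytes Nix finally writes contain
// the name derived from the *zeroed* bytes, so their real content address can
// never equal that name. DefaultHasher stands in for a CID; nothing here is a
// real Nix or IPFS API.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn cid(bytes: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    format!("{:016x}", h.finish())
}

fn main() {
    let random_cid = "random-placeholder-cid";
    // Pretend this is the build output at /ipfs/{randomCid}; it embeds its own
    // store path, i.e. a self reference.
    let output = format!("#!/bin/sh\nexec /ipfs/{}/bin/tool \"$@\"\n", random_cid);

    // Nix's trick: hash the output with the self reference replaced by 0...
    let h = cid(output.replace(random_cid, "0").as_bytes());
    // ...then rewrite the self reference to that final name.
    let final_output = output.replace(random_cid, &h);

    // The catch: IPFS addresses the bytes it actually stores, and those bytes
    // now contain h, so their CID cannot be h (barring a hash collision).
    assert_ne!(cid(final_output.as_bytes()), h);
    println!("name embedded in the output: {}", h);
    println!("cid of the stored bytes:     {}", cid(final_output.as_bytes()));
}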

dvc94ch commented 5 years ago

I guess we could resolve this with a new cid format, cidv2 = <cidv1><cidv1>, where for a root we have root = x'x and for a leaf we have leaf = yx, where the hash in y is computed after replace(x, 0).
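One possible reading of that pairing, spelled out as a struct; these names are assumptions, not an existing CID format.

// Hypothetical cidv2: a pair of cidv1 addresses, one for the bytes actually
// stored and one computed after zeroing self references, which is the stable
// name the content may embed.
#[allow(dead_code)]
struct CidV2 {
    stored: String,     // cidv1 of the bytes as stored (self references filled in)
    normalized: String, // cidv1 of the bytes with self references replaced by 0
}

fn main() {
    let _leaf = CidV2 {
        normalized: "y-example".into(), // the y in "leaf = yx" above
        stored: "x-example".into(),     // the x in "leaf = yx" above
    };
}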

dvc94ch commented 5 years ago

Another solution would be an overlay fuse at /nix/store/{cid_equivalence_class} that forwards readdir calls to /ipfs/{cid_actual} based on the uid. This is basically dynamic hash rewriting at runtime.

vcunat commented 5 years ago

My understanding is that we haven't gotten near mounting, as just writing finished results wasn't performant enough. Due to the propagation of even almost-surely-insignificant changes from dependencies and high development activity (on the order of a thousand PRs/month), we need quite high long-term write throughput to cache.nixos.org... well, it was surely discussed at the links davidak posted.

dvc94ch commented 5 years ago

You have a build farm in mind. With nix flakes the owner of the flake publishes a flake -> output mapping, so nixpkgs would be split up similarly to the AUR, and the package maintainer performs the build. This makes write throughput less important.

dvc94ch commented 5 years ago

I'm currently working on a detailed proposal, and milestones. I'll post it here in a couple of days to get feedback from the nix/ipfs communities...

vcunat commented 5 years ago

Well, I can't imagine distributing building or serving to maintainers. (Apart from the fact that most packages still don't have any.) That would be a huge change in how the ecosystem works, and I'm very doubtful that would be practical.

I can imagine that each build machine of the farm would "write" their results directly to IPFS. That would be nice even for other reasons, but perhaps that way we could make the combined write throughput high enough.

dvc94ch commented 5 years ago

Preliminary work is here, obviously things are still a little vague...

https://gist.github.com/dvc94ch/2ce60a00550e83d95ed051fc81e3683e

dvc94ch commented 5 years ago

So an MVP for a decentralized, distributed nix store would be ipfs taking some good ideas from nix as described in the gist, plus a blockchain layer. But as ipfs and nix have shown as prior art, there is nothing really new here.

@vcunat what are your thoughts on the blockchain layer concept?

Blockchain layer

Publishers publish a derivation by sending a transaction to the blockchain. Registered substituters are randomly selected to build the derivation. If all substituters get the same output the build is marked as reproducible. Both substituters and publishers only require a light client.

Publishing a derivation

  1. The publisher sends a transaction to an authority. The transaction contains the cid of the derivation, a sealed commitment to later reveal the hash, retained references and size of the output, and the build time for building the derivation. A small transaction fee is paid and an amount of funds is locked (see the sketch after this list).
  2. The authority minting the block fetches the derivation and verifies that it is a valid derivation. To prevent collusion it then randomly selects n substituters out of the registered substituters using a VRF and includes the transaction in the block, together with the selected substituters and the VRF proof.
  3. All selected substituters build the derivation and send a transaction containing a sealed commitment to reveal the hash, etc. After the first substituter commits, all other substituters must commit within at most f(t).
  4. After all substituters have committed, all substituters and the publisher reveal their values. The values must be revealed within time t.
  5. The references and size are used as proof that the derivation was built. Those with deviant values are slashed. Substituters conforming to the majority are rewarded according to f(build time).
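Here is a minimal sketch of the transaction shapes these steps imply; every type and field name is an assumption made for illustration, not an existing protocol or API.

// Hypothetical transaction types for the commit-and-reveal flow above.
type Cid = String;
type Hash = [u8; 32];

#[allow(dead_code)]
enum Tx {
    // Step 1: the publisher announces a derivation, commits, and locks funds.
    Publish {
        derivation: Cid,  // cid of the derivation to build
        commitment: Hash, // sealed commitment to output hash, references, size
        build_time: u64,  // claimed build time, used to size the reward
        fee: u64,
        locked_funds: u64,
    },
    // Step 3: a selected substituter commits to the output it built.
    Commit {
        derivation: Cid,
        substituter: String,
        commitment: Hash,
    },
    // Step 4: commitments are opened; deviant values are slashed.
    Reveal {
        derivation: Cid,
        output_hash: Hash,
        references: Vec<Cid>, // retained references of the output
        size: u64,            // size of the output
        nonce: Hash,          // opens the sealed commitment
    },
}

fn main() {
    // Example: the publisher's opening move from step 1.
    let _publish = Tx::Publish {
        derivation: "example-derivation-cid".into(),
        commitment: [0u8; 32],
        build_time: 3600,
        fee: 1,
        locked_funds: 100,
    };
}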

Substituting a derivation

The chain is queried for the revealed hashes and the package is fetched. The size and retained references are checked.

FYI: The careful reader has surely noticed that there is an economic problem. If substituters are rewarded according to f(build time), the token supply is unpredictable, which might lead to the unintended consequence that building derivations reduces the token value. A possible solution would be that, instead of creating tokens out of thin air, an amount of tokens is allowed to be created according to the set inflation goals and the rest is financed through a token redistribution scheme. This would incentivise all participants to be substituters.

The time limits are used to prevent substituters from taking too long, which would give them time to patch the derivation or perform other malicious activities on the binary. Whether this is an effective method requires further analysis.

The publisher submitting a hash is meant to keep publishers from publishing derivations that don't build.

Are the retained references and the size of the output really enough proof of the derivation being built? Looking up the previous version of the derivation and guessing that the retained references are the same might work too often. Getting it wrong must be punished severely. I'm not sure if build non-determinism can lead to a different set of retained references. Build non-determinism can definitely lead to different sizes of the binary, if for example SIMD instructions are used depending on the CPU. Maybe the file layout of the output directory is a better proof?

vcunat commented 5 years ago

On a shallow read, the gist looks good to me, with the exception of the blockchain layer; details below. Note that I've never looked into the details of graphsync, bitswap, etc.

I don't know... trying to outsource the building work to untrusted machines and even make it a working (non-gift) economy – that seems way too large a project to squash into this one. And I can't see why we should try doing "everything at once". The way packages get built and the way they are distributed – IMO that's perfectly separable, even in your proposed design.

In your design I think I see a central authority that verifies packages before publishing them. We do have a similar central component already, and it just adds a signature into the meta-data of each binary package. ATM I fail to see what is hoped to be gained by adding a blockchain layer, but in any case I hope these will be real layers (i.e. exchangeable for different approaches).

dvc94ch commented 5 years ago

Yeah, I agree. The main reason is that blockchain stuff is a nice way to get funding :) But it seems like I have a new gig lined up so... After refining the concept and exploring other approaches at https://github.com/package-chain/research, I'm focusing on the file system layer for now.

I'm not sure it is possible, since nixpkgs may publish the manifest before the verification has happened on the chain. We can introduce fake publications to keep the validators honest, but anytime nixpkgs publishes something the validator can assume that it is not a fake publication and not validate the build.

Ericson2314 commented 4 years ago

Oh, I guess I never found this thread to mention https://github.com/ipfs/devgrants/blob/master/open-grants/open-proposal-nix-ipfs.md, and now https://blog.ipfs.io/2020-09-08-nix-ipfs-milestone-1/.

dvc94ch commented 4 years ago

Out of curiosity, have the pinning system and the lack of support for transactions produced any issues yet? Are you booting from packages stored in ipfs, or do you duplicate the data in ipfs and in the nix store?

Ericson2314 commented 4 years ago

pinning system

We turn Nix temporary pins into IPFS pins. We have not noticed any issues yet, but with enough concurrent GCing I'm sure something would turn up.

Lack of support for transactions

Nix also doesn't have a notion of transactions. They both should, along with transaction-scoped temporary pins, but I'm not going to worry too much about IPFS lacking something Nix also lacks, since I couldn't use it anyway. These things can be fixed.

Do you duplicate the data in ipfs and in the nix store?

We duplicate. I would like not to, but at that point, I'd basically be reimplementing Nix. Again this is something that shouldn't be a show-stopper today, and with enough momentum it can be improved in the future.

Are you booting from packages stored in ipfs

We work with Nix code as it exists, but can also turn suitably formatted IPLD objects directly into git paths, preserving the CID. In milestone 2 we have a toy example of booting from some static binaries fetched from IPFS.

dvc94ch commented 4 years ago

Actually, nix does have a notion of transactions. The PhD thesis clearly describes how transactions are used and how to keep the nix store consistent with the db. It has invariants that need to be preserved; for example, you can only add a package if all its dependencies are installed (there is a more formal definition of, I think, at least 5 invariants). But it sounds like you're using it to download a nar file and then discarding the file after installation, in which case it's not relevant for correctness.

dvc94ch commented 4 years ago

I agree it can be fixed; ipfs-embed is trying to fix these issues, and as I use ipfs-embed more and understand the problems better it'll improve. I'm currently rewriting ipfs-embed to fully support transactions. It uses a single-threaded writer and a WAL. This means that if you update a multiblock data structure you'll either have all the changes or none in case of a crash, and not a half-updated dag. There are a bunch of issues I wrote about before. Blocks are pinned recursively; I doubt that's atomic, so you can be left with something weird. At least they use bools, which means you can repin, because the operation is idempotent. On the other hand, if you pin two things recursively that share a block and then unpin one, you're left with inconsistent data, etc. Glad to discuss these issues with you if you're serious about building a system that can boot from unixfs ;)
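As an illustration of the shared-block point, here is a small self-contained sketch of a pin store that, like the one described, tracks pins with a single boolean per block; none of this is the real ipfs-embed API.

// Boolean pin flags are idempotent (safe to repin after a crash) but lose
// sharing information: unpinning one root clears blocks another root needs.
use std::collections::HashMap;

#[derive(Default)]
struct PinStore {
    pinned: HashMap<&'static str, bool>,                // block -> pin flag
    children: HashMap<&'static str, Vec<&'static str>>, // dag edges
}

impl PinStore {
    fn pin_recursive(&mut self, root: &'static str) {
        self.pinned.insert(root, true);
        for child in self.children.get(root).cloned().unwrap_or_default() {
            self.pin_recursive(child);
        }
    }
    fn unpin_recursive(&mut self, root: &'static str) {
        self.pinned.insert(root, false);
        for child in self.children.get(root).cloned().unwrap_or_default() {
            self.unpin_recursive(child);
        }
    }
}

fn main() {
    let mut store = PinStore::default();
    store.children.insert("a", vec!["shared"]);
    store.children.insert("b", vec!["shared"]);

    store.pin_recursive("a");
    store.pin_recursive("b");

    // Unpinning "a" also clears the block that "b" still depends on.
    store.unpin_recursive("a");
    assert_eq!(store.pinned["shared"], false);
    assert_eq!(store.pinned["b"], true); // "b" is pinned but its leaf is not
}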

Ericson2314 commented 4 years ago

Nix has some basic transaction functionality internally, but you want transactions externally so users can stitch together arbitrary commands, e.g.

path=$(nix add-to-store ...)
nix-build ...$path....
nix copy ....

IMO for transactions to be worthy of the name, they should be composable at every level, and the best Nix can offer for the above is symlinks, which need to be cleaned up by hand. That isn't ergonomic enough that people will actually get it right in practice.
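A hedged sketch of what that could look like from the user's side: a transaction object whose temporary pins are released automatically unless the whole sequence commits. None of these types exist in Nix or IPFS today; this only illustrates the composability being asked for.

// Hypothetical RAII-style store transaction: temporary pins live exactly as
// long as the transaction, so a failed add-to-store/build/copy sequence leaves
// nothing behind and no symlinks to clean up by hand.
struct Store {
    temp_pins: Vec<String>,
}

struct Transaction<'a> {
    store: &'a mut Store,
    committed: bool,
}

impl Store {
    fn begin(&mut self) -> Transaction<'_> {
        Transaction { store: self, committed: false }
    }
}

impl<'a> Transaction<'a> {
    // Stand-in for `nix add-to-store`: the path is pinned for this tx only.
    fn add_to_store(&mut self, path: &str) {
        self.store.temp_pins.push(path.to_string());
    }
    // Commit keeps the pins; real code would promote them to permanent roots.
    fn commit(mut self) {
        self.committed = true;
    }
}

impl<'a> Drop for Transaction<'a> {
    fn drop(&mut self) {
        if !self.committed {
            self.store.temp_pins.clear(); // rollback on failure or early return
        }
    }
}

fn main() {
    let mut store = Store { temp_pins: Vec::new() };
    {
        let mut tx = store.begin();
        tx.add_to_store("/nix/store/xxx-example");
        // if the build or copy step failed here, dropping `tx` would roll back
        tx.commit();
    }
    assert_eq!(store.temp_pins.len(), 1);
}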

ipfs-embed

Yes I have some optimism we'll eventually see Nix rewritten in Rust, in which case being able to use a thing like that would be very nice. I would like, for example, for derivations to not just produce file data but arbitrary IPLD, and to make that work something like ipfs-embed really helps.

Glad to discuss these issues with you if you're serious about building a system that can boot from unixfs

Glad to as well, though keep in mind we are not using unixfs at this time but git blob and tree hashing.

dvc94ch commented 4 years ago

I think that the entire backend can be generic and doesn't have to be nix specific: the derivation builder, nix store, etc., and the current nix could be ported to submit derivations. That would open the door to building package managers for cargo/npm that interact well with it. Cargo can emit a build plan that could be converted into derivations. Although I have not used nix or guix in years, so I'm not sure what the experience is now. I basically transitioned from nix to guix and then to arch out of disagreement about supporting proprietary firmware. The gnu people seemed to have some impractical ideas which weren't amenable to reasoning. Apart from the fact that not everyone wants to buy old, rms-approved hardware, I reasoned that the hardware is already trusted, so it makes no practical difference whether it's in silicon or firmware, and if a user stops trusting a manufacturer, after for example an acquisition, they can stop updating the firmware. Also, they thought that linux was filled with GBs of binary blobs, which after further investigation turned out to be false. There were very few blobs, from a small set of very old drivers that hadn't been maintained. Linux-libre is mostly about preventing the loading of firmware that isn't rms approved; there is no real reason to use it other than annoying devs and forcing them to build custom kernels.

Ericson2314 commented 4 years ago

That would open the door to building package managers for cargo/npm that interact well with it. Cargo can emit a build plan that could be converted into derivations.

This is in fact my long-term goal.

I think that the entire backend can be generic and doesn't have to be nix specific. Derivation builder, nix store etc and the current nix could be ported to submit derivations.

I think the derivation format is the generic format. Nix already allows one to make derivations however you like; you don't have to use the nix language. Indeed, this is how Guix forked Nix and hasn't had to change the daemon and libnixstore very much. The derivations we do in IPLD in m2 are especially nice to work with via other tools.