voltrevo opened this issue 8 years ago
Really interesting! I had a similar (less general) idea that I sent to the Mediachain team, involving just saving the command-line parameters used to derive one file from another as metadata. But this is far more general and cooler!
I'm pretty interested in this for implementing distributed simulations. At the moment, I've got a great big HPC fed by petabytes of data -- and I'm wasting the vast majority of my compute (and a big chunk of storage) redoing computations which have already been done, and reproducing files which have already been created.
It should not be too difficult to take a tool like doit or joblib and modify it so that the cache data is persisted to IPFS, so that person A can reuse intermediate simulation results which have been generated by persons B, C, D...
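To make the shape of that caching idea concrete, here is a minimal sketch in JavaScript (doit and joblib are Python tools, so this is only the concept, not those libraries). It assumes the `ipfs-http-client` package and a local IPFS daemon, and it keeps the task-to-result index in memory; sharing that index (via pubsub, a DHT record, or a pinning service) is the part that would actually let persons B, C, D reuse each other's results.

```js
import { create } from 'ipfs-http-client'

const ipfs = create({ url: 'http://127.0.0.1:5001' })
const resultIndex = new Map() // taskCid -> resultCid (local-only in this sketch)

async function memoized (taskDescription, computeFn) {
  // The task description (tool version, parameters, input CIDs) is stored as an
  // IPLD node, so its CID doubles as a deterministic cache key.
  const taskCid = (await ipfs.dag.put(taskDescription)).toString()

  if (resultIndex.has(taskCid)) {
    // Someone already produced this result; fetch it instead of recomputing.
    const chunks = []
    for await (const chunk of ipfs.cat(resultIndex.get(taskCid))) chunks.push(chunk)
    return Buffer.concat(chunks)
  }

  const output = await computeFn()                  // the expensive simulation step
  const { cid: resultCid } = await ipfs.add(output) // publish the result for others
  resultIndex.set(taskCid, resultCid.toString())
  return output
}
```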
This is similar to what Mycroft does with memoization of results for determinate-annotated functions.
I think this kind of thing sounds very exciting. Our research group works on enforcing language-level determinism in (parallel) programming, mostly in Haskell.
Nevertheless, I think that something like this should probably start at the container level. Something like a Nix build (or Dockerfile with significant restrictions) could be executed on a runtime that enforces determinism (e.g. the work of @devietti and others).
The language-level story becomes a way of reducing dynamic determinism enforcement overhead. For instance, we could conceivably make it possible to "trust" Haskell code written in a restricted mode so that dynamic overheads (2-4X) can be avoided. Also, significantly lower overheads should be achievable for sequential programs.
We could create an object containing the lambdified program, in the form of a calculus of propositions or a Haskell-like program, and either use it directly or link (via IPLD) equivalent but faster "implementations" in other languages and use those - though proving program equivalence can be hard. Objects of this kind could then be used to unambiguously represent a program as a graph of data and functions. Then, once someone has done the computation, the result can be linked into the DAG (optionally using Ethereum for trust), so people just need to traverse the graph to get the solution. See https://github.com/ipfs/apps/issues/6 and https://github.com/ipfs/ipfs/issues/47 @jbenet is this the big picture?
I think this would be a great system on top of IPFS, but not sure it should be part of IPFS core.
I would even generalize it more. Maybe instead of having to implement a virtual machine on top of IPFS, we could simply store Docker images on IPFS (with nice deduplication of image layers). Then you can choose whether you want to store an image which is deterministic or an image which is not. For some use cases you care about determinism, for others you do not.
Then on top of that you store computations: inputs + which image to use.
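As a rough illustration of what "inputs + which image to use" could look like as an IPLD object, here is a hypothetical sketch using `ipfs-http-client` and `multiformats`; the field names (`image`, `inputs`, `command`) are made up for this example.

```js
import { create } from 'ipfs-http-client'
import { CID } from 'multiformats/cid'

const ipfs = create()

// Publish a computation record: which image to run and which inputs to feed it.
async function publishComputation (imageCid, inputCids, command) {
  const cid = await ipfs.dag.put({
    image: CID.parse(imageCid),                 // the (ideally deterministic) Docker image
    inputs: inputCids.map(c => CID.parse(c)),   // input data, by hash
    command                                     // how to invoke it inside the container
  })
  return cid.toString() // workers can discover this record, pin it, and run it
}
```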
Then you have workers which run those computations and store the result, with a link back to the computation. Here again you have multiple options. You could have your own set of trusted workers picking up pending computations from IPFS, running them, and storing the results in IPFS. Because you trust the workers, you trust the output. So you can even do non-deterministic computations in this way.
We could have a set of workers provided by the community, which would do something similar to Bitswap: running computations for others if others run computations for them. Here multiple trust strategies are possible. One is that a few workers run the computation and we then check that all outputs are the same, similar to how SETI/BOINC works. Or we could require all workers to run all computations and only then trust the output. It really depends on your trust model.
If we have distributed 3D rendering happening on top of IPFS, or scientific computations on top of IPFS, maybe a replication factor of 3 is good enough; we do not need everyone to run everything.
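A sketch of that replication-based trust model, in which results from independent workers are only accepted once enough of them agree; `resultsByWorker` and the quorum threshold are hypothetical, and would come from result objects that workers published with links back to the computation.

```js
// Accept a result only if at least `quorum` workers reported the same output CID.
function acceptResult (resultsByWorker /* Map<workerId, outputCid> */, quorum = 3) {
  const counts = new Map()
  for (const [, outputCid] of resultsByWorker) {
    counts.set(outputCid, (counts.get(outputCid) || 0) + 1)
  }
  for (const [outputCid, count] of counts) {
    if (count >= quorum) return outputCid // enough agreement to trust this output
  }
  return null // no agreement yet; schedule the computation on more workers
}

// e.g. acceptResult(new Map([['w1', 'bafy...'], ['w2', 'bafy...'], ['w3', 'bafy...']]))
```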
Also, Docker images could be run inside a system like Intel SGX, to prove that they have been run without tampering.
So, instead of limiting computation on top of IPFS to only one set of values, I am proposing that we make it layered, so that we can cover use cases where you might not need determinism, because you have trusted workers or because the probability of somebody doing something malicious is low. Then you can build determinism, and stricter requirements, on top of that. See it as IP vs. TCP: IP is serialized computations, and TCP is "I want to have some guarantees".
In fact I think that this would create a new Internet. Instead of moving data around, you would move computations around. Each IP packet would be a Docker image, and instead of routing we would describe what the input to the Docker image is and what the output is. And so on.
I have written more about this idea in this blog post.
//cc @nicola
By the way, another piece of related work is Google's Bazel/Blaze which has a functional evaluation engine at its core for a DAG of deterministic functions. However, determinism is assumed rather than strictly enforced.
Check out our global neural net supercomputer project (https://computes.io). The network and storage layers of this global computer are now 100% powered by IPFS. The computes layer is made up of signed JavaScript bytecode "kernel" jobs that can target specific "cores" based on job, location, MIPS, OS, etc.
I think that a bytecode shouldn't limit itself to pure functional programming. I think that something low-level like WASM should be sufficient for mapping inputs to outputs, since it doesn't have access to anything external beyond what the environment sets it up with.
One could store a WASM program on IPFS, then use any of the many widely available engines to execute it. A program should just define a main method which takes the input parameters, and have a way to fetch data from IPFS hashes. Trying to access a hash that doesn't exist should cancel the whole execution. Internally, WASM can use whatever mutable memory it wants, since it can't access anything external to its inputs anyway. Every program execution should use a clean block of WASM memory to prevent any leaks, and memory should be dumped after execution.
A lot of languages support WASM as a compile target, so porting existing code is straightforward. There are also several different engines for running WASM, which should all behave the same way since the language standard is fairly precise.
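A minimal sketch of the "clean block of WASM memory per execution" point: a fresh instance (and therefore fresh linear memory) per run, with an empty import object so the module gets no outside capabilities. It assumes the module exports its memory and a `main(ptr, len)`-style entry point and expects its input at offset 0 of memory; a real ABI would export an allocator instead.

```js
async function runOnce (wasmBytes, inputBytes) {
  // New instance every time; {} means no imports, i.e. no outside capabilities.
  const { instance } = await WebAssembly.instantiate(wasmBytes, {})
  // Copy the input into the module's linear memory (offset 0 is assumed here).
  new Uint8Array(instance.exports.memory.buffer).set(inputBytes, 0)
  const result = instance.exports.main(0, inputBytes.length)
  // The instance is simply dropped afterwards, so no state leaks between runs.
  return result
}
```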
I like @mitar's suggestion to start at the container level. (Though the language level, like in this computes.io project, is also reasonable. And WASM seems like an especially nice place to standardize, if going the language-level or IR route.)
And, yes, determinism (like other guarantees, even those with mechanized proofs) ultimately comes down to trust. E.g. some containers could be signed to indicate that they were built to include a deterministic sandboxing regime. Certainly, sign-and-trust for a non-deterministic execution can allow the result to be reused without determinism, but determinism adds a nice bonus. Specifically, you can reduce the trust needed for sharing results -- after they are confirmed by N uncorrelated clients.
Unfortunately, there's just not any determinism-sandboxing software that's ready for prime time. On the academic side, our latest iteration of this appeared at OOPSLA'17. In that system, "detflow", we allow language-enforced and runtime-enforced determinism to integrate together.
But there's a lot of work left to do to robustly support arbitrary x86 binaries in a deterministic runtime sandbox. The version in that paper uses LD_PRELOAD to intercept program behavior and is far from 100% watertight, certainly not robust to actual adversarial programs that are trying to break determinism.
@rrnewton WASM might not be doing anything special for enforcing determinism, but the feature set is so small that it doesn't have access to any sources of entropy by default. Any sources of entropy (like time, GPU variations, multithreaded execution) you'd be able to make use of in most language runtimes don't exist and must be passed in from JS. Since all WASM runtimes implement the spec, if we don't provide access to external sources of randomness, then a program written in WASM will always evaluate to the same result given the same inputs.
What I mean is that limiting the scope of what a program has access to will simplify the job of making it deterministic. Of course limiting scope will also limit the types of applications that can run in this environment, but I think that the sort of things that would require massive distributed computation are going to be compute-heavy, not "randomness-heavy". For cases where you really need to do things based on time or other sources of entropy, your application probably doesn't need to be deterministic in the first place.
Development on computes.io (now computes.com) is progressing nicely! We have pivoted to using Docker containers rather than supporting only JavaScript sandboxes. This allows support for TensorFlow (Python) development and all other programming languages, native binaries, and COTS (commercial-off-the-shelf) products.
We are using IPFS as our file system, mesh networking, and pubsub messaging. We have also built a blockless ledger on top of the IPFS DAG. This should support faster transaction speeds, smaller footprints, and no wasted proof-of-work computations as compared to blockchain alternatives.
We now have a signup list for the beta scheduled to release Q1 2018 at https://computes.com.
@RangerMauve - I don't yet know much about WASM, so I'm not currently clear on how, given an untrusted piece of code, to best go about ensuring that it doesn't get entropy "passed in from JS". That interface needs to be monitored/protected/sandboxed somehow, right? E.g. a trusted WASM->WASM transpiler would be one way to do this step.
Ultimately I think one will need to tackle the question of how to do parallel, high-throughput programming. But in the short-to-medium term I think the right solution is to just cut off sources of entropy and allow sequential programs only. (Well, embarrassingly parallel programs are no trouble to support if they expose coarse-grained parallelism as workflows/DAGs of steps that don't share any state...)
@rrnewton MDN has a great overview that describes how to use WASM.
The gist of it is that you take a blob of binary WASM instructions and create an instance of the module. When creating the instance you have the option of specifying an imports object: a map of JS objects whose methods should be accessible to WASM.
e.g.

```js
// Create an instance of a WASM module and give it a `Math` import object with a `sqrt` function
WebAssembly.instantiate(WASM_CODE_HERE, {
  Math: {
    sqrt: function (n) { return Math.sqrt(n) }
  }
})
```
In your WASM code you now have an import instruction which allows you to get a reference to the Math.sqrt function. This works similarly to how you can import a function from a C library using its ABI, or use a foreign function interface to call into and out of C from another language.
By default WASM only has access to operations for moving bytes within its own memory, doing math, and defining reusable functions; it can't do anything else unless you pass in imports for it. It doesn't have access to the rest of the browser context like the DOM or HTTP.
Since the code used for actually running WASM modules in a distributed system gets to decide which imports a module has access to, it can decide to leave the imports empty, or limit them to whatever subset makes sense for the runtime.
I would propose that the runtime should provide a mechanism for fetching data out of IPFS by hash. This would make it easier to pass arguments to programs, since you could tell it to `get(hash, callback_pointer)` and have it resume the code with the resulting data after it has loaded. Since hashes can be safely used to identify data, you know that if a given module gets a hash as an argument it will always fetch the same data regardless of where it's executing. Of course, one would also need logic to abort execution if a module tries to fetch a hash that cannot be resolved.
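A hypothetical sketch of how the host side of such a get-by-hash mechanism could be wired up as a WASM import. The import name (`ipfs_get`) and its ABI are made up; since WASM imports are synchronous, this sketch serves requests from blocks resolved ahead of time, whereas a real runtime would need an async resume scheme (e.g. Asyncify) or the `callback_pointer` approach described above.

```js
async function instantiateWithIpfsGet (wasmBytes, preResolved /* Map<cidString, Uint8Array> */) {
  let memory // filled in once the instance exists

  const imports = {
    env: {
      // ipfs_get(cidPtr, cidLen, destPtr) -> number of bytes written, or -1 to abort
      ipfs_get (cidPtr, cidLen, destPtr) {
        const cid = new TextDecoder().decode(new Uint8Array(memory.buffer, cidPtr, cidLen))
        const data = preResolved.get(cid)
        if (!data) return -1 // unresolvable hash: signal the module to abort
        new Uint8Array(memory.buffer).set(data, destPtr)
        return data.length
      }
    }
  }

  const { instance } = await WebAssembly.instantiate(wasmBytes, imports)
  memory = instance.exports.memory
  return instance
}
```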
@chrismatthieu Since a potential goal is using people's donated browsers, how are you getting Docker images to work in those environments? Or are you relying on compute donors to install Docker on their machines?
@RangerMauve - It sounds like WASM is well designed for untrusted code execution. A nice side effect of determinism is that by cutting those WASM programs off from timing information, you also cut them off from performing side-channel attacks that require fine-grained timing support ;-).
@rrnewton Yeah! It's really convenient that this showed up right when decentralized computation on the web is becoming more feasible. :P
Something to keep in mind, however, is that WASM has a bunch of features on its roadmap that will introduce non-determinism (like multithreading, garbage collection, and importing JS modules). However, I think it will be possible to statically analyze WASM binaries, detect the instructions that make use of these features, and reject such modules from running, if that level of security is required.
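One cheap static check of that kind can already be done with the standard `WebAssembly.Module.imports` API: reject any module that declares an import outside an allowlist. This only catches capabilities requested via imports; spotting instructions such as atomics in the bytecode itself would need a real decoder (e.g. something along the lines of the wasm-metering project mentioned later in this thread). The `env.ipfs_get` entry below is a hypothetical allowed import, matching the earlier sketch.

```js
function moduleIsAllowed (wasmBytes, allowed = new Set(['env.ipfs_get'])) {
  const module = new WebAssembly.Module(wasmBytes)
  // Every declared import must be on the allowlist, otherwise refuse to run it.
  return WebAssembly.Module.imports(module).every(imp =>
    allowed.has(`${imp.module}.${imp.name}`)
  )
}
```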
@RangerMauve We are initially targeting very large companies with massive computing demands. More than likely, we will offer the Javascript sandbox in the near future for the B2C marketplace and browser support.
It is good to see people getting enthused about this.
For my part, I am now working for a new company (doing much the same flavour of work as previously) and have started to implement my back-end data processing using Luigi, which lets me code in Python and write my own functions to determine whether files are up to date (i.e. I can query a cache if I want to).
I don't really need language-level guarantees of determinism (I just need to write the requirement and make sure that it gets tested) ... but I would quite like to experiment with using a distributed cache built on IPFS that gets called from my Luigi nodes.
I kind of want to revive this thread by mentioning https://github.com/ewasm/wasm-metering
Also, it would be exciting to evolve compressors in the standardized VM bytecode, and then include either the decompressor itself, or a content hash of the decompressor bytecode with the data.
Somewhat related: this is a proof of concept that resulted from a few-day hackathon. It is really early WIP, but I think it includes many of the concepts suggested in this issue. Sharing it here in case it sparks some additional inspiration.
Something I want to try is to implement something similar to Atmo over IPFS; I think it could lead to some pretty cool use cases on the network.
@adlrocha "something similar to Atmo over IPFS" is exactly what I had in mind, but I see you stopped working on your initial idea. Was there any blocker?
Some cool progress on WASM stuff in the IPFS space is this demo of using WASM to decode IPLD data on the fly: https://github.com/aschmahmann/wasm-ipld
We've also got a WASM chat on the IPFS Discord/Matrix and the Filecoin Slack where we're actively talking about this.
There's a working group being formed for folks actively working on specifying this.
Also check out this talk on the subject:
Hope this is the right place to submit this; if not, please let me know.
Anyway, suppose we had a standard bytecode for a deterministic virtual machine. That means any program compiled into that bytecode would be guaranteed to always produce the same output for any given input. Functional languages like Lisp, Haskell, etc. obviously come to mind, but really any language could conceivably be used; it only matters that programs ultimately have access only to their input and can't get at any sources of randomness (e.g. in a JS implementation, `Math.random()` would either be a compilation error or rely on a fixed arbitrary seed) or anything mutable, like reading from a (mutable) filesystem.

Ok, so once you have that, I imagine this could be a powerful extension to IPFS. Instead of just asking for the data represented by a hash, you could, in a standard way, specify a pair of hashes representing `[input data, program bytecode data]`, and you would get the hash of the output returned. This means you would have to trust that the node executed the program correctly, but you could potentially ask multiple nodes the same thing and make sure you get the same response.

Because hey, what if I just want to know the size of the file represented by a hash? You could extend IPFS (or maybe this is already in IPFS) to handle this specific case, or you could just have a de-facto standard program which does this for you. People would be incentivised to use existing programs where possible, because then the outputs they seek would be more likely to have already been computed by someone else. This way you don't have to extend the protocol every time you think of some useful transform on data that people might care about, like 'scale down the image at this hash', 'mix these two audio tracks together', 'stitch these videos together', 'render this html page into an image and send it to me', etc.

All of this could be built on top of a deterministic VM to generalize IPFS in the same way that Ethereum generalizes Bitcoin. If a program runs for too long, the IPFS node could return an error saying 'I couldn't complete your computation, but here's a hash to the state of the serialized VM, so you can send that to another node to continue the computation'.

Maybe this has been proposed before? Is there an effort towards this already? @jbenet wdyt?
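To make the proposal slightly more concrete, here is a hypothetical sketch of what such a derived-content request and its responses could look like (all field names and the placeholder hashes are made up): the output is addressed by the pair of hashes described above, and a node that runs out of time hands back a hash of the serialized VM state so another node can continue.

```js
// Request: "give me the output of this program applied to this input".
const request = {
  type: 'compute',
  program: 'Qm...',   // hash of the deterministic bytecode
  input: 'Qm...'      // hash of the input data
}

// Possible responses from a node:
const done = { type: 'result', output: 'Qm...' }                 // hash of the output
const partial = { type: 'timeout', vmState: 'Qm...' }            // serialized VM, resume elsewhere

// A client could send the same request to several nodes and only trust `output`
// once enough of them return the same hash (see the quorum discussion earlier).
```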