stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

suggestion: add support for packing many files into a single transaction #81

Closed: shea256 closed this issue 6 years ago

shea256 commented 9 years ago

Currently, you can register a key-value pair on blockstore and perform lookups on the pair by entering the key and getting back the value.

This works great for certain applications like registering user identities, but it runs into problems when other types of objects need to be registered in the blockchain.

For example, validating/timestamping/proving the existence of a billion documents in the blockchain would require so many transactions, so much embedded data, and so much money that the process simply wouldn't be feasible.

A way to get around this would be to pack the registration of multiple objects into a single transaction. A single transaction/key-value pair registration could have a single unique key associated with the merkle root hash of a merkle tree, where each item in the merkle tree is the hash of an object to be registered.

So to register 128 image files on the blockchain, you could do the following (a rough sketch of steps 1 and 2 follows the list):

  1. hash all 128 image files
  2. build a merkle tree from the hashes
  3. register a unique name
  4. include in the value/JSON file a list of file object descriptors, where each descriptor has a locally unique subdomain and the hash of the file
  5. include in the value a root URL where the files can be served
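
To make steps 1 and 2 concrete, here's a minimal sketch, assuming plain SHA-256 and a simple pairwise Merkle construction; none of the helpers are actual blockstore code, and the filenames match the example further down.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def hash_file(path: str) -> str:
    # step 1: hash an image file
    with open(path, "rb") as f:
        return sha256_hex(f.read())

def merkle_root(leaf_hashes: list[str]) -> str:
    # step 2: pairwise-hash the hex digests up to a single root,
    # duplicating the last node on odd-sized levels
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [sha256_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

leaves = [hash_file(f"2015-05-19-birthday-{i}") for i in range(1, 129)]
root = merkle_root(leaves)   # one root commits to all 128 files
```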

To perform lookups, you would need to provide the domain and the subdomain of the packed file.

NOTE: the files can't realistically be stored in the DHT, as that would lead to an insane amount of DHT bloat (imagine storing 1B images at ~100KB each). We're thinking of controlling bloat by requiring a bitcoin transaction for each stored item (with a max size), and packing multiple files into a single transaction wouldn't be compatible with that scheme.

Here's an example of how this would work:

Let's say I have an album of 128 images from my birthday party and I want to register them in the blockchain.

And let's say their filenames are as follows:

2015-05-19-birthday-1
2015-05-19-birthday-2
2015-05-19-birthday-3
...

To do this, I first register "ryan-may2015" in a namespace for image storage.

Next, I create a JSON file and include that as the value associated with the name (where the JSON file goes into the DHT and the hash of the file is put in the blockchain):

```json
{
    "fileservers": ["https://cdn.shea.io/albums/may2015/"],
    "files": {
        "birthday-1": "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb",
        "birthday-2": "3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d",
        "birthday-3": "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
        ...
    }
}
```
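
Only the hash of this JSON file would be anchored in the blockchain; the file itself goes into the DHT. A minimal sketch of that split, where `put_in_dht` and `register_name` are hypothetical placeholders rather than the real blockstore API:

```python
import hashlib
import json

profile = {
    "fileservers": ["https://cdn.shea.io/albums/may2015/"],
    "files": {
        "birthday-1": "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb",
        "birthday-2": "3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d",
        # ... one entry per file
    },
}

# canonical serialization, so everyone derives the same hash for the same value
serialized = json.dumps(profile, sort_keys=True, separators=(",", ":")).encode()
value_hash = hashlib.sha256(serialized).hexdigest()

# put_in_dht(value_hash, serialized)           # hypothetical: DHT stores the full JSON
# register_name("ryan-may2015", value_hash)    # hypothetical: blockchain stores name -> hash
```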

Last, I make sure to store my files on my personal content-addressed fileserver.

That means my file 2015-05-19-birthday-1 would be stored here:

https://cdn.shea.io/albums/may2015/ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
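
In other words, the fileserver serves each file under its SHA-256 digest rather than its filename. A tiny sketch:

```python
import hashlib

FILESERVER_ROOT = "https://cdn.shea.io/albums/may2015/"

def content_url(path: str) -> str:
    # the file is served under its sha256 digest, not its original filename
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return FILESERVER_ROOT + digest

# content_url("2015-05-19-birthday-1") -> the URL above
```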

Then, I can perform lookups as follows:

blockstore-cli lookup ryan-may2015.birthday-1

...and I would get back the raw file.

This new blockstore lookup would do the following (sketched in code after the list):

  1. lookup the name "ryan-may2015" in the index and get the associated hash
  2. lookup the JSON file by the hash in the DHT
  3. find the entry in the JSON file with the key "birthday-1" and grab the hash
  4. issue a request to the fileserver and inspect the response to make sure the file returned matches the hash
  5. return the response to the client
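
Here's a rough sketch of that resolution path; `lookup_name_hash` and `dht_get` are hypothetical stand-ins for the existing name index and the DHT, not real blockstore calls:

```python
import hashlib
import json
import urllib.request

# hypothetical stand-ins; the real name-index and DHT calls would go here
def lookup_name_hash(name: str) -> str: ...
def dht_get(value_hash: str) -> bytes: ...

def resolve(fqn: str) -> bytes:
    """Resolve e.g. 'ryan-may2015.birthday-1' to the raw, verified file."""
    name, subkey = fqn.split(".", 1)
    value_hash = lookup_name_hash(name)          # 1. name -> hash in the index
    profile = json.loads(dht_get(value_hash))    # 2. hash -> JSON file in the DHT
    file_hash = profile["files"][subkey]         # 3. subkey -> file hash
    for server in profile["fileservers"]:        # 4. fetch and check the hash
        data = urllib.request.urlopen(server + file_hash).read()
        if hashlib.sha256(data).hexdigest() == file_hash:
            return data                          # 5. return the verified file
    raise LookupError("no fileserver returned data matching the hash")
```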

Yay, I just registered an entire photo album in the blockchain!

I think we could realistically store 100-1K files in each transaction, and maybe 10K if we pushed the data limits of the DHT.

Thoughts @denisnazarov @muneeb-ali @jessewalden @moudy?

denisnazarov commented 9 years ago

@shea256 thanks for this writeup. The problem I see is that the proposed method requires a relationship between the "bucket" name and the files inside it: I need to know the key ryan-may2015 to query one of the items inside it, or there needs to be an index mapping the bucket name to its items somewhere else.

In #79, we will be hashing all of the events in our system at an interval, to parallel your example:

  1. Register a timestamp-based key in a namespace for image storage at some cost-effective interval
  2. Create a JSON file with N canonical identifiers

The problem is that there is no relationship between the timestamp-based key and the array of canonical identifiers in the JSON file, so there would need to be an index somewhere else.

Questions:

  1. Can this index mapping the namespace key to the individual items in the JSON be decentralized and still be queried efficiently?
  2. Alternatively, can multiple namespace registrations be hashed into one BTC transaction, instead of using one transaction per registration? This would allow the DHT and its index to be used for queries.

shea256 commented 9 years ago

> The problem is that there is no relationship between the timestamp-based key and the array of canonical identifiers in the JSON file, so there would need to be an index somewhere else.

Can you elaborate on this?

> 1) Can this index mapping the namespace key to the individual items in the JSON be decentralized and still be queried efficiently?

Yes. In my example, looking up ryan-may2015.birthday-1 would simply look up ryan-may2015 in a particular namespace, grab the associated JSON file from the DHT, and then skip down to the entry with the key "birthday-1". For all intents and purposes, a lookup with a subdomain/subkey would be the same as a lookup on a normal domain/key. Completely decentralized.

> 2) Alternatively, can multiple namespace registrations be hashed into one BTC transaction, instead of using one transaction per registration? This would allow the DHT and its index to be used for queries.

I suppose in theory we could support this, but what is the difference between this and using the subdomain method?


Can the timestamp-based key have forced uniqueness based on calculation? As in, is there a hash-derived component, and if not, can one be added? If so, we would then have unique keys for each image entry, enforced not by the namespace but by the hashing function.

If the answer is no to the above (and you're using something like UUIDs), can we have timestamp-based keys for groups of images, and have the unique key for an image be the timestamp-based key combined with the local key: de305d54-75b4-431b-adb2-eb6b9e546014.12?
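
For concreteness, a sketch of what the combined key could look like; the separator and helper names are arbitrary, just to illustrate the idea:

```python
# combine a timestamp/UUID-based bucket key with a local index so each item
# still gets a globally unique, derivable key
def item_key(bucket_key: str, local_index: int) -> str:
    return f"{bucket_key}.{local_index}"

def split_item_key(key: str) -> tuple[str, int]:
    bucket_key, local = key.rsplit(".", 1)
    return bucket_key, int(local)

# item_key("de305d54-75b4-431b-adb2-eb6b9e546014", 12)
# -> "de305d54-75b4-431b-adb2-eb6b9e546014.12"
```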

denisnazarov commented 9 years ago

A potential hash-derived component is illustrated in https://github.com/ipfs/faq/issues/15, but it seems infeasible as a key because of the number of possible collisions when you have a very big dataset. I'd welcome your feedback on that issue in ipfs.

Adding the time-based key to the local key sounds like it should work, since now you know which bucket to look in. Will think some more about this.

shea256 commented 9 years ago

> A potential hash-derived component is illustrated in ipfs/faq#15, but it seems infeasible as a key because of the number of possible collisions when you have a very big dataset. I'd welcome your feedback on that issue in ipfs.

What do you mean by collisions? Collisions with hashing functions shouldn't even be a concern.

> Adding the time-based key to the local key sounds like it should work, since now you know which bucket to look in. Will think some more about this.

Cool, let me know what you think once you've had a chance to mull it over for a bit. I think this might be the way to go.

shea256 commented 9 years ago

@denisnazarov Any updates to how you're thinking about this?

jcnelson commented 9 years ago

@denisnazarov @shea256 @muneeb-ali

> 1. Can this index mapping the namespace key to the individual items in the JSON be decentralized and still be queried efficiently?

I'm guessing the reason you ask about decentralization is that the reader wants assurance that you're not lying when you resolve "de305d54-75b4-431b-adb2-eb6b9e546013" to this metadata? If so, and if we can set things up so that you can't convincingly lie anyway, then I'd argue that the degree of centralization doesn't matter here.

> In #79, we will be hashing all of the events in our system at an interval, to parallel your example:
>
> 1. Register a timestamp-based key in a namespace for image storage at some cost-effective interval
> 2. Create a JSON file with N canonical identifiers

Conceptually, what if you just treated each content's metadata like a file in a git repository? I'm not suggesting actually using git, but the point is that you would maintain the history of changes to each metadata record as a Merkle tree. Then, once the reader had the latest Merkle root for the metadata, they could "clone" it.

The challenge for the reader is in getting an authentic and fresh Merkle root for a given content metadata record. What you could do is pull the commit histories for each metadata record that changed in the interval of time between snapshots, build a Merkle tree out of them, and include the root in a blockchain transaction. This basically attests that "These are the sequences of edits to all pieces of content that changed in this time interval."

Once you do this, you could run an arbitrary index service, since the reader would be able to catch you if you were lying. The reader uses the service to resolve the CCID to the Merkle root of the snapshot in the blockchain (which it can verify is authentic by auditing the blockchain). The reader verifies that this is a valid Merkle root by "cloning" the metadata (performing the integrity check) and verifying that it has the right CCID. The reader verifies that this is the latest Merkle root by auditing your snapshots in the blockchain.

The snapshot root would resolve to the snapshot's Merkle tree, and each hash in the snapshot Merkle tree would resolve to the sequence of new Merkle roots observed for the metadata at that snapshot's point in time. The reader simply confirms that the latest Merkle root in the sequence is the same as the Merkle root of the metadata's edit history that it just cloned. You could store the index itself, as well as the JSON documents, in the DHT.
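
A rough sketch of the snapshot step as I'm imagining it; the record IDs and per-record edit histories are illustrative, and the pairwise Merkle construction is just one possible choice:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(hashes: list[str]) -> str:
    # simple pairwise Merkle construction over hex digests
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [sha256_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

def snapshot_root(changed_records: dict[str, list[str]]) -> str:
    """changed_records maps a record ID (e.g. a CCID) to the ordered list of
    Merkle roots its edit history produced during this interval."""
    per_record = [
        sha256_hex((record_id + "".join(edit_roots)).encode())
        for record_id, edit_roots in sorted(changed_records.items())
    ]
    return merkle_root(per_record)

# Only the returned root goes into the blockchain transaction; the snapshot
# tree and the per-record edit histories would live in the DHT, where a reader
# can audit them against that root.
```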

Would this work?

jackzampolin commented 6 years ago

From what I'm reading here, it seems like the system has changed significantly since this discussion. I'm going to close this. Reopen if I'm wrong!