@shea256 thanks for this writeup. The problem I see is that the proposed method requires a relationship between the "bucket" name and the files inside of it: I need to know the key ryan-may2015 to query one of the items inside, or there needs to be an index mapping the bucket name to its items somewhere else.
In #79, we will be hashing all of the events in our system at an interval, to parallel your example:
- Register a timestamp-based key in a namespace for image storage at some cost-effective interval
- Create a JSON file with N canonical identifiers
The problem is that there is no relationship between the timestamp-based key and the array of canonical identifiers in the json file, so there would need to be an index somewhere else.
Question:
The problem is that there is no relationship between the timestamp-based key and the array of canonical identifiers in the json file, so there would need to be an index somewhere else.
Can you elaborate on this?
1) Can this index mapping the namespace key to the individual items in the json be decentralized and have the ability to be queried in an efficient manner?
Yes. In my example, looking up ryan-may2015.birthday-1 would simply look up ryan-may2015 in a particular namespace, grab that JSON file from the DHT, and then skip down to the entry with the key "birthday-1". For all intents and purposes, a lookup with a subdomain/subkey would be the same as a lookup on a normal domain/key. Completely decentralized.
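For concreteness, here is a minimal sketch of that resolution in Python, with the blockchain name lookup and the DHT fetch passed in as hypothetical callables (not actual blockstore calls):

```python
import json

def lookup_subkey(namespace_lookup, dht_get, namespace, name, subkey):
    """Resolve e.g. ("images", "ryan-may2015", "birthday-1") to a file hash.

    namespace_lookup(namespace, name) -> hash recorded in the blockchain for the name
    dht_get(hash)                     -> raw value stored in the DHT under that hash
    Both are stand-ins; blockstore's real API may differ.
    """
    record_hash = namespace_lookup(namespace, name)   # step 1: blockchain-backed name lookup
    index = json.loads(dht_get(record_hash))          # step 2: fetch the JSON file from the DHT
    return index[subkey]                              # step 3: skip down to the subkey's entry
```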
2) Alternatively, can multiple namespace registrations be hashed in one BTC transaction, instead of using one transaction per registration? This will allow leveraging the DHT and its index to query.
I suppose in theory we could support this, but what is the difference between this and using the subdomain method?
Can the timestamp-based key have forced uniqueness based on calculation? As in, is there a hash-derived component, and if not, can one be added? If so, we would have unique keys for each image entry, enforced not by the namespace but by the hashing function.
If the answer is no to the above (and you're using something like UUIDs), can we have timestamp-based keys for groups of images, and have the unique key for an image be the timestamp-based key combined with the local key: de305d54-75b4-431b-adb2-eb6b9e546014.12?
A potential hash-derived component is illustrated in https://github.com/ipfs/faq/issues/15, but it seems infeasible as a key because of the amount of collisions possible when you have a very big dataset. Welcome your feedback on that issue in ipfs.
Adding the time-based key to the local key sounds like it should work, since now you know which bucket to look in. Will think some more about this.
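As a quick sketch of that combination (the "." separator and formatting are just an assumption on my end):

```python
# Combine the bucket's timestamp/UUID key with the item's local key to get a
# unique, bucket-addressable key, e.g. "de305d54-75b4-431b-adb2-eb6b9e546014.12".
def combined_key(bucket_key, local_key):
    return "{}.{}".format(bucket_key, local_key)

print(combined_key("de305d54-75b4-431b-adb2-eb6b9e546014", 12))
```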
A potential hash-derived component is illustrated in ipfs/faq#15, but it seems infeasible as a key because of the amount of collisions possible when you have a very big dataset. Welcome your feedback on that issue in ipfs.
What do you mean by collisions? Collisions with hashing functions shouldn't even be a concern.
Adding the time-based key to the local key sounds like it should work, since now you know which bucket to look in. Will think some more about this.
Cool, let me know what you think once you've had a chance to mull it over for a bit. I think this might be the way to go.
@denisnazarov Any updates to how you're thinking about this?
@denisnazarov @shea256 @muneeb-ali
- Can this index mapping the namespace key to the individual items in the json be decentralized and have the ability to be queried in an efficient manner?
I'm guessing the reason you ask about decentralization is that the reader wants assurances that you're not lying when you resolve "de305d54-75b4-431b-adb2-eb6b9e546013" to this metadata? If so, and if we can set it up so that you can't convincingly lie anyway, then I'd argue that the degree of centralization doesn't matter here.
In #79, we will be hashing all of the events in our system at an interval, to parallel your example:
- Register a timestamp-based key in a namespace for image storage at some cost-effective interval
- Create a JSON file with N canonical identifiers
Conceptually, what if you just treated each content's metadata like a file in a git repository? I'm not suggesting actually using git, but the point is that you would maintain the history of changes to each metadata record as a Merkle tree. Then, once the reader had the latest Merkle root for the metadata, they could "clone" it.
The challenge for the reader is in getting an authentic and fresh Merkle root for a given content metadata record. What you could do is pull the commit histories for each metadata record that changed in the interval of time between snapshots, build a Merkle tree out of them, and include the root in a blockchain transaction. This basically attests that "These are the sequences of edits to all pieces of content that changed in this time interval."
Once you do this, you could run an arbitrary index service, since the reader would be able to catch you if you were lying. The reader uses the service to resolve the CCID to the Merkle root of the snapshot in the blockchain (which it can verify is authentic by auditing the blockchain). The reader verifies that this is a valid Merkle root by "cloning" the metadata (performing the integrity check) and verifying that it has the right CCID. The reader verifies that this is the latest Merkle root by auditing your snapshots in the blockchain: the snapshot root resolves to the snapshot's Merkle tree, and each hash in that tree resolves to the sequence of new Merkle roots observed for the corresponding metadata at that snapshot's point in time. The reader simply confirms that the latest Merkle root in that sequence matches the Merkle root of the edit history it just cloned. You could store the index itself, as well as the JSON documents, in the DHT.
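Here is a rough sketch of the reader-side check, assuming SHA-256 and a simple proof format (none of these names are an existing API): the reader recomputes the head of the metadata's edit history from the clone, then checks a Merkle proof tying that head to the snapshot root it already verified in the blockchain.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def edit_history_head(edits):
    """Recompute the head hash of a git-like edit chain, where each edit commits to its parent."""
    head = b""
    for edit in edits:                       # raw serialized edits, oldest first
        head = sha256(head + sha256(edit))
    return head

def verify_against_snapshot(head, proof, snapshot_root):
    """proof: list of (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    node = head
    for sibling, sibling_is_left in proof:
        node = sha256(sibling + node) if sibling_is_left else sha256(node + sibling)
    return node == snapshot_root
```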
Would this work?
From what I'm reading here, it seems like the system has changed significantly since this discussion. I'm going to close this. Reopen if I'm wrong!
Currently, you can register a key-value pair on blockstore and perform lookups on the pair by entering the key and getting back the value.
This works great for certain applications like registering user identities, but it runs into problems when other types of objects need to be registered in the blockchain.
For example, in order to validate/timestamp/prove the existence of a billion documents in the blockchain, this would require a lot of transactions, a lot of data embedding, and a lot of money, to such a ridiculous extent that this process would not be feasible.
A way to get around this would be to pack the registration of multiple objects into a single transaction. A single transaction/key-value pair registration could have a single unique key associated with the root hash of a Merkle tree, where each leaf in the tree is the hash of an object to be registered.
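For concreteness, here is one minimal way to build such a root in Python, assuming SHA-256 and duplicate-last-node padding (the actual tree construction could differ), with the final registration call left hypothetical:

```python
import glob
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    """Pairwise-hash leaves upward until a single root remains."""
    level = list(leaf_hashes)
    if not level:
        raise ValueError("need at least one leaf")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                      # duplicate the last node on odd levels
        level = [sha256(a + b) for a, b in zip(level[0::2], level[1::2])]
    return level[0]

# Hash each object to be registered, then commit to all of them under one key.
paths = sorted(glob.glob("birthday-album/*"))             # e.g. the 128 image files
leaves = [sha256(open(p, "rb").read()) for p in paths]
root = merkle_root(leaves)
# register("ryan-may2015", root)  # hypothetical call: one transaction for the whole batch
```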
So to register 128 image files on the blockchain, you could do the following:
1. Register a name for the batch in a namespace.
2. Create a file mapping each image to its hash, store that file in the DHT, and associate its hash with the name in the blockchain.
3. Store the image files themselves in a content-addressed fileserver.
To perform lookups, you would need to provide the domain and the subdomain of the packed file.
NOTE: the files can't realistically be stored in the DHT, as that would lead to an insane amount of DHT bloat (imagine storing 1B images at 100KB avg.). We're thinking of controlling bloat by requiring a bitcoin transaction for each stored item (with a max size), and packing multiple files into a single transaction wouldn't be compatible with that restriction.
Here's an example of how this would work:
Let's say I have an album of 128 images from my birthday party and I want to register them in the blockchain.
And let's say their filenames run from 2015-05-19-birthday-1 through 2015-05-19-birthday-128.
To do this, I first register "ryan-may2015" in a namespace for image storage.
Next, I create a JSON file and include that as the value associated with the name (where the JSON file goes into the DHT and the hash of the file is put in the blockchain):
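Here is a sketch of how that file could be built, assuming each subkey simply maps to the corresponding file's SHA-256 hash (the exact layout is an assumption, not a fixed format):

```python
import hashlib
import json

index = {}
for i in range(1, 129):                                   # the 128 birthday photos
    filename = "2015-05-19-birthday-{}".format(i)
    with open(filename, "rb") as f:
        index["birthday-{}".format(i)] = hashlib.sha256(f.read()).hexdigest()

index_json = json.dumps(index, sort_keys=True)
index_hash = hashlib.sha256(index_json.encode()).hexdigest()
# index_json goes into the DHT; index_hash is the value that ends up associated
# with "ryan-may2015" in the blockchain.
```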
Last, I make sure to store my files in my personal content-addressed fileserver.
That means my file 2015-05-19-birthday-1 would be stored on that fileserver under its content hash. Then, I can perform a lookup on ryan-may2015.birthday-1 and I would get back the raw file.
This new blockstore lookup would do the following:
1. Look up ryan-may2015 in the image-storage namespace and get the hash recorded for it in the blockchain.
2. Fetch the corresponding JSON file from the DHT and check it against that hash.
3. Find the entry for the subkey birthday-1 and read off the file's hash.
4. Fetch the file from my content-addressed fileserver by that hash and verify the hash matches.
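Roughly, in Python, with the blockchain resolver, the DHT, and my fileserver passed in as hypothetical callables (and assuming hex SHA-256 hashes throughout):

```python
import hashlib
import json

def fetch_item(namespace_lookup, dht_get, fileserver_get, namespace, name, subkey):
    index_hash = namespace_lookup(namespace, name)       # "ryan-may2015" -> hash in the blockchain
    index_json = dht_get(index_hash)                      # JSON index stored in the DHT
    if hashlib.sha256(index_json.encode()).hexdigest() != index_hash:
        raise ValueError("DHT data does not match the hash in the blockchain")
    file_hash = json.loads(index_json)[subkey]            # "birthday-1" -> the file's hash
    data = fileserver_get(file_hash)                      # fetch the raw file by its hash
    if hashlib.sha256(data).hexdigest() != file_hash:
        raise ValueError("file does not match its registered hash")
    return data
```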
Yay, I just registered an entire photo album in the blockchain!
I think we could realistically store 100-1K files in each transaction, and maybe 10K if we pushed the data limits of the DHT.
Thoughts @denisnazarov @muneeb-ali @jessewalden @moudy?