While https://github.com/holochain-open-dev/file-storage exists and is a decent proof of concept, it is a little too focused on storing files specifically. We need something that stores binary blobs, which could serve as a basis for storing UI components, larger pieces of data in the space zome, or files.
blob - the complete binary data we want to store
sub-blob - a chunk of the binary blob
At minimum we need something that stores a sequence of sub-blobs to the DHT in a way that makes it easy to verify the data was properly reconstructed.
At the least we need to store the size, a SHA-256 hash of the whole blob, and an ordered list of sub-blob hashes.
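A minimal TypeScript sketch of that header, with hypothetical field names:

```ts
// Hypothetical shape of the minimal blob header: total size, a SHA-256
// of the complete blob, and the ordered sub-blob hashes needed to
// fetch and reassemble it.
interface BlobManifest {
  size: number;                // total byte length of the blob
  sha256: Uint8Array;          // SHA-256 digest of the complete blob
  subBlobHashes: Uint8Array[]; // ordered content hashes of each sub-blob
}
```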
At most, we create a modified Merkle tree as described in Haber & Stornetta (1997), with a header that describes the size of the data and points to the Merkle tree whose leaf nodes point to the sub-blobs (since the sub-blobs are content-addressed by hash, they satisfy the requirements of the Merkle tree). This tree is easy to implement in Holochain, since every data type gets a hash for free: we just need a node structure holding two HoloHash entries and can build the tree up from the hashes of the sub-blobs. However, considering how things are stored on the DHT, it might be more efficient to store the whole Merkle tree of hashes as an array, minus the leaf nodes, which can be kept in a separate array. That way, validation of the whole Merkle tree can optionally happen.
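A rough sketch of the array variant, assuming SHA-256 as the node hash (Holochain would hash entries its own way; this only illustrates the layout):

```ts
// Hash helper using Web Crypto (browser or Node 18+).
async function sha256(data: Uint8Array): Promise<Uint8Array> {
  return new Uint8Array(await crypto.subtle.digest("SHA-256", data));
}

// Build the internal levels of a Merkle tree over the sub-blob hashes,
// returned as a flat array (the leaves stay in their own separate array,
// as suggested above). The last element is the root.
async function buildMerkleLevels(leaves: Uint8Array[]): Promise<Uint8Array[]> {
  const nodes: Uint8Array[] = [];
  let level = leaves;
  while (level.length > 1) {
    const next: Uint8Array[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1] ?? left; // duplicate an odd node out
      const combined = new Uint8Array(left.length + right.length);
      combined.set(left, 0);
      combined.set(right, left.length);
      next.push(await sha256(combined));
    }
    nodes.push(...next);
    level = next;
  }
  return nodes; // internal nodes only
}
```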
The SHA-256 hash is probably good enough along with the list of sub-blob hashes. Anything more feels like overkill.
Any additional metadata about what is stored can be provided by the object that links to the root node of the blob.
Is there a good reason why splitting the blob into sub-blobs needs to happen on the client?
Thinking through this, client-side splitting seems like the most reliable way to ensure everything persists without failing. We can't upload the whole blob and save it to the source chain, or to some other scratch area we can access, and then piece it out into sub-blobs, because of size constraints. And even if a large file did fit in memory, we might not be able to store all the sub-blobs to the DHT before the request timed out.
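A minimal sketch of client-side chunking, assuming a hypothetical chunk size and a hypothetical `storeSubBlob` zome call that commits one chunk and returns its hash:

```ts
// Assumption: 256 KiB chunks; the real size should stay well under
// whatever per-entry limit the DHT imposes.
const CHUNK_SIZE = 256 * 1024;

// Slice the blob lazily so the whole file never has to sit in memory
// or travel in a single request.
async function uploadBlob(
  blob: Blob,
  storeSubBlob: (chunk: Uint8Array) => Promise<Uint8Array>
): Promise<Uint8Array[]> {
  const hashes: Uint8Array[] = [];
  for (let offset = 0; offset < blob.size; offset += CHUNK_SIZE) {
    const slice = blob.slice(offset, offset + CHUNK_SIZE);
    hashes.push(await storeSubBlob(new Uint8Array(await slice.arrayBuffer())));
  }
  return hashes; // ordered sub-blob hashes for the manifest
}
```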
If we have time, we should implement a service worker for chunking and uploading blobs, along with a counterpart for fetching the chunks, reconstructing the blob, and verifying the file. For other purposes, like streaming media, we will likely need a more relaxed way of fetching the chunks and appending them to the playback buffer.
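The fetch-side counterpart could look like this sketch, where `fetchSubBlob` is a hypothetical transport call and `BlobManifest` is the header sketched earlier:

```ts
// Pull sub-blobs in order, reassemble, and check the manifest's SHA-256.
async function reconstructBlob(
  manifest: BlobManifest,
  fetchSubBlob: (hash: Uint8Array) => Promise<Uint8Array>
): Promise<Uint8Array> {
  const out = new Uint8Array(manifest.size);
  let offset = 0;
  for (const hash of manifest.subBlobHashes) {
    const chunk = await fetchSubBlob(hash);
    out.set(chunk, offset);
    offset += chunk.length;
  }
  const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", out));
  if (
    digest.length !== manifest.sha256.length ||
    !digest.every((b, i) => b === manifest.sha256[i])
  ) {
    throw new Error("blob failed SHA-256 verification");
  }
  return out;
}
```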
For a streaming video example see:
- https://webtorrent.io/bundle.js (renderMedia)
- https://github.com/webtorrent/webtorrent/blob/master/lib/file.js

For MediaSource docs see:
- https://code.pieces.app/blog/the-media-source-extension-javascript-api-the-foundation-of-streaming-on-the-web
- https://eyevinntechnology.medium.com/how-to-build-your-own-streaming-video-html-player-6ee85d4d078a
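For the streaming case, a rough MediaSource sketch; the codec string is an assumption, `fetchSubBlob` and `BlobManifest` carry over from the sketches above, and the chunks would need to be in a streamable container (e.g. fragmented MP4 or WebM):

```ts
// Append fetched chunks to a SourceBuffer as they arrive instead of
// waiting for the full blob to be reconstructed and verified.
function streamToVideo(
  video: HTMLVideoElement,
  manifest: BlobManifest,
  fetchSubBlob: (hash: Uint8Array) => Promise<Uint8Array>
) {
  const mediaSource = new MediaSource();
  video.src = URL.createObjectURL(mediaSource);
  mediaSource.addEventListener("sourceopen", async () => {
    const buffer = mediaSource.addSourceBuffer('video/webm; codecs="vp9"');
    for (const hash of manifest.subBlobHashes) {
      buffer.appendBuffer(await fetchSubBlob(hash));
      // Wait for this append to finish before queuing the next chunk.
      await new Promise<void>((resolve) =>
        buffer.addEventListener("updateend", () => resolve(), { once: true })
      );
    }
    mediaSource.endOfStream();
  });
}
```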