holepunchto / hyperdrive

Hyperdrive is a secure, real-time distributed file system
Apache License 2.0

Writing files without copying? #274

Closed saranrapjs closed 1 year ago

saranrapjs commented 4 years ago

Hi, I'm a big fan of hyperdrive and hypercore! I've been playing around with the newer hyperdrive APIs and they're really cool.

Is it possible to write a file onto a hyperdrive without copying the underlying bytes into the folders/files the hyperdrive persists to disk? I'm assuming something about how hyperdrive stores bytes in hypercores requires those bytes to be present in the data created by the hyperdrive; I tried symlinking files outside the hyperdrive into the hyperdrive, but this throws an error on read (which I assume is expected).

I'm trying to build something a little closer to the BitTorrent model backed by hyperdrive or hypercore, where some index of chunks is broadcast and replicated amongst peers, but where the bytes represented by those chunks don't need to be copied into the data structure created by hyperdrive (or perhaps hypercore) in order to be replicated. For my use case, needing to duplicate giant files that already reside elsewhere outside of a hyperdrive would be prohibitive.

Thanks for any pointers you might have!

andrewosh commented 4 years ago

Hey @saranrapjs, apologies for being slow to respond. At the moment your data does need to be imported into hypercores, and there's no super simple way around that. But since Hypercore's Merkle trees are stored separately from the data files, it should be possible to generate the trees without storing the data, then offload data loading to a custom loader module.

Hypercore loads data from a random-access-storage interface that just needs to support read, write, and del functions. There are random-access-* modules for different storage backends (in-memory, for example). It should be possible to use a customized module that uses separate logic for reading/writing hypercore's data files. It could be pretty challenging to write this on your own, because you'll have to map Hyperdrive reads/writes into random-access-storage reads/writes -- and the storage module will be reading at offsets in a single data file, which in your case would be a virtual file. Earlier versions of Hyperdrive provided this through a feature called "files as files," but that's not supported in v10.

If you want to take a stab at it, you can pass your custom random-access-storage module in as the first argument to Hyperdrive's constructor.
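To make that concrete, here is a rough, untested sketch of the shape such a custom storage could take, built on the random-access-storage and random-access-file modules. The file-name check and the path to the external file are placeholders (exactly which names correspond to the content feed's data file depends on the Hyperdrive version), so treat this as an outline rather than working code:

const hyperdrive = require('hyperdrive')
const RandomAccessStorage = require('random-access-storage')
const raf = require('random-access-file')
const fs = require('fs')

// Read-only storage backed by an existing file on disk, so reads are
// served from the original file instead of a copy managed by Hyperdrive.
function externalFile (path) {
  let fd = 0
  return new RandomAccessStorage({
    open (req) {
      fs.open(path, 'r', (err, res) => {
        if (err) return req.callback(err)
        fd = res
        req.callback(null)
      })
    },
    read (req) {
      const buf = Buffer.alloc(req.size)
      fs.read(fd, buf, 0, req.size, req.offset, err => req.callback(err, buf))
    },
    close (req) {
      if (!fd) return req.callback(null)
      fs.close(fd, err => req.callback(err))
    }
  })
}

// Route hypercore's data file(s) to the external file and everything else
// (tree, bitfield, signatures, ...) to regular on-disk storage. The name
// check below is a placeholder, as is the path to the large file.
const drive = hyperdrive(function (name) {
  if (name.endsWith('data')) return externalFile('/path/to/largefile')
  return raf('./storage/' + name)
})

The point is that everything except the data file still gets written normally, so the Merkle trees and bitfields can be generated and replicated while the large file stays where it already is.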

okdistribute commented 4 years ago

thanks for that overview @andrewosh!

For reference, dat cli has this, but it's the older version of hyperdrive.

Here is that implementation: https://github.com/datproject/dat-storage

saranrapjs commented 4 years ago

Yeah, thank you for the overview! I'm in the midst of trying something similar with hypercores — passing in a random-access-storage instance that more directly manages where/how the data files get stored — on the theory that there are fewer bells & whistles to account for. For my use case, creating a dedicated, read-only hypercore that builds its metadata and bitfield from data that's never separately persisted to disk, and then storing its key in hyperdrive, may work.

jwerle commented 4 years ago

@saranrapjs though it is not documented, check out the indexing feature of hypercore

here is some untested code that indexes a very large file (generating the SLEEP files and bitfield) and uses the file itself as the data storage

const fs = require('fs')
const hypercore = require('hypercore')
const raf = require('random-access-file')

// With `indexing: true`, hypercore builds the SLEEP files (tree, bitfield,
// signatures) without duplicating the data itself.
const feed = hypercore(createStorage, { indexing: true })

feed.ready(() => {
  // Stream the large file through the feed to generate the index.
  fs.createReadStream('largefile').pipe(feed.createWriteStream())
})

// Map hypercore's 'data' file back onto the original large file; every
// other storage file goes to regular random-access-file storage.
function createStorage(filename) {
  if (filename === 'data') { return raf('largefile') }
  return raf(filename)
}

we have a little module that does this too: https://github.com/little-core-labs/hypercore-indexed-file