mafintosh / hyperdb

Distributed scalable database
MIT License

The definitive replication and authorization guide #153

Open mjp0 opened 5 years ago

mjp0 commented 5 years ago

I think we need to hash out and clarify the replication and authorization processes. I have been struggling with this for many days now and, judging by the other issues here, I'm not alone, so I'm hoping we can use this issue to clear things up.

After reading the documentation and the tests, and going over the code and the issues, I still can't get this right, so I don't think I'm far off in saying that the replication process is not intuitive to implement. There are many details to grok, so let's go over it one step at a time. The biggest issue seems to be that it's not clear how the hyperdbs need to be set up for replication to work.

Here are two scenarios that I want to figure out, but it seems I can't.

Scenario 1: I have hyperdb 1 that I want to read & write with hyperdb 2

My logic:

  1. Create 2nd hyperdb with 1st hyperdb's local.key
  2. Authorize 2nd hyperdb to write 1st hyperdb
  3. Create replicate() streams from both hyperdbs
  4. Create socket connection between machines and do stream_1st.pipe(socket) and socket.pipe(stream_2nd)

That's the structure I got from the docs etc. but it doesn't seem to work. There's the issue with "first hypercore must be the same" which I guess means I have to create 2nd hyperdb with hyperdb(storage, hyperdb_1st.local.key). With that, I can see the connection happening, but nothing gets replicated. What steps are missing here?

Scenario 2: I want to replicate another hyperdb without writing

My logic:

  1. Create own hyperdb with remote hyperdb's local.key?
  2. Create replicate({ live: true }) stream for my hyperdb
  3. Connect to remote hyperdb's socket
  4. Do socket.pipe(my_hyperdb_stream)

With this I don't get any errors, I see data going over the socket, and I can see the data in the remote hyperdb, but nothing shows up in my hyperdb. It's a bit like the replication doesn't start for some reason.
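One thing worth checking in the steps above is the wiring in step 4: socket.pipe(my_hyperdb_stream) only carries data in one direction, while replication streams are duplex and are normally piped both ways (stream.pipe(socket).pipe(stream)). The toy model below is a sketch of why one-way wiring can stall even though bytes are visible on the socket. It is not the hypercore wire protocol; `ToyPeer`, `link` and the "want"/"have" messages are invented purely for illustration.

```javascript
// Toy model only: this is NOT the hypercore wire protocol.
// Each peer answers a "want" message with the entries it holds.
class ToyPeer {
  constructor(name, entries = []) {
    this.name = name
    this.entries = [...entries]
    this.send = () => {} // no-op until an outgoing link is wired up
  }

  // Ask the other side for its data.
  requestSync() {
    this.send({ type: "want" })
  }

  // Handle an incoming message from the other side.
  receive(msg) {
    if (msg.type === "want") this.send({ type: "have", entries: this.entries })
    if (msg.type === "have") this.entries.push(...msg.entries)
  }
}

// Wire an outgoing link from one peer to another.
function link(from, to) {
  from.send = (msg) => to.receive(msg)
}

// One-way wiring: the remote can talk to us, but our request never reaches it.
const remoteA = new ToyPeer("remote", ["/test"])
const localA = new ToyPeer("local")
link(remoteA, localA)
localA.requestSync()
const oneWayResult = localA.entries.length

// Two-way wiring: the request goes out and the answer comes back.
const remoteB = new ToyPeer("remote", ["/test"])
const localB = new ToyPeer("local")
link(remoteB, localB)
link(localB, remoteB)
localB.requestSync()
const twoWayResult = localB.entries.length

console.log({ oneWayResult, twoWayResult }) // { oneWayResult: 0, twoWayResult: 1 }
```

With only one direction wired, the local peer's request is dropped and nothing is ever copied; with both directions wired, the entry arrives.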

pfrazee commented 5 years ago

@0fork I won't be able to help on this, but I do want to mention that hyperdb, multiwriter, and the networking stack are actively being worked on, so we do know there are problems and we're working to improve these flows.

mjp0 commented 5 years ago

Here's some code that's supposed to create hyperdb1 and clone it to hyperdb2 via @hyperswarm/network. As you can see from running it, the sockets are receiving data but no writes are being passed from hyperdb1 to hyperdb2.

What am I missing here?

const hyperdb = require("hyperdb")
const network = require("@hyperswarm/network")
const cr = require("crypto")

// this is meant to "simulate" two separate servers so two networks
const net1 = network()
const net2 = network()

const $key = "194e841187f33843b246a796f8a6aceb0d8d5d22b36661e8500b4d693a31e7e5"
const id = cr.createHash("sha256").update($key).digest()
net1.discovery.holepunchable((err, yes) => {
  if (err || !yes) {
    console.log("no hole")
    process.exit()
  }
})

const db1 = hyperdb(`./_test1`, $key, { valueEncoding: "utf-8" })
let db2
let rep1
let rep2
db1.ready(() => {
  console.log("hyper1 created")
  rep1 = db1.replicate({ live: true })
  net1.join(id, {
    lookup: true,
    announce: true,
  })
  net1.on("connection", (socket) => {
    // this is supposed to "push" so piping from
    // rep1 to socket (rep2) to rep1
    rep1.pipe(socket).pipe(rep1).on("end", function() {
      console.log("socket1 pipe end")
    })
    socket.on("data", (data) => {
      console.log("socket1 got data", data)
    })
  })
  db2 = hyperdb(`./_test2`, $key, { valueEncoding: "utf-8" })
  db2.ready(() => {
    console.log("hyper2 created")
    rep2 = db2.replicate({ live: true })
    net2.join(id, {
      lookup: true,
      announce: true,
    })
    net2.on("connection", (socket) => {
      // this is supposed to replicate so piping from
      // socket (rep1) to rep2 to socket (rep1)
      socket.pipe(rep2).pipe(socket).on("end", function() {
        console.log("socket2 pipe end")
      })
      socket.on("data", (data) => {
        console.log("socket2 got data", data)
      })
    })
    db2.watch("/test", (err, data) => {
      console.log("socket2 /test", data)
    })
  })
})

setInterval(function() {
  db1.put("/test", "test", () => {
    db1.list((err, list) => {
      console.log("1", list)
    })
    db2.list((err, list) => {
      console.log("2", list)
    })
  })
}, 3000)

pfrazee commented 5 years ago

@0fork Your broader point about needing a guide is on point. I debugged your script and was only able to do so because I know about some gotchas.

I made a few changes but there were only two that mattered:

  1. I changed the use of @hyperswarm/network so that only one side announces and the other side does the lookup. That's because hyperswarm doesn't yet have connection deduplication built in, so you were getting more connections than you need. We're either going to build dedup into the code, or we'll put that pattern in the readme once we've got one written.
  2. You were providing a public key to both hyperdb instances, but that will only work if you already have the matching private key. Not supplying the public key to the first instance solved that: if the archive already exists it'll load the key from disk, and if it doesn't already exist it'll mint a new keypair.
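The "load the key from disk, otherwise mint a new keypair" behaviour described in point 2 can be sketched with Node's built-in modules. This is only an illustration of the pattern, not hyperdb's or hypercore's storage code: the helper name `loadOrCreateKeyPair`, the file names, and the PEM encoding are all assumptions made for the example.

```javascript
const cr = require("crypto")
const fs = require("fs")
const os = require("os")
const path = require("path")

// Hypothetical helper: load a keypair from `dir`, or mint and persist a new one.
function loadOrCreateKeyPair(dir) {
  const publicPath = path.join(dir, "key.pem") // file names are assumptions
  const secretPath = path.join(dir, "secret_key.pem")

  // Existing storage: reuse whatever keypair is already on disk.
  if (fs.existsSync(publicPath) && fs.existsSync(secretPath)) {
    return {
      created: false,
      publicKey: fs.readFileSync(publicPath, "utf8"),
      secretKey: fs.readFileSync(secretPath, "utf8"),
    }
  }

  // Fresh storage: mint a new keypair and write both halves.
  const pair = cr.generateKeyPairSync("ed25519", {
    publicKeyEncoding: { type: "spki", format: "pem" },
    privateKeyEncoding: { type: "pkcs8", format: "pem" },
  })
  fs.mkdirSync(dir, { recursive: true })
  fs.writeFileSync(publicPath, pair.publicKey)
  fs.writeFileSync(secretPath, pair.privateKey, { mode: 0o600 })
  return { created: true, publicKey: pair.publicKey, secretKey: pair.privateKey }
}

// Opening the same storage twice yields the same public key.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "keypair-sketch-"))
const first = loadOrCreateKeyPair(dir)
const second = loadOrCreateKeyPair(dir)
console.log(first.created, second.created, first.publicKey === second.publicKey)
```

The point of the pattern is that the creator of a database never needs to be handed a key: the first open mints it, and later opens of the same storage find it again.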

Here's the fixed snippet:

const pump = require("pump")
const hyperdb = require("hyperdb")
const network = require("@hyperswarm/network")
const cr = require("crypto")

// this is meant to "simulate" two separate servers so two networks
const net1 = network()
const net2 = network()

const db1 = hyperdb(`./_test1`, { valueEncoding: "utf-8" })
let db2
db1.ready(() => {
  console.log("hyper1 created")

  console.log('swarming')
  const $key = db1.key
  const id = cr.createHash("sha256").update($key).digest()
  net1.discovery.holepunchable((err, yes) => {
    if (err || !yes) {
      console.log("no hole")
      process.exit()
    }
  })

  net1.join(id, {
    lookup: false,
    announce: true,
  })
  net1.on("connection", (socket) => {
    console.log('net1 got connection')
    // this is supposed to "push" so piping from
    // rep1 to socket (rep2) to rep1
    var rep = db1.replicate({ live: true })
    pump(rep, socket, rep, function() {
      console.log("socket1 pipe end")
    })
    socket.on("data", (data) => {
      console.log("socket1 got data", data)
    })
  })
  db2 = hyperdb(`./_test2`, $key, { valueEncoding: "utf-8" })
  db2.ready(() => {
    console.log("hyper2 created")
    net2.join(id, {
      lookup: true,
      announce: false,
    })
    net2.on("connection", (socket) => {
      console.log('net2 got connection')
      // this is supposed to replicate so piping from
      // socket (rep1) to rep2 to socket (rep1)
      var rep = db2.replicate({ live: true })
      pump(rep, socket, rep, function() {
        console.log("socket2 pipe end")
      })
      socket.on("data", (data) => {
        console.log("socket2 got data", data)
      })
    })
    db2.watch("/test", (err, data) => {
      console.log("socket2 /test", data)
    })
  })
})

setInterval(function() {
  db1.put("/test", "test", () => {
    db1.list((err, list) => {
      console.log("1", list)
    })
    db2.list((err, list) => {
      console.log("2", list)
    })
  })
}, 3000)

mjp0 commented 5 years ago

@pfrazee aaah, thank you! I knew it was some small mistake in the configuration instead of a bug in the library code.

Authorization seems to work just by adding db1.authorize(Buffer.from(db2.local.key, "hex"), () => {}) in db2.ready(() => { ... }) so no problems there.

So to summarize: the biggest gotcha seems to be supplying the key pair correctly. The problem for me was that everything seemed to initialize without errors, so I assumed hyperdb had everything it needed. I'm still a bit unclear on why hyperdb1 writes worked if only the public key was supplied: doesn't hypercore sign each chunk with the secret key, to be verified with the public key?
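The sign-with-secret-key, verify-with-public-key mechanism behind that question can be sketched with Node's built-in crypto module. This is a generic illustration that assumes an Ed25519-style signature scheme; it is not hypercore's signing code, and it does not show which feed hyperdb actually writes to when only a public key is supplied.

```javascript
const cr = require("crypto")

// Generic Ed25519 illustration; not hypercore's actual signing code.
const author = cr.generateKeyPairSync("ed25519")
const stranger = cr.generateKeyPairSync("ed25519")

// The author signs a block with the secret (private) half of the keypair.
const block = Buffer.from("test")
const signature = cr.sign(null, block, author.privateKey)

// A reader holding only the author's public key can verify the block...
const okValid = cr.verify(null, block, author.publicKey, signature)

// ...and rejects content that was altered after signing...
const okTampered = cr.verify(null, Buffer.from("tampered"), author.publicKey, signature)

// ...or a signature checked against a public key that does not match the signer.
const okOtherKey = cr.verify(null, block, stranger.publicKey, signature)

console.log({ okValid, okTampered, okOtherKey })
```

Only the first check passes. A public key on its own is enough to check data, but a signature only verifies against the public key that matches the secret key used to produce it.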

pfrazee commented 5 years ago

@0fork Yeah, I think the reason it doesn't fail if you supply a public key is to support the use case of db2, where you're joining a pre-existing hyperdb as a second author. I agree that's a footgun though.

reconbot commented 5 years ago

I just want to say this thread is an education. Thank you for doing this out in the open! 👏

lachenmayer commented 5 years ago

Hey folks, I wrote a pretty detailed guide about authorization & replication in hyperdb. Hope it's useful!

philcockfield commented 5 years ago

Much appreciated @pfrazee for the snippet connecting hyperdb to a swarm via hyperswarm - and @lachenmayer for the guide on auth and replication. Both extremely useful...thanks 👏