storacha / w3up

⁂ w3up protocol implementation
https://github.com/storacha-network/specs

DAG Replicator use case #154

Open · Gozala opened this issue 1 year ago

Gozala commented 1 year ago

Creating an issue based on an offline interaction with Daniel, who is trying to build an OrbitDB-like sync protocol on top of web3.storage.

Quoting the key points so we can continue the discussion in the open.

I saw this issue https://github.com/web3-storage/w3protocol/issues/150 and it sounds like something I need in order to build a persistent replicator that uses web3.storage and w3name to keep replicas in sync.

I'm looking for something that allows updating or pushing more pieces to a larger CAR file so I don't have to push entire replicas after every update.

It would be nice to be able to request just the shards as well, but that's probably not going to happen, which isn't a problem for now since I can use IPFS with web3.storage. At that point it's almost a janky version of pinning and pulling IPLD DAGs.

Gozala commented 1 year ago

I'm looking for something that allows updating or pushing more pieces to a larger CAR file so I don't have to push entire replicas after every update.

What's the motivation for this? Specifically, why do you need all of the pieces to be in the same CAR?

Mutating CARs is not something we're considering. However, we do have a few things that may address your needs differently:

  1. store/add now has an optional origin field that you can point at a previous shard of the DAG, so basically a CAR linking to the preceding pieces.
    • Obviously you may also consider linking from a block in the CAR itself. I'd guess this might be more appropriate for your use case. You may want to do both, however; that way our service will know about the relationship and might do more with it in the future.
  2. The upload/add operation allows you to publish a DAG root with optional shards pointing to all the CARs its blocks are in.

With the above two I suspect you have everything you need (a rough sketch of both payloads follows below), but please follow up so we can make sure.
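
To make that concrete, here is a minimal sketch of the two invocation payloads, assuming the nb field shapes described above (link/size/origin for store/add, root/shards for upload/add). The space DID, CIDs and sizes are placeholders, and the actual client API for sending these invocations is not shown:

```ts
import { CID } from 'multiformats/cid'

// Sketch of a store/add payload: registers one CAR shard, optionally
// linking back to the preceding shard via `origin`.
function storeAddPayload (space: string, shard: CID, size: number, origin?: CID) {
  return {
    can: 'store/add',
    with: space, // DID of the space to store under (placeholder)
    nb: { link: shard, size, origin }
  }
}

// Sketch of an upload/add payload: publishes the DAG root together with
// every CAR shard its blocks live in.
function uploadAddPayload (space: string, root: CID, shards: CID[]) {
  return {
    can: 'upload/add',
    with: space,
    nb: { root, shards }
  }
}
```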

It would be nice to be able to request just the shards as well, but that's probably not going to happen, which isn't a problem for now since I can use IPFS with web3.storage. At that point it's almost a janky version of pinning and pulling IPLD DAGs.

CARs have CIDs and are stored as-is in our system. Technically speaking it should be no problem at all to serve those CARs by their CID from our gateway; that said, I'm not entirely sure it works today. If it doesn't, please create an issue for it and I'm sure we'll be able to accommodate.
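
For illustration, a fetch along these lines ought to work if the gateway serves CAR responses per the IPFS HTTP gateway spec; the w3s.link host and the Accept-header behaviour here are assumptions, not a confirmed feature:

```ts
// Hypothetical: fetch a stored CAR by its CID from the gateway, assuming it
// honours the CAR response media type from the IPFS HTTP gateway spec.
async function fetchCar (carCid: string): Promise<Uint8Array> {
  const res = await fetch(`https://w3s.link/ipfs/${carCid}`, {
    headers: { accept: 'application/vnd.ipld.car' }
  })
  if (!res.ok) throw new Error(`gateway responded ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```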

tabcat commented 1 year ago

What's the motivation for this? Specifically, why do you need all of the pieces to be in the same CAR?

You're right, they don't need to be in the same CAR. I just need a CAR library that allows adding incomplete DAGs, which any CAR library should be able to support since completeness isn't required by the spec. When using the js-car library this was an issue a few months ago.

That is the main piece needed; then everything could be pulled from IPFS, which should work well. :+1:

I didn't know what 'CAR sharding' was, but it sounded like something I might have been able to use.

Gozala commented 1 year ago

You're right, they don't need to be in the same CAR. I just need a CAR library that allows adding incomplete DAGs, which any CAR library should be able to support since completeness isn't required by the spec. When using the js-car library this was an issue a few months ago.

Our client uses js-car, and more specifically CarBufferWriter, which we added so we could allocate a CAR of a certain size, pack it with blocks, and send it to web3.storage.

This works really well with our @ipld/unixfs, which we use to turn files and directories into DAGs, because it emits blocks as soon as they are ready. We put them into the preallocated CAR; once it's full, we send it off and continue with another CAR shard, and so on until no blocks are left.

Each CAR shard links to the previous one, and once all are uploaded, the client sends upload/add with links to all the CARs.
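
Roughly, the packing loop looks like the sketch below, using CarBufferWriter from @ipld/car. It assumes the block source yields { cid, bytes } pairs (as @ipld/unixfs does), that write() throws when the buffer has no room left, and that sendShard stands in for the store/add upload step:

```ts
import * as CarBufferWriter from '@ipld/car/buffer-writer'
import { CID } from 'multiformats/cid'

type Block = { cid: CID, bytes: Uint8Array }

const SHARD_SIZE = 8 * 1024 * 1024 // placeholder shard capacity in bytes

async function shardBlocks (
  blocks: AsyncIterable<Block>,
  sendShard: (car: Uint8Array) => Promise<void> // e.g. a store/add invocation
) {
  let writer = CarBufferWriter.createWriter(new ArrayBuffer(SHARD_SIZE))
  for await (const block of blocks) {
    try {
      writer.write(block)
    } catch {
      // No room left in this shard: close it, send it off, start a new one.
      await sendShard(writer.close({ resize: true }))
      writer = CarBufferWriter.createWriter(new ArrayBuffer(SHARD_SIZE))
      writer.write(block)
    }
  }
  // Flush the final, partially filled shard.
  await sendShard(writer.close({ resize: true }))
}
```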

Gozala commented 1 year ago

I didn't know what 'CAR sharding' was, but it sounded like something I might have been able to use.

We just call partial (filesystem) DAGs "shards"; currently they are encoded as CAR files. We call them shards, as opposed to just CARs, because we may have different representations in the future, and semantically each one is a shard of the DAG.

Gozala commented 1 year ago

In terms of how this would fit opal/orbitdb: I have limited context, so not all of it may make sense.

  1. I imagine that the replica will write local changes as blocks into some CAR buffer; let's call it a "changeset".
  2. Once the "changeset" is ready (which could be because it reaches a certain size, a certain amount of time has passed, or the user stopped interacting with the app; essentially domain-specific logic), you can store/add that changeset.
  3. Each "changeset" would want to link to the previous "changeset" via the origin field, but again, that's not required.
  4. When the replica wants to publish its state, it sends an upload/add request with root pointing to the DAG root CID and shards pointing to all the "changesets" it deems relevant.
    • I suspect that linking to all of the "changesets" may not be optimal here, as the list will only grow, so linking to just the "changesets" added since the previous upload is probably the better option.
    • Please note that upload/add isn't required at all; it just determines what the user will see in the upload list, and maybe that's irrelevant for your use case. All the CIDs inside uploaded CARs will still remain available.
  5. The replica can publish a new IPNS record with the new root of the DAG (the whole flow is sketched below).
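
Put together, the flow might look like this sketch. The w3name calls (Name.v0 / Name.increment / Name.publish) follow the w3name package; buildChangesetCar, storeAdd and uploadAdd are hypothetical placeholders for steps 1, 2 and 4 above, not a real client API:

```ts
import * as Name from 'w3name'
import { CID } from 'multiformats/cid'

interface Changeset { car: Uint8Array, carCid: CID, root: CID }

declare function buildChangesetCar (): Promise<Changeset>                // step 1 (placeholder)
declare function storeAdd (car: Uint8Array, origin?: CID): Promise<void> // step 2 (placeholder)
declare function uploadAdd (root: CID, shards: CID[]): Promise<void>     // step 4 (placeholder)

async function publishChangeset (
  name: Name.WritableName,
  previous?: { revision: Name.Revision, carCid: CID }
) {
  const { car, carCid, root } = await buildChangesetCar()
  // Step 3: link back to the previous changeset via the optional origin field.
  await storeAdd(car, previous?.carCid)
  // Step 4: register the new root with only the shards added since last upload.
  await uploadAdd(root, [carCid])
  // Step 5: publish the new root under the replica's IPNS name.
  const value = `/ipfs/${root}`
  const revision = previous
    ? await Name.increment(previous.revision, value)
    : await Name.v0(name, value)
  await Name.publish(revision, name.key)
  return { revision, carCid }
}
```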

Please note that opal/orbitdb could incorporate CAR CIDs in its data structure so it could get "changesets" in a single roundtrip, or it could completely ignore that and get each block by its CID, regardless of which account / CAR it is in.

Also note that ☝️ describes including only local changesets in the CAR, on the assumption that remote changes are stored by their authors. That said, it's certainly possible to include those changes to ensure you have a copy even if the author deletes them.

Finally, if your protocol / data structure uses CAR CIDs, you could store those CARs in your account without re-uploading them if we already have that CAR; we just add it to your account and bill you accordingly.

tabcat commented 1 year ago

This sounds good :+1:

Please note that opal/orbitdb could incorporate CAR CIDs in its data structure

I'll have to think about this one some more. I may not incorporate CAR CIDs in the base data structure, but the replicator might try to do something like this.

Also note that ☝️ describes including only local changesets in the CAR, on the assumption that remote changes are stored by their authors. That said, it's certainly possible to include those changes to ensure you have a copy even if the author deletes them.

Remote and local changes would be included; anything that has been added to the local replica should be available under the IPNS record for that peer>database.

I'll share anything related that I make here. Hopefully it won't be long...

Gozala commented 1 year ago

Remote and local changes would be included; anything that has been added to the local replica should be available under the IPNS record for that peer>database.

Well, even if you don't include remote changes in the CAR, your DAG will still link to them, so publishing the root to IPNS technically includes those changes. That said, those blocks may or may not be reachable if you don't save them in your account, so including them might be the better choice.

tabcat commented 1 year ago

Right, I meant to say that they would [need to] be included in the CAR for that reason.