michaelavila commented 6 years ago

Current work is being tracked here: https://github.com/ipfs/go-ipfs/pull/5870

I'm creating this issue to expose how I'm approaching the provider subsystem and new provider strategies tasks. I will add info as I make progress. If you feel strongly about anything written here, please let me know so we can discuss.

There's a substantial implementation discussion in #5840

There are a handful of related issues: #3813, #4147, #4311. And PRs: #4333, #4113.

If it's crossed out, it's completed.

Provider Subsystem

Problem

The current method for deciding which blocks to provide is naive and results in performance problems when adding many blocks.

Solution

For a given dag of blocks provide only some subset of those blocks.

Steps

These are some very small steps I'm taking to progress towards the end goal. The first few are so small they wouldn't result in PRs on their own, but I want to give some insight into how I'm approaching the problem. The last few are so vaguely defined that they're not actionable; my hope is that they become clearer as we make progress.

~1. Remove bitswap provide~

~The purpose of this step is to remove providing from bitswap and to rely solely on reprovide. This is not optimal.~

2. Introduce provider and provide all strategy, use post add

The purpose of this step is to provide each block added with `ipfs add`, using roughly the same mechanism the reprovider uses. This is not optimal.

[ ] Make providing progress bar (ipfs/go-ipfs#4554)
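
A rough sketch of what step 2 could look like, with CIDs reduced to plain strings. `Strategy`, `ProvideAll`, `Provider`, and `Drain` are hypothetical names for illustration, not the go-ipfs API, and the announce callback stands in for whatever actually sends provide messages.

```go
package main

import "fmt"

// Strategy picks which CIDs of a newly added DAG get announced.
// Hypothetical type, not part of go-ipfs.
type Strategy func(root string, blocks []string) []string

// ProvideAll announces every block of the DAG (blocks includes the root).
func ProvideAll(root string, blocks []string) []string {
	return blocks
}

// Provider queues the CIDs selected by its strategy and hands them
// to an announce callback.
type Provider struct {
	strategy Strategy
	queue    chan string
	announce func(cid string)
}

func NewProvider(s Strategy, announce func(string), size int) *Provider {
	return &Provider{strategy: s, queue: make(chan string, size), announce: announce}
}

// Provide is meant to be called after an add completes. It blocks if the
// queue is full, so size the queue for the expected batch.
func (p *Provider) Provide(root string, blocks []string) {
	for _, c := range p.strategy(root, blocks) {
		p.queue <- c
	}
}

// Drain announces everything currently queued and returns how many
// CIDs were announced.
func (p *Provider) Drain() int {
	n := 0
	for {
		select {
		case c := <-p.queue:
			p.announce(c)
			n++
		default:
			return n
		}
	}
}

func main() {
	var announced []string
	p := NewProvider(ProvideAll, func(c string) { announced = append(announced, c) }, 16)
	p.Provide("root", []string{"root", "child1", "child2"})
	fmt.Println(p.Drain(), announced)
}
```

Keeping the strategy as a plain function means a later, more selective strategy could be swapped in without touching the queueing code.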

~3. Make the provider robust~

~The purpose of this step is to modify provider so that it provides blocks as efficiently as go-bitswap. This is a little closer to optimal.~

[x] ~use something like the logic go-bitswap currently uses to send provide messages~

~I've deferred bringing the provider and reprovider in line with one another until the very end of all this work. There's a bit to think about that I just don't want to tackle yet. Like, how strategies should be articulated in code and to the user given that provide/reprovide logic is different (e.g. reprovide strategies operate over the entire blockstore), and bigger things like, when reproviding, how to account for blocks that were not provided initially so that they aren't reprovided.~

4. Provide all blocks in the appropriate place in `go-ipfs`

additions and modifications

[x] ipfs add
[x] ipfs block put
[x] ipfs dag put
- [ ] deal with dag put not using core api currently
[x] ipfs object new
[x] ipfs object patch
- [x] set-data
- [x] rmlink
- [x] addlink
- [x] append data
[x] ipfs object put
[x] ipfs tar add
- [ ] deal with tar not using core api currently
gatewayHandler
- [x] postHandler
- [x] putHandler
  - [ ] deal with put not using core api currently
- [x] deleteHandler
[x] ipfs pubsub
- I believe the only block that needs to be provided is the one created during ipfs pubsub sub <topic>

retrievals

[x] ipfs get
- [ ] deal with get not using core api currently
[x] ipfs ls
- [ ] deal with ls not using core api currently
[x] ipfs cat
[x] ipfs refs
- [ ] deal with refs not using core api currently
[x] ipfs block get
[x] ipfs block stat
[x] ipfs object get
[x] ipfs object data
[x] ipfs object diff
[x] ipfs object links
[x] ipfs object stat
[x] ipfs dag get
- [ ] deal with dag get not using core api currently
[x] ipfs dag resolve
- [ ] deal with dag resolve not using core api currently
[ ] Only provide when fetching data if the data comes from the network (e.g. ipfs ls a cid that you do not have in your local repo)
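
The last item above could be sketched roughly as follows. `Node`, `Fetcher`, and the `announced` slice are hypothetical stand-ins (for the local blockstore, the network exchange, and the provider respectively), and CIDs are plain strings.

```go
package main

import "fmt"

// Fetcher stands in for the network exchange. Hypothetical.
type Fetcher func(cid string) ([]byte, bool)

// Node holds a local block store and a record of what it announced.
type Node struct {
	local     map[string][]byte
	fetch     Fetcher
	announced []string
}

// Get returns the block for cid. It announces the CID only when the block
// had to be fetched from the network, not when it was already local.
func (n *Node) Get(cid string) ([]byte, bool) {
	if b, ok := n.local[cid]; ok {
		return b, true // local hit: nothing new to announce
	}
	b, ok := n.fetch(cid)
	if !ok {
		return nil, false
	}
	n.local[cid] = b
	n.announced = append(n.announced, cid)
	return b, true
}

func main() {
	network := map[string][]byte{"remote": []byte("r")}
	n := &Node{
		local: map[string][]byte{"have": []byte("h")},
		fetch: func(c string) ([]byte, bool) { b, ok := network[c]; return b, ok },
	}
	n.Get("have")
	n.Get("remote")
	n.Get("remote") // now local, so not announced a second time
	fmt.Println(n.announced)
}
```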

Can possibly merge at this point.

5. Change find providers to account for some cids not being provided

The purpose of this step is to modify the existing find providers system to work in the situation where the block being looked for is not provided. In the process of starting this work, we will need a strategy for providing only a subset in order to exercise the code that looks for providers. We can use a simplistic strategy like pin roots.

[ ] introduce simplistic strategy for not providing (announcing) all cids
[ ] do not get stuck in a loop finding and asking a provider for a provided cid when the non-provided cid you're looking for is not available from that provider
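
One way to avoid the loop described above, as a minimal sketch: remember which peers have already been asked and stop once a lookup round yields no new ones. `Peer`, `FindBlock`, and the `findProviders` callback are hypothetical; the callback stands in for a DHT lookup on a provided ancestor such as a pin root.

```go
package main

import "fmt"

// Peer is a hypothetical provider that can report whether it has a block.
type Peer struct {
	ID     string
	Blocks map[string]bool
}

// FindBlock looks for target, which may not be announced itself, by asking
// providers of an announced ancestor (root). Each peer is asked at most
// once, so lookups that keep returning the same peers cannot loop forever.
func FindBlock(target, root string, findProviders func(cid string) []Peer, maxRounds int) (string, bool) {
	asked := map[string]bool{}
	for round := 0; round < maxRounds; round++ {
		progress := false
		for _, p := range findProviders(root) {
			if asked[p.ID] {
				continue
			}
			asked[p.ID] = true
			progress = true
			if p.Blocks[target] {
				return p.ID, true
			}
		}
		if !progress {
			break // only already-asked peers came back: give up
		}
	}
	return "", false
}

func main() {
	a := Peer{ID: "A", Blocks: map[string]bool{"root": true}}
	b := Peer{ID: "B", Blocks: map[string]bool{"root": true, "child": true}}
	lookup := func(cid string) []Peer { return []Peer{a, b} }
	id, ok := FindBlock("child", "root", lookup, 3)
	fmt.Println(id, ok)
}
```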

6-8. Introduce new, more efficient strategy

The purpose of this step is to introduce a providing strategy that more robustly solves our providing needs. This is optimal. This is likely many steps and will become clearer as we complete work and @Kubuxu completes his research/experiments. We will elaborate on this step in time.

Can possibly merge at this point.

9. Make the provider and reprovider consistent

The purpose of this step is to ensure we don't have inconsistencies between the provider and reprovider, e.g. a programmer not providing blocks for some operation, but those blocks being provided eventually by the reprovider anyway.

[ ] if something is not provided initially, don't reprovide it either
[ ] configure provide and reprovide strategies separately (yes?)
[ ] allow provide/reprovide to be configured with the same set of strategies (maybe?)
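
The first checkbox could be approached by recording what the provider announced and having the reprovider filter on that record. This is only a sketch under that assumption; `Tracker`, `MarkProvided`, and `ReprovideSet` are hypothetical names, and a real version would need persistent storage rather than an in-memory map.

```go
package main

import (
	"fmt"
	"sort"
)

// Tracker records which CIDs the provider announced initially.
// Hypothetical type, not part of go-ipfs.
type Tracker struct {
	provided map[string]bool
}

func NewTracker() *Tracker {
	return &Tracker{provided: map[string]bool{}}
}

// MarkProvided is called by the provider after it announces a CID.
func (t *Tracker) MarkProvided(cid string) {
	t.provided[cid] = true
}

// ReprovideSet filters the blockstore's CIDs down to those that were
// announced initially, so skipped blocks are not reprovided later.
// The result is sorted for stable output.
func (t *Tracker) ReprovideSet(blockstore []string) []string {
	var out []string
	for _, c := range blockstore {
		if t.provided[c] {
			out = append(out, c)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	tr := NewTracker()
	tr.MarkProvided("root")
	fmt.Println(tr.ReprovideSet([]string{"leaf", "root", "mid"}))
}
```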

Branches

cc @keks @Kubuxu

hannahhoward commented 6 years ago

This seems like a good approach because if we can get to step 3, we get a speed boost to all commands that provide blocks, independent of finding a better providing strategy. But also, once we come up with such a strategy, it will be easier to introduce.

eingenito commented 6 years ago

I just want to understand 1 and 2; I don't think we can do a release of this work short of 3. Don't we have to be able to at least maintain the current behavior? So are 1 and 2 internal milestones on a longer-lived branch on the way to a unified mechanism for re/providing? Or would they be turned on with an experiment flag? That seems challenging given how complex the code might be at that intermediate point, but maybe that would actually be a good way to get to 3?

michaelavila commented 6 years ago

@eingenito that's correct. That's why I said: "The first few are so small they wouldn't result in PRs on their own, but I want to give some insight into how I'm approaching the problem."

eingenito commented 6 years ago

Totally makes sense.

magik6k commented 5 years ago

> 1. Remove bitswap provide
>
> The purpose of this step is to remove providing from bitswap and to rely solely on reprovide. This is not optimal.

I assume you plan on at least improving the reprovider? Last time I looked at it, it was barely providing one to a few nodes per minute, which isn't really useful, and relying on it for all bitswap (incoming data) providing is probably a bad idea. We also want to provide relevant data quickly after it has been fetched, so that load can spread over more peers if some piece of content becomes popular.